<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Syntax Representation in Word Embeddings and Neural Networks - A Survey</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomasz Limisiewicz</string-name>
          <email>limisiewicz@ufal.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Mareček</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University</institution>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>Neural networks trained on natural language processing tasks capture syntax even though it is not provided as a supervision signal. This indicates that syntactic analysis is essential to the understanding of language in artificial intelligence systems. This overview paper covers approaches to evaluating the amount of syntactic information included in the representations of words for different neural network architectures. We mainly summarize research on English monolingual data for language modeling tasks and on multilingual data for neural machine translation systems and multilingual language models. We describe which pre-trained models and representations of language are best suited for transfer to syntactic tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Modern methods of natural language processing (NLP) are
based on complex neural network architectures, where
language units are represented in a metric space [
        <xref ref-type="bibr" rid="ref17 ref21 ref22 ref23 ref9">23, 28, 29,
9, 30</xref>
        ]. Such a phenomenon allows us to express linguistic
features (i.e., morphological, lexical, syntactic)
mathematically.
      </p>
      <p>
        The methods of obtaining such representations and their
interpretations have been described in multiple overview works.
Almeida and Xexéo surveyed different types of static word
embeddings [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and Liu et al. [
        <xref ref-type="bibr" rid="ref12">18</xref>
        ] focused on contextual
representations found in the most recent neural models.
Belinkov and Glass [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] surveyed the strategies of
interpreting latent representations. To the best of our knowledge, we are
the first to focus on the syntactic and morphological
abilities of word representations. We also cover the latest
approaches, which go beyond the interpretation of latent
vectors and analyze the attentions present in
state-of-the-art Transformer models.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Vector Representations of Words</title>
      <p>
        This section introduces several types of architectures that
we will analyze in this work.
In the classical methods of language representation, each
word is assigned a vector regardless of its current context.
In the Latent Semantic Analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the representation was obtained by counting word
frequency across documents on distinct subjects.
      </p>
      <p>
        In more recent approaches, a shallow neural network is
used to predict each word based on context (Word2Vec
[
        <xref ref-type="bibr" rid="ref17">23</xref>
        ]) or approximate the frequency of co-occurrence for a
pair of words (GloVe [
        <xref ref-type="bibr" rid="ref21">28</xref>
        ]). One explanation of the
effectiveness of these algorithms is the distributional hypothesis
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]: "words that occur in the same contexts tend to have
similar meanings".
      </p>
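      <p>To make the use of such vector spaces concrete, the following sketch queries pre-trained static embeddings with the gensim library; the vector file name is a placeholder, and gensim is merely one convenient tool, not necessarily the one used in the cited works.</p>
      <preformat>
# Sketch: querying static word embeddings (word2vec/GloVe format) with gensim.
from gensim.models import KeyedVectors

# placeholder path to any file in word2vec text format
vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# words occurring in similar contexts end up close in the vector space
print(vectors.most_similar("swimming", topn=5))

# the same geometry underlies analogy retrieval: king - man + woman is close to queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
      </preformat>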
      <sec id="sec-2-1">
        <title>Contextual Word Vectors in Recurrent Networks</title>
        <p>The main disadvantage of the static word embeddings is
that they do not take into account the context of words.
This is especially an issue for languages rich in words that
have multiple meanings.</p>
        <p>
          The contextual embeddings introduced in [
          <xref ref-type="bibr" rid="ref22">29</xref>
          ] and [
          <xref ref-type="bibr" rid="ref16">22</xref>
          ]
are able to encode both words and their contexts. They are
based on recurrent neural networks (RNN) and are
typically trained on language modeling or machine translation
tasks using large text corpora. The outputs of the RNN
layers are context-dependent representations that are proven
to perform well when used as inputs for other NLP tasks
with much less training data available.
        </p>
        <p>
          Another improvement of context modeling was possible
thanks to the attention mechanism [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. It allowed passing
the information from the most relevant part of the RNN
encoder, instead of using only the contextual representation
of the last token.
        </p>
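        <p>As a rough illustration (a PyTorch sketch of additive attention, not the exact formulation used in the cited systems), each encoder state is scored against the current decoder state, and the passed-on context is their weighted average:</p>
        <preformat>
# Sketch: additive attention over RNN encoder states.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (seq_len, enc_dim), dec_state: (dec_dim,)
        scores = self.v(torch.tanh(self.w_enc(enc_states) + self.w_dec(dec_state)))
        weights = torch.softmax(scores.squeeze(-1), dim=0)   # one weight per source token
        context = (weights.unsqueeze(-1) * enc_states).sum(dim=0)
        return context, weights
        </preformat>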
      </sec>
      <sec id="sec-2-2">
        <title>Contextual Representation in Transformers</title>
        <p>
          The most recent and widely used architecture is the
Transformer [
          <xref ref-type="bibr" rid="ref25">32</xref>
          ]. It consists of several (6 to 24) layers, and
each token position in each layer has the ability to attend
to any position in the previous layer using a self-attention
mechanism. Training such an architecture can be easily
parallelized since individual tokens can be processed
independently; their positions are encoded within the input
embeddings. An example visualization of the attention
distribution computed in a Transformer trained for language
modeling (BERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]) is presented in Figure 1.
        </p>
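        <p>The attention matrices of a pre-trained model can be inspected directly. The sketch below uses the Hugging Face transformers library, which is one possible tool (not necessarily the one behind Figure 1); it returns, for every layer, a tensor with one weight matrix per head.</p>
        <preformat>
# Sketch: extracting self-attention matrices from a pre-trained BERT model.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The dog chased the cat .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one (batch, heads, seq_len, seq_len) tensor per layer
for layer, attention in enumerate(outputs.attentions, start=1):
    print(layer, tuple(attention.shape))
        </preformat>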
        <p>In addition to vectors, the Transformer includes latent
representations in the form of self-attention weights, which are
two-dimensional matrices. We summarize the research on
the syntactic properties of attention weights in Section 5.
The following subsections describe the metrics used to evaluate
syntactic information captured by the word embeddings and
latent representations.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Syntactic Analogies</title>
        <p>
          In the recent revival of word embeddings [
          <xref ref-type="bibr" rid="ref17 ref21">23, 28</xref>
          ], a strong
focus was put on examining the phenomenon of encoding
analogies in multidimensional space. That is to say, the
shift vector between pairs of analogous words is
approximately constant, e.g., the pairs drinking – drank, swimming
– swam in Figure 2.
        </p>
        <p>
          Syntactic analogies of this type are particularly relevant
for this overview. They include the following relations:
adjective – adverb; singular – plural; adjective –
comparative – superlative; verb – present participle – past
participle. The syntactic analogy is usually evaluated on Google
Analogy Test Set [
          <xref ref-type="bibr" rid="ref17">23</xref>
          ]. 1
        </p>
        <p>1The test set is called syntactic by authors; nevertheless, it mostly
focuses on morphological features.</p>
        <p>An evaluation example consists of two word pairs
represented by the embeddings: (v1, v2), (u1, u2). We compute
the analogy shift vector as the difference between the
embeddings of the first pair: s = v2 − v1. The result is positive if
the nearest word embedding to the vector u1 + s is u2.</p>
        <p>WA = |{(v1, v2, u1, u2) : u2 is the nearest embedding to u1 + s}| / |{(v1, v2, u1, u2)}| (1)</p>
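        <p>A direct implementation of this evaluation over a matrix of unit-normalized embeddings might look as follows (a sketch; following common practice, the three query words are excluded from the candidates):</p>
        <preformat>
# Sketch: syntactic analogy accuracy (WA) over an embedding matrix.
import numpy as np

def analogy_accuracy(emb, vocab, quadruples):
    """emb: (V, d) array of unit-normalized vectors; vocab: word -> row index;
    quadruples: list of (v1, v2, u1, u2) word tuples."""
    index_to_word = {i: w for w, i in vocab.items()}
    correct = 0
    for v1, v2, u1, u2 in quadruples:
        query = emb[vocab[v2]] - emb[vocab[v1]] + emb[vocab[u1]]
        sims = emb @ query                      # cosine similarity for unit vectors
        for w in (v1, v2, u1):                  # exclude the query words themselves
            sims[vocab[w]] = -np.inf
        if index_to_word[int(np.argmax(sims))] == u2:
            correct += 1
    return correct / len(quadruples)
        </preformat>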
        <p>Sequence tagging is a multiclass classification problem.
The aim is to predict the correct tag for each token of a
sequence. A typical example is part of speech (POS)
tagging. The accuracy evaluation is straightforward: the
number of correctly assigned tags is divided by the number of
tokens.</p>
        <p>
          The inference of reasonable syntactic structures from
word representations is the most challenging task
covered in our survey. There are attempts to predict both
dependency [
          <xref ref-type="bibr" rid="ref24 ref7">12, 31, 15, 7</xref>
          ] and constituency trees [
          <xref ref-type="bibr" rid="ref15">21, 13</xref>
          ].
Dependency trees are evaluated using the unlabeled
attachment score (UAS) or its undirected variant (UUAS):
        </p>
        <p>UAS = #correctly_attached_words / #all_words (2)</p>
        <p>The equation for the labeled attachment score (LAS) is the same,
but it additionally requires a correct dependency label for each edge.
For constituency trees, we define precision (P) and recall
(R) over correctly predicted phrases:</p>
        <p>P = #correct_phrases / #predicted_phrases; R = #correct_phrases / #gold_phrases (3)</p>
        <p>Usually, the F1 score is reported, which is the harmonic mean of
precision and recall.</p>
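        <p>A small sketch of how these attachment and phrase scores can be computed from gold and predicted structures:</p>
        <preformat>
# Sketch: unlabeled attachment score and phrase precision/recall/F1.
def uas(gold_heads, pred_heads):
    """Both lists give, for each word, the position of its head."""
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

def phrase_f1(gold_phrases, pred_phrases):
    """Phrases are sets of spans, e.g., (label, start, end) tuples, from the trees."""
    correct = len(gold_phrases.intersection(pred_phrases))
    precision = correct / len(pred_phrases)
    recall = correct / len(gold_phrases)
    return 2 * precision * recall / (precision + recall)
        </preformat>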
      </sec>
      <sec id="sec-2-4">
        <title>Attention’s Dependency Alignment</title>
        <p>
          In Section 5 we describe the examination of syntactic
properties of self-attention matrices. It can be evaluated
using Dependency Alignment [
          <xref ref-type="bibr" rid="ref27">34</xref>
          ] which sums the
attention weights at the positions corresponding to the pairs of
tokens forming a dependency edge in the tree.
        </p>
        <p>DepAl_A = Σ_{(i,j)∈E} A_{i,j} / Σ_{i=1..N} Σ_{j=1..N} A_{i,j} (4)</p>
        <p>
          Dependency Accuracy [
          <xref ref-type="bibr" rid="ref28 ref7">35, 7, 15</xref>
          ] is an alternative
metric; for each dependency label it measures how often the
relation’s governor/dependent is the most attended token
by the dependent/governor.
        </p>
        <p>DepAcc_{l,d,A} = |{(i,j) ∈ E_{l,d} : j = arg max A_{i,·}}| / |E_{l,d}| (5)</p>
        <p>Notation: E is the set of all dependency tree edges and E_{l,d}
is the subset of the edges with label l and direction
d; i.e., in the dependent-to-governor direction, the first element
of the tuple, i, is the dependent of the relation and the second
element, j, is the governor; A is a self-attention matrix and
A_{i,·} denotes the i-th row of the matrix; N is the sequence length.</p>
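        <p>Both attention-based metrics can be computed directly from one head’s attention matrix and the gold edges; a numpy sketch under the notation above:</p>
        <preformat>
# Sketch: dependency alignment (Eq. 4) and dependency accuracy (Eq. 5) for one head.
import numpy as np

def dependency_alignment(A, edges):
    """A: (N, N) attention matrix; edges: list of (i, j) token-index pairs of the tree."""
    return sum(A[i, j] for i, j in edges) / A.sum()

def dependency_accuracy(A, edges_ld):
    """edges_ld: edges of one label and direction, (i, j) = (attending token, its counterpart)."""
    hits = sum(1 for i, j in edges_ld if np.argmax(A[i]) == j)
    return hits / len(edges_ld)
        </preformat>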
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Morphology and Syntax in Word</title>
    </sec>
    <sec id="sec-4">
      <title>Embeddings and Latent Vectors</title>
      <p>In this section, we summarize the research on the syntactic
information captured by vector representations of words.
We devote significant attention to POS tagging, which
is a popular evaluation objective. Even though it is a
morphological task, it is highly relevant to syntactic analysis.</p>
      <sec id="sec-4-1">
        <title>Syntactic Analogies</title>
        <p>
          The first wave of research on the vector representation
of words focused on the statistical distribution of words
across distinct topics – Latent Semantic Analysis [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It
captured statistical properties of words, yet there were no
positive results in syntactic analogy retrieval or in
encoding syntax.
        </p>
        <p>
          Google Analogy Test Set was released together with a
popular word embedding algorithm Word2Vec [
          <xref ref-type="bibr" rid="ref17">23</xref>
          ]. One
of the exceptional properties of this method was its high
accuracy in the analogy tasks. In particular, the best
configuration found the correct syntactic analogy in 68.9% of
cases.
        </p>
        <p>
          The GloVe embeddings improved the results on
syntactic analogies to 69.3% [
          <xref ref-type="bibr" rid="ref21">28</xref>
          ]. A much more significant
improvement was reported for semantic analogies. GloVe also
outperforms a variety of other vectorization methods.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref18">24</xref>
          ], a simple recurrent neural network was trained
with a language modeling objective. The word representations
are taken from the input layer. The evaluation from [
          <xref ref-type="bibr" rid="ref17">23</xref>
          ]
shows that Word2Vec performs better in the syntactic
analogy task. This observation is surprising because
representations from RNNs were proven effective in transfer to
other syntactic tasks (we elaborate on that in Sections 4.2
and 4.3). We think that possible explanations could be: 1.
the techniques of RNN training have crucially improved
in recent years; 2. the syntactic analogy focuses on particular
words, while for other syntactic tasks, the context is more
important.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Part of Speech Tagging</title>
        <p>Measuring to what extent a linguistic feature such as POS
is captured in word representations is usually performed
by a method called probing. In probing, the parameters
of the pretrained network are fixed, the output word
representations are computed as in the inference mode and
then fed to a simple neural layer. Only this simple layer is
optimized for a new task.</p>
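        <p>In its simplest form, such a probe is a single linear layer trained on top of the frozen representations; a PyTorch sketch (dimensions and data handling are placeholders):</p>
        <preformat>
# Sketch: a linear probe for POS tagging on top of frozen word representations.
import torch
import torch.nn as nn

hidden_dim, num_tags = 768, 17                 # placeholder sizes (e.g., 17 UPOS tags)
probe = nn.Linear(hidden_dim, num_tags)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(frozen_states, gold_tags):
    """frozen_states: (num_tokens, hidden_dim) outputs of the pre-trained network,
    computed beforehand with torch.no_grad(); gold_tags: (num_tokens,) tag indices."""
    logits = probe(frozen_states)              # only the probe's parameters are trained
    loss = loss_fn(logits, gold_tags)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
        </preformat>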
        <p>The number of probing experiments rose with the
advent of multilayer 2 RNNs trained for language modeling
and machine translation.</p>
        <p>
          Belinkov et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] probe a recurrent neural machine
translation (NMT) system with four layers to predict part
of speech tags (along with morphological features). They
use Arabic, Hebrew, French, German, and Czech to
English pairs. They observe that adding a character-based
representation computed by a convolutional neural
network to the word-embedding input is beneficial,
especially for morphologically rich languages.
        </p>
        <p>
          In a subsequent study [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], the source language of
translation is English and the experiments are conducted
solely for this language. It is noted that the most
morphosyntactic representation is usually obtained in the
middle layers of the network.
        </p>
        <p>
          The influence of using a particular objective in
pretraining RNN model is comprehensively analyzed by
Blevins et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. They pre-train models on four objectives:
syntactic parsing, semantic role labeling, machine
translation, and language modeling. The two former objectives
may reveal morphosyntactic information to a larger extent
than the other settings mentioned here. In particular, the probe
of the RNN syntactic parser achieves near-perfect accuracy in
part of speech tagging.
        </p>
        <p>
          The introduction of ELMo [
          <xref ref-type="bibr" rid="ref22">29</xref>
          ] brought a remarkable
advancement in transfer learning from the RNN language
model to a variety of other NLP tasks. The authors
examined POS capabilities of the representations and
compared the results with the neural machine translation
system CoVe [
          <xref ref-type="bibr" rid="ref16">22</xref>
          ], which also uses RNN architecture.
        </p>
        <p>
          Zhang et al. [
          <xref ref-type="bibr" rid="ref32">39</xref>
          ] perform further experiments with
CoVe and ELMo. They demonstrate that language
modeling systems are better suited to capture morphology and
syntax in the hidden states than machine translation, if
comparable amounts of data are used to train both systems.
Moreover, the corpora for language modeling are typically
more extensive than for machine translation, which can
further improve the results.
        </p>
        <p>Another comprehensive evaluation of morphological
and syntactic capabilities of language models was
conducted by Liu et al. [17]. Probing was applied to a language
model based on the Transformer architecture (BERT)
and compared with ELMo and static word embeddings
(Word2Vec). They observe that the hidden states of
Transformer do not demonstrate a major increase in probed POS
accuracy over the RNN model, even though it is more
complex and consists of a larger number of parameters.</p>
        <p>
          POS tag probing was also performed for languages other
than English. For instance, Musil [
          <xref ref-type="bibr" rid="ref19">25</xref>
          ] trains translation
systems (with RNN and Transformer architecture) from
Czech to English and examines the learned input
embeddings of the model and compares them to a Word2Vec
model trained on Czech.
        </p>
        <p>2Layer numbering in this work: We are numbering layers starting
from one for the layer closest to the input. Please note that original papers
may use different numbering.</p>
        <p>In Figures 3 and 4, we present a comparison of different
settings for POS tag probing. Each point denotes a pair of
results obtained in the same paper and on the same dataset,
but with different types of embeddings or pretraining
objectives. Therefore, we can observe that the setting plotted
on the y-axis is better than the x-axis setting if the points
are above the identity function (red dashed line). We cannot
compare points coming from different papers, as their
evaluation settings differ.</p>
        <p>Figure 4 clearly shows that the RNN contextualization
helps in part of speech tagging. As expected, the
information about neighboring tokens is essential to predict
morphosyntactic functions of words correctly. It is especially
true for homographs, which can have various parts of
speech in different places in the text.</p>
        <p>
          The influence of RNN’s pre-training task is presented
in Figure 3. Machine translation captures POS information
much better than auto-encoders, which can be interpreted
as translation from and to the same language. It is likely
that the latter task is too straightforward and therefore does
not require encoding morphosyntax in the latent space.
The difference between the results of machine translation
and language modeling is small. Zhang et al. [
          <xref ref-type="bibr" rid="ref32">39</xref>
          ] show
that using a larger corpus for pre-training improves the
POS accuracy. The main advantage of language models is
that monolingual data is much easier to obtain than parallel
sentences necessary to train a machine translation system.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Syntactic Structure Induction</title>
        <p>Extraction of dependency structure is more demanding
because instead of predictions for single tokens, every pair of
words needs to be evaluated.</p>
        <p>
          Blevins et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] propose a feed-forward layer on top
of a frozen RNN representation to predict whether a
dependency tree edge connects a pair of tokens. They
concatenate the vector representation of each of the words and
their element-wise product. Such a representation is fed as
an input to a binary classifier. It only looks at a pair of
tokens at a time; therefore, the predicted edges may not form a
valid tree.
        </p>
        <p>Another approach, inducing whole syntactic
structures from latent representations, was proposed by
Hewitt and Manning [12]. Their syntactic probing is based on
training a matrix which is used to transform the output of
the network’s layers (they use BERT and ELMo). The
objective of the probing is to approximate dependency tree
distances between tokens 3 by the L2 norm of the difference
of the transformed vectors. Probing produces the
approximate syntactic pairwise distances for each pair of tokens.
The minimum spanning tree algorithm is used on the
distance matrix to find the undirected dependency tree. The
best configuration employs the 15th layer of BERT large
and induces trees with 82.5% UAS on the Penn Treebank
with Stanford Dependency annotation (relation directions
and punctuation were disregarded in the experiments). The
result for BERT is significantly higher than for ELMo,
which gave 77.0% when the first layer was probed.</p>
        <p>3Tree distance is the length of the tree path between two tokens.</p>
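        <p>The core of this structural probe is a single trainable matrix B; the squared L2 distances between transformed vectors are trained to match the tree distances. A simplified PyTorch sketch of the objective (not the authors’ exact implementation):</p>
        <preformat>
# Sketch: the distance objective of the structural probe.
import torch
import torch.nn as nn

class DistanceProbe(nn.Module):
    def __init__(self, model_dim, probe_rank):
        super().__init__()
        self.B = nn.Parameter(torch.randn(model_dim, probe_rank) * 0.01)

    def forward(self, hidden_states):
        # hidden_states: (seq_len, model_dim) vectors of one sentence
        transformed = hidden_states @ self.B
        diff = transformed.unsqueeze(0) - transformed.unsqueeze(1)
        return (diff ** 2).sum(dim=-1)          # (seq_len, seq_len) squared distances

def probe_loss(predicted, gold_tree_distances):
    """Mean absolute difference between predicted and gold tree distances."""
    n = predicted.shape[0]
    return torch.abs(predicted - gold_tree_distances).sum() / (n * n)
        </preformat>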
        <p>The paper also describes an alternative method of
approximating the syntactic depth by the L2 norm of
a latent vector multiplied by a trainable matrix. The estimated
depths allow prediction of the root of a sentence with
90.1% accuracy when the representation from the 16th layer
of BERT large is probed.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Multilingual Representations</title>
        <p>
          The subsequent paper by Chi et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] applies the
setting from [12] to the multilingual language model mBERT.
They train syntactic distance probes on 11 languages and
compare UAS of the induced trees in four scenarios: 1.
training and evaluating on the same language; 2. training on
a single language, evaluating on a different one; 3.
training on all languages except the evaluation one; 4.
training on all languages, including the evaluation one. They
demonstrate that the transfer is effective, as the results in all
the configurations outperform the baselines 4. Even in the
hardest case – zero-shot transfer from just one language –
the result is at least 6.9 percentage points above the
baselines (for Chinese). Nevertheless, for all the languages, no
transfer-learning setting can beat training and
evaluating the probe on the same language.
        </p>
        <p>4There are two baselines: a right-branching tree and probing on a
randomly initialized mBERT without pretraining.</p>
        <p>The paper includes analysis of intrinsic features of the
BERT’s vectors transformed by a probe. Noticeably, the
vector differences between the representations of words
connected by a dependency relation are clustered by relation
labels, see Figure 5.</p>
        <p>
          Multilingual BERT embeddings are also analyzed by
Wang et al. [
          <xref ref-type="bibr" rid="ref29">36</xref>
          ]. They show that even for the multilingual
vectors, the results can be improved by projecting vector
spaces across languages. They use Biaffine Graph-based
Parser by Dozat and Manning [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], which consists of
multiple RNN layers. Therefore, the experiment is not strictly
comparable with probing, as most of the syntactic
information is captured by the parser and not by the embeddings.
The article compares different types of vector
representations fed as an input to the parser. It is demonstrated that
a cross-lingual transformation of mBERT embeddings
significantly improves the LAS of a parser trained
on English and evaluated on 14 languages (including
English); on average, from 60.53% to 63.54%. In
comparison to other cross-lingual representations, the proposed
method outperforms transformed static embeddings
(FastText with SVD) and also slightly outperforms contextual
embeddings (XLM).
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Syntax in Transformer’s Attention</title>
    </sec>
    <sec id="sec-6">
      <title>Matrices</title>
      <p>
        Besides the vector representations of individual tokens,
the Transformer architecture offers another representation
with a possible syntactic interpretation – the weights of the
self-attention heads. In each head, information can flow
from each token to any other one. These connections may
be easily analyzed and compared to syntactic relations
proposed by linguists. In this section, we will summarize
different approaches to extracting syntax from attention. We
present the methods both for dependency and constituency
structures.
Raganato and Tiedemann [
        <xref ref-type="bibr" rid="ref24">31</xref>
        ] induce dependency trees
from self-attention matrices of a neural machine
translation encoder. They use the maximum spanning tree
algorithm to connect pairs of tokens with high attention. Gold
root information is used to find the direction of the edges.
Trees extracted in this way are generally worse than the
right-branching baseline (35.08% UAS on PUD) and
outperform it slightly in a few heads. The maximum UAS
is obtained when a dependency structure is induced from
one head of the 5th layer of the English-to-Chinese encoder
– 38.87% UAS. Nevertheless, their approach assumes that
the whole syntactic tree may be induced from just one
attention head.
      </p>
      <p>
        Recent articles focused on the analysis of features and
classification of Transformer’s self-attention heads. Vig
and Belinkov [
        <xref ref-type="bibr" rid="ref27">34</xref>
        ] apply multiple metrics to examine
properties of attention matrices computed in a unidirectional
language model (GPT-2 [
        <xref ref-type="bibr" rid="ref23">30</xref>
        ]). They showed that in some
heads, the attentions concentrate on tokens representing
specific POS tags, and that pairs of tokens attend to one
another more often if an edge in the dependency tree
connects them, i.e., the dependency alignment is high. They
observe that the strongest dependency alignment occurs in
the middle layers of the model – 4th and 5th. They also
point out that different dependency types (labels) are captured
in different places of the model. Attention in the upper
layers aligns more with subject relations, whereas in the lower
layers it aligns with modifying relations, such as auxiliaries,
determiners, conjunctions, and expletives.</p>
        <p>
          Voita et al. [
          <xref ref-type="bibr" rid="ref28">35</xref>
          ] also observed alignment with
dependency relations in the encoders of neural machine
translation systems from English to Russian, German, or French.
They have evaluated dependency accuracy for four
dependency labels: noun subject, direct object, adjective
modifier, and adverbial modifier. They separately address the
cases where a verb attends to its dependent subject, and where a
subject attends to its governing verb. The heads with more than
10% improvement over a positional baseline are identified
as syntactic 6. Such heads are found in all encoder
layers except the first one. In further experiments, the authors
propose the algorithm to prune the heads from the model
with a minimal decrease in translation performance.
During pruning, the share of syntactic heads rises from 17%
in the original model to 40% when 75% of the heads are cut out,
while a change in translation score is negligible. These
results support the claim that the model’s ability to
capture syntax is essential to its performance in non-syntactic
tasks.
        </p>
        <p>
          A similar evaluation of dependency accuracy for the
BERT language model was conducted by Clark et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>5A head is syntactic when the tree extracted from it surpasses the
right-branching chain in terms of UAS. It is a strong baseline for syntactic
trees in English. Thus only a few heads are recognized as syntactic.</p>
        <p>6In the positional baseline, the most frequent offset is added to the
index of relation’s dependent/governor to find its governor/dependent, e.g.,
for adjective-to-noun relations the most frequent offset is +1 in English.</p>
        <p>They identify syntactic heads that significantly outperform
positional baseline for the following labels: prepositional
object, determiner, direct object, possession modifier,
auxiliary passive, clausal component, marker, phrasal verb
particle. The syntactic heads are found in the middle layers
(4th to 8th). However, there is no single head that would
capture the information for all the relations.</p>
        <p>
          In another experiment, Clark et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] induce a
dependency tree from attentions. Instead of extracting structure
from each head [
          <xref ref-type="bibr" rid="ref24">31</xref>
          ] they use probing to find the weighted
average of all heads. The maximum spanning tree
algorithm is used to induce the dependency structure from the
average. This approach produces trees with 61% UAS and
can be improved to 77% by making weights dependent on
the static word representation (fixed GloVe vectors). Both
numbers are significantly higher than the right-branching
baseline of 27%.
        </p>
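        <p>In either variant, the tree extraction itself reduces to running a maximum spanning tree algorithm over an attention-derived score matrix; a sketch with scipy, ignoring edge directions as in the UUAS evaluation (the symmetrization and the use of scipy are simplifying assumptions):</p>
        <preformat>
# Sketch: inducing an undirected tree from (averaged) attention weights.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def attention_tree(A):
    """A: (N, N) matrix of strictly positive attention scores, e.g., averaged over heads."""
    scores = A + A.T                    # symmetrize: ignore the direction of attending
    # a maximum spanning tree is a minimum spanning tree over negated weights
    mst = minimum_spanning_tree(-scores).toarray()
    n = len(A)
    return [(i, j) for i in range(n) for j in range(n) if mst[i, j] != 0]
        </preformat>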
        <p>A related analysis for English (BERT) and the
multilingual variant (mBERT) was conducted by Limisiewicz et
al. [15]. We have observed that the information about one
dependency type is split across many self-attention heads,
and in other cases the opposite happens – many heads have
the same syntactic function. We extract labeled
dependency trees from the averaged heads, achieving 52%
UAS, and show that in the multilingual model (mBERT)
specific relations (noun subject, determiner) are found in
the same heads across typologically similar languages.</p>
        <sec id="sec-6-6-1">
          <title>Constituency trees</title>
          <p>There are fewer papers devoted to deriving constituency
syntax tree structures.</p>
          <p>
            Mareček and Rosa [
            <xref ref-type="bibr" rid="ref15">21</xref>
            ] examined the encoder of the
machine translation system for translation between
English, French, and German. We observed that in some
heads, stretches of words attend to the same token,
forming shapes similar to balustrades (Figure 7). Furthermore,
those stretches usually overlap with syntactic phrases. This
notion is employed in a new method for constituency tree
induction. In their algorithm, the weights for each stretch
of tokens are computed by summing the attention focused
on the balustrades, and a constituency tree is then induced
with the CKY algorithm [
            <xref ref-type="bibr" rid="ref20">26</xref>
            ]. As a result, we produce trees
that achieve up to 32.8% F1 score for English sentences,
43.6% for German, and 44.2% for French. 7 The results can
be improved by selecting syntactic heads and using only
them in the algorithm. This approach requires a sample of
100 annotated sentences for head selection and raises F1
by up to 8.10 percentage points in English.
          </p>
          <p>7The evaluation was done on 1000 sentences for each language,
parsed with the supervised Stanford Parser.</p>
          <p>
            The extraction of constituency trees from language
models was described by Kim et al. [13]. They present
a comprehensive study that covers nine types of
pretrained networks: BERT (base, large), GPT-2 [
            <xref ref-type="bibr" rid="ref23">30</xref>
            ]
(original, medium), RoBERTa [
            <xref ref-type="bibr" rid="ref13">19</xref>
            ] (base, large), XLNet [
            <xref ref-type="bibr" rid="ref31">38</xref>
            ]
(base, large). Their approach is based on computing the
distance between each pair of subsequent words. In each step,
they branch the tree at the place where the distance
is the highest. The authors try three distance measures on
the vector outputs of the encoder layer (cosine, L1, and L2
distances for pairs of vectors) and two distance measures
on the distributions of token’s attention (Jensen-Shannon
and Hellinger distances for pairs of distributions). In the
former case, distances are computed only per layer, and in
the latter case for each head and for the average of heads in one
layer. The best setting achieves 40.1% F1 score on the WSJ
Penn Treebank. It uses XLNet-base and Hellinger distance
on averaged attentions in the 7th layer. Generally, attention
distribution distances perform better than vector ones. The
authors also observe that models trained on the regular language
modeling objective (i.e., next word prediction in GPT-2 and
XLNet) capture syntax better than masked language models
(BERT, RoBERTa). In line with the previous research, the
middle layers tend to be more syntactic.
          </p>
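          <p>The induction procedure itself is a simple recursive split at the largest distance between adjacent words; a sketch, independent of how the distances are obtained:</p>
          <preformat>
# Sketch: top-down constituency tree induction from distances between adjacent words.
def build_tree(words, dists):
    """words: list of tokens; dists[i] is the distance between words[i] and words[i + 1]."""
    if len(words) == 0:
        return None
    if len(words) == 1:
        return words[0]
    split = max(range(len(dists)), key=lambda i: dists[i])   # largest gap
    left = build_tree(words[:split + 1], dists[:split])
    right = build_tree(words[split + 1:], dists[split + 1:])
    return (left, right)
          </preformat>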
          <p>Figure 8 summarizes the evaluation of syntactic
information across layers for different approaches. In
Transformer-based language models (BERT, mBERT, and GPT-2), the
middle layers are the most syntactic. In neural machine
translation models, the top layers of the encoder are the
most syntactic. However, it is important to note that the
NMT Transformer encoder is only the first half of the
whole translation architecture, and therefore the most
syntactic layers are, in fact, in the middle of the process. In
the RNN language model (ELMo), the first layer is more
syntactic than the second one.</p>
          <p>We conjecture that the initial Transformer’s layers
capture simple relations (e.g., attending to next or previous
tokens) and the last layers mostly capture task-specific
information. Therefore, they are less syntactic.</p>
          <p>
            We also observe that in supervised probing [
            <xref ref-type="bibr" rid="ref6">12, 6</xref>
            ],
better results are obtained from initial and top layers than in
unsupervised structure induction [
            <xref ref-type="bibr" rid="ref24">31, 15</xref>
            ], i.e., the
distribution across layers is smoother.
          </p>
        </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this overview, we have surveyed how syntactic structures are
latently learned by neural models for natural language
processing tasks. We have compared multiple approaches
from the literature and described the features that affect the ability to
capture syntax. The following aspects tend to improve
the performance on syntactic tasks such as POS tagging:
1. Using contextual embeddings from RNNs or
Transformers outperforms static word embeddings
(Word2Vec, GloVe).
2. Pretraining on tasks with masked input (language
modeling or machine translation) produces better
syntactic representations than auto-encoding.
3. The advantage of language modeling over machine
translation is the fact that larger corpora are available
for pretraining.</p>
      <p>Our meta-analysis of latent states showed that the most
syntactic representation could be found in the middle
layers of the model. They tend to capture more complex
relations than initial layers, and the representations are less
dependent on the pretraining objectives than in the top
layers.</p>
      <p>We have shown to what extent systems trained for a
non-syntactic task can learn grammatical structures. The
question we leave for further research is whether providing
explicit syntactic information to the model can improve its
performance on other NLP tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work has been supported by the grant 18-02196S of
the Czech Science Foundation. It has been using language
resources and tools developed, stored and distributed by
the LINDAT/CLARIAH-CZ project of the Ministry of
Education, Youth and Sports of the Czech Republic (project
LM2018101).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Felipe</given-names>
            <surname>Almeida</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geraldo</given-names>
            <surname>Xexéo</surname>
          </string-name>
          .
          <article-title>Word embeddings: A survey</article-title>
          .
          <source>CoRR</source>
          , abs/
          <year>1901</year>
          .09069,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <surname>Yoshua Bengio.</surname>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>CoRR, abs/1409.0473</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yonatan</given-names>
            <surname>Belinkov</surname>
          </string-name>
          , Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and
          <string-name>
            <given-names>James</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <article-title>What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</article-title>
          , pages
          <fpage>861</fpage>
          -
          <lpage>872</lpage>
          , Vancouver, Canada,
          <year>July 2017</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yonatan</given-names>
            <surname>Belinkov</surname>
          </string-name>
          , Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and
          <string-name>
            <given-names>James</given-names>
            <surname>Glass</surname>
          </string-name>
          .
          <article-title>Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks</article-title>
          .
          <source>In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          , Taipei, Taiwan,
          <year>November 2017</year>
          .
          <article-title>Asian Federation of Natural Language Processing</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Terra</given-names>
            <surname>Blevins</surname>
          </string-name>
          ,
          <string-name>
            <surname>Omer Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <article-title>Deep RNNs encode soft hierarchical syntax</article-title>
          .
          <source>In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , pages
          <fpage>14</fpage>
          -
          <lpage>19</lpage>
          , Melbourne, Australia,
          <year>July 2018</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ethan</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Chi</surname>
            , John Hewitt, and
            <given-names>Christopher D.</given-names>
          </string-name>
          <string-name>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Finding universal grammatical relations in multilingual BERT</article-title>
          .
          <source>In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>5564</fpage>
          -
          <lpage>5577</lpage>
          , Online,
          <year>July 2020</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Clark</surname>
          </string-name>
          , Urvashi Khandelwal,
          <string-name>
            <surname>Omer Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>What does BERT look at? An analysis of BERT's attention</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Deerwester</surname>
          </string-name>
          , Susan T. Dumais, George W. Furnas,
          <string-name>
            <surname>Thomas</surname>
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Landauer</surname>
          </string-name>
          , and Richard Harshman.
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In NAACL-HLT</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Dozat</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Deep biaffine attention for neural dependency parsing</article-title>
          .
          <source>In 5th International Conference on Learning Representations, ICLR</source>
          <year>2017</year>
          , Toulon, France,
          <source>April 24-26</source>
          ,
          <year>2017</year>
          , Conference Track Proceedings,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Zellig</given-names>
            <surname>Harris</surname>
          </string-name>
          . Distributional structure.
          <source>Word</source>
          ,
          <volume>10</volume>
          (
          <issue>23</issue>
          ):
          <fpage>146</fpage>
          -
          <lpage>162</lpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Qi</surname>
            <given-names>Liu</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Matt J.</given-names>
            <surname>Kusner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Blunsom</surname>
          </string-name>
          .
          <article-title>A survey on contextual embeddings</article-title>
          .
          <source>ArXiv</source>
          , abs/
          <year>2003</year>
          .07278,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Yinhan</surname>
            <given-names>Liu</given-names>
          </string-name>
          , Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
          <string-name>
            <surname>Omer Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mike</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          . arXiv preprint arXiv:
          <year>1907</year>
          .11692,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Mitchell</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Marcus</surname>
          </string-name>
          , Beatrice Santorini, and Mary Ann Marcinkiewicz.
          <article-title>Building a large annotated corpus of English: The Penn Treebank</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [21]
          <string-name>
            <surname>David Mareček</surname>
            and
            <given-names>Rudolf</given-names>
          </string-name>
          <string-name>
            <surname>Rosa</surname>
          </string-name>
          .
          <article-title>From balustrades to Pierre Vinken: Looking for syntax in transformer self-attentions</article-title>
          .
          <source>In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , pages
          <fpage>263</fpage>
          -
          <lpage>275</lpage>
          , Florence, Italy,
          <year>August 2019</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Bryan</surname>
            <given-names>McCann</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>James</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , Caiming Xiong, and Richard Socher.
          <article-title>Learned in translation: Contextualized word vectors</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>6297</fpage>
          -
          <lpage>6308</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>CoRR, abs/1301</source>
          .3781,
          <year>July 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Tomas</surname>
            <given-names>Mikolov</given-names>
          </string-name>
          , Wen-tau
          <string-name>
            <surname>Yih</surname>
            , and
            <given-names>Geoffrey</given-names>
          </string-name>
          <string-name>
            <surname>Zweig</surname>
          </string-name>
          .
          <article-title>Linguistic regularities in continuous space word representations</article-title>
          .
          <source>In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , pages
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          , Atlanta, Georgia,
          <year>June 2013</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Tomáš</given-names>
            <surname>Musil</surname>
          </string-name>
          .
          <article-title>Examining Structure of Word Embeddings with PCA</article-title>
          . In Text, Speech, and Dialogue, pages
          <fpage>211</fpage>
          -
          <lpage>223</lpage>
          . Springer International Publishing,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          .
          <article-title>Dynamic programming parsing for context-free grammars in continuous speech recognition</article-title>
          .
          <source>IEEE Transactions on Signal Processing</source>
          ,
          <volume>39</volume>
          (
          <issue>2</issue>
          ):
          <fpage>336</fpage>
          -
          <lpage>340</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Jeffrey</surname>
            <given-names>Pennington</given-names>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Matthew</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Peters</surname>
            , Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
            <given-names>Kenton</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>and Luke</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (
          <issue>Long Papers)</issue>
          , New Orleans, Louisiana,
          <year>June 2018</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Alec</given-names>
            <surname>Radford</surname>
          </string-name>
          , Jeff Wu, Rewon Child, David Luan,
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <article-title>Language models are unsupervised multitask learners</article-title>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Raganato</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jörg</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          .
          <article-title>An analysis of encoder representations in transformer-based machine translation</article-title>
          .
          <source>In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , pages
          <fpage>287</fpage>
          -
          <lpage>297</lpage>
          , Brussels, Belgium,
          <year>November 2018</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
          <string-name>
            <given-names>Aidan N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems</source>
          <year>2017</year>
          ,
          4-9 December
          <year>2017</year>
          , Long Beach, CA, USA, pages
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>Jesse</given-names>
            <surname>Vig</surname>
          </string-name>
          .
          <article-title>A multiscale visualization of attention in the transformer model</article-title>
          .
          <source>In Proceedings of the 57th Conference of the Association for Computational Linguistics</source>
          ,
          ACL
          <year>2019</year>
          , Florence, Italy,
          <source>July 28 - August 2</source>
          ,
          <year>2019</year>
          , Volume
          <volume>3</volume>
          :
          System Demonstrations
          , pages
          <fpage>37</fpage>
          -
          <lpage>42</lpage>
          . Association for Computational Linguistics,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Jesse</given-names>
            <surname>Vig</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yonatan</given-names>
            <surname>Belinkov</surname>
          </string-name>
          .
          <article-title>Analyzing the Structure of Attention in a Transformer Language Model</article-title>
          .
          <source>In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          , pages
          <fpage>63</fpage>
          -
          <lpage>76</lpage>
          , Florence, Italy,
          <year>August 2019</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Elena</given-names>
            <surname>Voita</surname>
          </string-name>
          , David Talbot,
          <string-name>
            <given-names>Fedor</given-names>
            <surname>Moiseev</surname>
          </string-name>
          , Rico Sennrich, and
          <string-name>
            <given-names>Ivan</given-names>
            <surname>Titov</surname>
          </string-name>
          .
          <article-title>Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>5797</fpage>
          -
          <lpage>5808</lpage>
          , Florence, Italy,
          <year>July 2019</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Yuxuan</given-names>
            <surname>Wang</surname>
          </string-name>
          , Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu.
          <article-title>Cross-lingual BERT transformation for zero-shot dependency parsing</article-title>
          .
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Adina</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nikita</given-names>
            <surname>Nangia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <article-title>A broad-coverage challenge corpus for sentence understanding through inference</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (
          <issue>Long Papers)</issue>
          , pages
          <fpage>1112</fpage>
          -
          <lpage>1122</lpage>
          , New Orleans, Louisiana,
          <year>June 2018</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>Zhilin</given-names>
            <surname>Yang</surname>
          </string-name>
          , Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>XLNet: Generalized autoregressive pretraining for language understanding</article-title>
          .
          <source>In NeurIPS</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>Kelly W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Samuel R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          .
          <article-title>Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis</article-title>
          .
          <source>In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</source>
          ,
          <year>November 2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>