=Paper=
{{Paper
|id=Vol-2718/paper16
|storemode=property
|title=Syntax Representation in Word Embeddings and Neural Networks – A Survey
|pdfUrl=https://ceur-ws.org/Vol-2718/paper16.pdf
|volume=Vol-2718
|authors=Tomasz Limisiewicz,David Mareček
|dblpUrl=https://dblp.org/rec/conf/itat/LimisiewiczM20
}}
==Syntax Representation in Word Embeddings and Neural Networks – A Survey==
Tomasz Limisiewicz and David Mareček
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University
{limisiewicz,marecek}@ufal.mff.cuni.cz
Abstract: Neural networks trained on natural language processing tasks capture syntax even though it is not provided as a supervision signal. This indicates that syntactic analysis is essential to the understanding of language in artificial intelligence systems. This overview paper covers approaches to evaluating the amount of syntactic information included in the representations of words for different neural network architectures. We mainly summarize research on English monolingual data for language modeling tasks and on multilingual data for neural machine translation systems and multilingual language models. We describe which pre-trained models and representations of language are best suited for transfer to syntactic tasks.

1 Introduction

Modern methods of natural language processing (NLP) are based on complex neural network architectures, where language units are represented in a metric space [23, 28, 29, 9, 30]. Such a phenomenon allows us to express linguistic features (i.e., morphological, lexical, syntactic) mathematically.

The methods of obtaining such representations and their interpretations were described in multiple overview works. Almeida and Xexéo surveyed different types of static word embeddings [1], and Liu et al. [18] focused on contextual representations found in the most recent neural models. Belinkov and Glass [4] surveyed the strategies of interpreting latent representations. To the best of our knowledge, we are the first to focus on the syntactic and morphological abilities of word representations. We also cover the latest approaches, which go beyond the interpretation of latent vectors and analyze the attentions present in state-of-the-art Transformer models.

2 Vector Representations of Words

This section introduces several types of architectures that we will analyze in this work.

2.1 Static Word Embeddings

In the classical methods of language representation, each word is assigned a vector regardless of its current context. In Latent Semantic Analysis [8], the representation was obtained by counting word frequencies across documents on distinct subjects.

In more recent approaches, a shallow neural network is used to predict each word based on its context (Word2Vec [23]) or to approximate the frequency of co-occurrence for a pair of words (GloVe [28]). One explanation of the effectiveness of these algorithms is the distributional hypothesis [11]: "words that occur in the same contexts tend to have similar meanings".

2.2 Contextual Word Vectors in Recurrent Networks

The main disadvantage of static word embeddings is that they do not take into account the context of words. This is especially an issue for languages rich in words that have multiple meanings.

The contextual embeddings introduced in [29] and [22] are able to encode both words and their contexts. They are based on recurrent neural networks (RNNs) and are typically trained on language modeling or machine translation tasks using large text corpora. The outputs of the RNN layers are context-dependent representations that are proven to perform well when used as inputs for other NLP tasks with much less training data available.

Another improvement of context modeling was made possible by the attention mechanism [2]. It allows passing information from the most relevant part of the RNN encoder, instead of using only the contextual representation of the last token.

2.3 Contextual Representation in Transformers

The most recent and widely used architecture is the Transformer [32]. It consists of several (6 to 24) layers, and each token position in each layer has the ability to attend to any position in the previous layer using a self-attention mechanism. Training such an architecture can be easily parallelized, since individual tokens can be processed independently; their positions are encoded within the input embeddings. An example visualization of the attention distribution computed in a Transformer trained for language modeling (BERT [9]) is presented in Figure 1.

In addition to vectors, the Transformer includes a latent representation in the form of self-attention weights, which are two-dimensional matrices. We summarize the research on the syntactic properties of attention weights in Section 5.
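For concreteness, the following is a minimal sketch of how the two kinds of representations discussed above (layer-wise hidden states and per-head self-attention matrices) can be extracted from a pre-trained BERT model. It assumes the HuggingFace transformers and PyTorch packages; the model name and the example sentence are illustrative and not taken from the surveyed experiments.

```python
# Minimal sketch: extracting contextual vectors and self-attention matrices
# from a pre-trained BERT model (assumes the HuggingFace transformers package).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,
    output_attentions=True,
)
model.eval()

sentence = "The old man the boat ."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Tuple with one tensor per layer (plus the embedding layer),
# each of shape (batch, seq_len, hidden_size):
hidden_states = outputs.hidden_states
# Tuple with one tensor per layer, each of shape (batch, heads, seq_len, seq_len):
attentions = outputs.attentions

print(len(hidden_states), hidden_states[-1].shape)
print(len(attentions), attentions[0].shape)
```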
Figure 1: Visualization of the attention mechanism in the Transformer architecture. It shows which parts of the text are important to compute the representation for the word "to". Created in the BertViz framework [33].

Figure 2: Spatial distribution of word embeddings depends on the syntactic roles of words (visualization created by Ashutosh Singh).

3 Measures of Syntactic Information

This section describes the metrics used to evaluate the syntactic information captured by word embeddings and latent representations.

3.1 Syntactic Analogies

In the recent revival of word embeddings [23, 28], a strong focus was put on examining the phenomenon of encoding analogies in multidimensional space. That is to say, the shift vector between pairs of analogous words is approximately constant, e.g., the pairs drinking – drank, swimming – swam in Figure 2.

Syntactic analogies of this type are particularly relevant for this overview. They include the following relations: adjective – adverb; singular – plural; adjective – comparative – superlative; verb – present participle – past participle. The syntactic analogy is usually evaluated on the Google Analogy Test Set [23].^1

^1 The test set is called syntactic by the authors; nevertheless, it mostly focuses on morphological features.

An evaluation example consists of two word pairs represented by the embeddings (v_1, v_2), (u_1, u_2). We compute the analogy shift vector as the difference between the embeddings of the first pair, s = v_2 − v_1. The result is positive if the nearest word embedding to the vector u_1 + s is u_2.

WA = |{(v_1, v_2, u_1, u_2) : u_2 ≈ u_1 + v_2 − v_1}| / |{(v_1, v_2, u_1, u_2)}|    (1)

3.2 Sequence Tagging

Sequence tagging is a multiclass classification problem. The aim is to predict the correct tag for each token of a sequence. A typical example is part-of-speech (POS) tagging. The accuracy evaluation is straightforward: the number of correctly assigned tags is divided by the number of tokens.

3.3 Syntactic Structure Prediction

The inference of reasonable syntactic structures from word representations is the most challenging task covered in our survey. There are attempts to predict both dependency [12, 31, 15, 7] and constituency trees [21, 13].

Dependency trees are evaluated using the unlabeled attachment score (UAS) or its undirected variant (UUAS):

UAS = #correctly_attached_words / #all_words    (2)

The equation for the Labeled Attachment Score (LAS) is the same, but it requires predicting a dependency label for each edge. For constituency trees, we define precision (P) and recall (R) for correctly predicted phrases:

P = #correct_phrases / #predicted_phrases,    R = #correct_phrases / #gold_phrases    (3)

Usually, the F1 score is reported, which is the harmonic mean of precision and recall.

3.4 Attention's Dependency Alignment

In Section 5 we describe the examination of the syntactic properties of self-attention matrices. They can be evaluated using Dependency Alignment [34], which sums the attention weights at the positions corresponding to the pairs of tokens forming a dependency edge in the tree:

DepAl_A = Σ_{(i,j)∈E} A_{i,j} / Σ_{i=1}^{N} Σ_{j=1}^{N} A_{i,j}    (4)

Dependency Accuracy [35, 7, 15] is an alternative metric; for each dependency label it measures how often the relation's governor/dependent is the most attended token by the dependent/governor:

DepAcc_{l,d,A} = |{(i, j) ∈ E_{l,d} : j = argmax A_{i,·}}| / |E_{l,d}|    (5)

Notation: E is the set of all dependency tree edges and E_{l,d} is the subset of edges with label l and direction d; i.e., in the dependent-to-governor direction the first element of the tuple, i, is the dependent of the relation and the second element, j, is the governor. A is a self-attention matrix and A_{i,·} denotes the i-th row of the matrix; N is the sequence length.
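A minimal sketch of the attention-based metrics defined in Equations (4) and (5), assuming NumPy; the attention matrix and gold edges are toy values, and the labels and directions of Equation (5) are collapsed into a single dependent-to-governor case.

```python
# Toy sketch of Dependency Alignment (Eq. 4) and Dependency Accuracy (Eq. 5);
# the attention matrix and gold edges below are invented for illustration.
import numpy as np

A = np.array([[0.1, 0.7, 0.1, 0.1],   # self-attention matrix, rows sum to 1
              [0.2, 0.1, 0.6, 0.1],
              [0.3, 0.3, 0.2, 0.2],
              [0.1, 0.1, 0.7, 0.1]])
# Gold dependency edges as (dependent, governor) index pairs.
gold_edges = [(0, 1), (1, 2), (3, 2)]

def dependency_alignment(A, edges):
    """Eq. (4): share of total attention mass lying on gold dependency edges."""
    return sum(A[i, j] for i, j in edges) / A.sum()

def dependency_accuracy(A, edges):
    """Eq. (5), dependent-to-governor direction, all labels pooled:
    how often the governor is the most attended token of its dependent."""
    hits = sum(1 for i, j in edges if np.argmax(A[i]) == j)
    return hits / len(edges)

print(dependency_alignment(A, gold_edges))   # 0.5 for this toy matrix
print(dependency_accuracy(A, gold_edges))    # 1.0 for this toy matrix
```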
4 Morphology and Syntax in Word Embeddings and Latent Vectors

In this section, we summarize the research on the syntactic information captured by vector representations of words. We devote significant attention to POS tagging, which is a popular evaluation objective. Even though it is a morphological task, it is highly relevant to syntactic analysis.

4.1 Syntactic Analogies

The first wave of research on the vector representation of words focused on the statistical distribution of words across distinct topics – Latent Semantic Analysis [8]. It captured statistical properties of words, yet there were no positive results in syntactic analogy retrieval nor in encoding syntax.

The Google Analogy Test Set was released together with a popular word embedding algorithm, Word2Vec [23]. One of the exceptional properties of this method was its high accuracy on the analogy tasks. In particular, the best configuration found the correct syntactic analogy in 68.9% of cases.

The GloVe embeddings improved the results on syntactic analogies to 69.3% [28]. A much more significant improvement was reported for semantic analogies. They also outperform a variety of other vectorization methods.

In [24], a simple recurrent neural network was trained with a language modeling objective. The word representation is taken from the input layer. The evaluation in [23] shows that Word2Vec performs better in the syntactic analogy task. This observation is surprising, because representations from RNNs were proven effective in transfer to other syntactic tasks (we elaborate on that in Sections 4.2 and 4.3). We think that possible explanations could be: 1. the techniques of RNN training have crucially improved in recent years; 2. the syntactic analogy focuses on particular words, while for other syntactic tasks the context is more important.

4.2 Part of Speech Tagging

Measuring to what extent a linguistic feature such as POS is captured in word representations is usually performed by a method called probing. In probing, the parameters of the pretrained network are fixed, the output word representations are computed as in inference mode and then fed to a simple neural layer. Only this simple layer is optimized for the new task.

The number of probing experiments rose with the advent of multilayer^2 RNNs trained for language modeling and machine translation.

^2 Layer numbering in this work: we number layers starting from one for the layer closest to the input. Please note that original papers may use different numbering.

Belinkov et al. [3] probe a recurrent neural machine translation (NMT) system with four layers to predict part of speech tags (along with morphological features). They use Arabic, Hebrew, French, German, and Czech to English pairs. They observe that adding a character-based representation computed by a convolutional neural network in addition to the word-embedding input is beneficial, especially for morphologically rich languages.

In a subsequent study [4], the source language of translation is English and the experiments are conducted solely for this language. It is noted that the most morphosyntactic representation is usually obtained in the middle layers of the network.

The influence of using a particular objective for pre-training an RNN model is comprehensively analyzed by Blevins et al. [5]. They pre-train models on four objectives: syntactic parsing, semantic role labeling, machine translation, and language modeling. The former two objectives may reveal morphosyntactic information to a larger extent than the other settings mentioned here. In particular, the probe of the RNN syntactic parser achieves near-perfect accuracy in part of speech tagging.

The introduction of ELMo [29] brought a remarkable advancement in transfer learning from the RNN language model to a variety of other NLP tasks. The authors examined the POS capabilities of the representations and compared the results with the neural machine translation system CoVe [22], which also uses an RNN architecture.

Zhang et al. [39] perform further experiments with CoVe and ELMo. They demonstrate that language modeling systems are better suited to capture morphology and syntax in the hidden states than machine translation, if comparable amounts of data are used to train both systems. Moreover, the corpora for language modeling are typically more extensive than for machine translation, which can further improve the results.

Another comprehensive evaluation of the morphological and syntactic capabilities of language models was conducted by Liu et al. [17]. Probing was applied to a language model based on the Transformer architecture (BERT) and compared with ELMo and static word embeddings (Word2Vec). They observe that the hidden states of the Transformer do not demonstrate a major increase in probed POS accuracy over the RNN model, even though it is more complex and has a larger number of parameters.

POS tag probing was also performed for languages other than English. For instance, Musil [25] trains translation systems (with RNN and Transformer architectures) from Czech to English, examines the learned input embeddings of the models, and compares them to a Word2Vec model trained on Czech.
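The probing setup described in this section can be sketched as follows, assuming scikit-learn; the vectors and tags are random stand-ins for representations extracted from a frozen pre-trained model and a POS-annotated corpus.

```python
# Sketch of POS-tag probing: a simple classifier is trained on frozen word
# vectors; the pre-trained network itself is never updated.
# Assumes scikit-learn; the arrays below stand in for pre-computed data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for contextual vectors taken from a frozen model (inference mode)
# and their gold POS tags; in a real probe these come from an annotated corpus.
train_vectors = rng.normal(size=(1000, 768))
train_tags = rng.integers(0, 17, size=1000)       # e.g. 17 UPOS classes
test_vectors = rng.normal(size=(200, 768))
test_tags = rng.integers(0, 17, size=200)

probe = LogisticRegression(max_iter=1000)          # the only trained component
probe.fit(train_vectors, train_tags)
accuracy = probe.score(test_vectors, test_tags)    # correct tags / all tokens
print(f"POS probing accuracy: {accuracy:.3f}")
```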
Figure 3: Accuracy of POS tag probing from RNN representations by the pre-training objective. (a) Neural machine translation compared with language modeling pre-training; (b) neural machine translation compared with auto-encoder pre-training. Data points are taken from Blevins et al. 2018 [5], Peters et al. 2018 [29], Zhang and Bowman 2018 [39], Belinkov et al. 2017a [3], and Belinkov et al. 2017b [4].
Figure 4: Accuracy of POS tag probing from RNN latent vectors compared with static word embeddings. Data points are taken from Belinkov et al. 2017b [4], Blevins et al. 2018 [5], Musil 2019 [25], and Liu et al. 2019 [17].

In Figures 3 and 4, we present a comparison of different settings for POS tag probing. Each point denotes a pair of results obtained in the same paper and on the same dataset, but with different types of embeddings or pretraining objectives. Therefore, we can observe that the setting plotted on the y-axis is better than the x-axis setting if the points lie above the identity function (red dashed line). We cannot say whether a method represented by another point performs better, as the evaluation settings differ.

Figure 4 clearly shows that RNN contextualization helps in part of speech tagging. As expected, the information about neighboring tokens is essential to predict the morphosyntactic functions of words correctly. This is especially true for homographs, which can have different parts of speech in different places in the text.

The influence of the RNN's pre-training task is presented in Figure 3. Machine translation captures much better POS information than auto-encoders, which can be interpreted as translation from and to the same language. It is likely that the latter task is straightforward and therefore does not require encoding morphosyntax in the latent space. The difference between the results of machine translation and language modeling is small. Zhang et al. [39] show that using a larger corpus for pre-training improves the POS accuracy. The main advantage of language models is that monolingual data is much easier to obtain than the parallel sentences necessary to train a machine translation system.

4.3 Syntactic Structure Induction

Extraction of dependency structure is more demanding because, instead of a prediction for single tokens, every pair of words needs to be evaluated.

Blevins et al. [5] propose a feed-forward layer on top of a frozen RNN representation to predict whether a dependency tree edge connects a pair of tokens. They concatenate the vector representations of the two words and their element-wise product. Such a representation is fed as an input to the binary classifier. It only looks at a pair of tokens at a time; therefore, the predicted edges may not form a valid tree.

Another approach, induction of whole syntactic structures from latent representations, was proposed by Hewitt and Manning [12]. Their syntactic probing is based on training a matrix which is used to transform the output of the network's layers (they use BERT and ELMo). The objective of the probing is to approximate dependency tree distances between tokens^3 by the L2 norm of the difference of the transformed vectors. Probing thus produces approximate syntactic pairwise distances for each pair of tokens. The minimum spanning tree algorithm is then used on the distance matrix to find the undirected dependency tree. The best configuration employs the 15th layer of BERT large and induces a treebank with 82.5% UAS on the Penn Treebank with Stanford Dependency annotation (relation directions and punctuation were disregarded in the experiments). The result for BERT is significantly higher than for ELMo, which gave 77.0% when the first layer was probed.

^3 Tree distance is the length of the tree path between two tokens.
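A simplified sketch of this distance-probe idea follows, assuming NumPy and SciPy; the probe matrix here is random rather than trained, so the decoded tree is only illustrative.

```python
# Simplified sketch of a structural distance probe: a matrix B maps frozen
# hidden states to a space where squared L2 distances approximate tree
# distances; an undirected tree is then decoded with a minimum spanning tree.
# B is random here; in the actual probe it is trained on a treebank.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 768))        # stand-in for one sentence's vectors
B = rng.normal(size=(128, 768)) * 0.01    # probe matrix (trained in practice)

transformed = hidden @ B.T                            # (tokens, probe_dim)
diff = transformed[:, None, :] - transformed[None, :, :]
distances = (diff ** 2).sum(-1)                       # predicted squared tree distances

# Decode the undirected tree: keep the edges with the smallest predicted distances.
tree = minimum_spanning_tree(distances).toarray()
n = len(hidden)
edges = [(i, j) for i in range(n) for j in range(n) if tree[i, j] > 0]
print(edges)   # undirected edges; compare against gold edges to compute UUAS
```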
The paper also describes an alternative method of approximating the syntactic depth of a word by the L2 norm of its latent vector multiplied by a trainable matrix. The estimated depths allow prediction of the root of a sentence with 90.1% accuracy when the representation from the 16th layer of BERT large is probed.

4.4 Multilingual Representations

The subsequent paper by Chi et al. [6] applies the setting from [12] to the multilingual language model mBERT. They train syntactic distance probes on 11 languages and compare the UAS of induced trees in four scenarios: 1. training and evaluating on the same language; 2. training on a single language, evaluating on a different one; 3. training on all languages except the evaluation one; 4. training on all languages, including the evaluation one. They demonstrate that the transfer is effective, as the results in all the configurations outperform the baselines.^4 Even in the hardest case – zero-shot transfer from just one language – the result is at least 6.9 percentage points above the baselines (for Chinese). Nevertheless, for all the languages, no transfer-learning setting can beat training and evaluating a probe on the same language.

^4 There are two baselines: a right-branching tree and probing on randomly initialized mBERT without pretraining.

The paper also includes an analysis of intrinsic features of BERT's vectors transformed by the probe. Noticeably, the vector differences between the representations of words connected by a dependency relation are clustered by relation labels, see Figure 5.

Figure 5: Two-dimensional t-SNE visualization of probed mBERT embeddings from [6]. Analysis of the clusters shows that embeddings encode information about the type of dependency relations and, to a lesser extent, language.

Multilingual BERT embeddings are also analyzed by Wang et al. [36]. They show that even for the multilingual vectors, the results can be improved by projecting vector spaces across languages. They use the Biaffine Graph-based Parser by Dozat and Manning [10], which consists of multiple RNN layers. Therefore, the experiment is not strictly comparable with probing, as most of the syntactic information is captured by the parser and not by the embeddings. The article compares different types of vector representations fed as an input to the parser. It is demonstrated that a cross-lingual transformation of mBERT embeddings improves the results significantly in the LAS of a parser trained on English and evaluated on 14 languages (including English); on average, from 60.53% to 63.54%. In comparison to other cross-lingual representations, the proposed method outperforms transformed static embeddings (FastText with SVD) and also slightly outperforms contextual embeddings (XLM).

5 Syntax in Transformer's Attention Matrices

Besides the vector representations of individual tokens, the Transformer architecture offers another representation with a possible syntactic interpretation – the weights of the self-attention heads. In each head, information can flow from each token to any other one. These connections may be easily analyzed and compared to syntactic relations proposed by linguists. In this section, we summarize different approaches to extracting syntax from attention. We present methods for both dependency and constituency structures.

5.1 Dependency Trees

Raganato and Tiedemann [31] induce dependency trees from the self-attention matrices of a neural machine translation encoder. They use the maximum spanning tree algorithm to connect pairs of tokens with high attention. Gold root information is used to determine the direction of the edges. Trees extracted in this way are generally worse than the right-branching baseline (35.08% UAS on PUD) and outperform it slightly only in a few heads. The maximum UAS is obtained when a dependency structure is induced from one head of the 5th layer of the English-to-Chinese encoder – 38.87% UAS. Nevertheless, their approach assumes that the whole syntactic tree can be induced from just one attention head.
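The following sketch illustrates tree induction from a single attention head with a (maximum) spanning tree, in the spirit of this approach, assuming NumPy and SciPy; the attention matrix is a toy example and the gold-root direction step is omitted, so the result is an undirected tree.

```python
# Sketch: induce an undirected dependency tree from one self-attention head
# by a maximum spanning tree over (symmetrized) attention weights.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

A = np.array([[0.05, 0.80, 0.10, 0.05],   # toy attention matrix for 4 tokens
              [0.30, 0.05, 0.60, 0.05],
              [0.10, 0.20, 0.05, 0.65],
              [0.10, 0.15, 0.70, 0.05]])

sym = A + A.T
# Convert to positive "costs" so that larger attention means smaller cost;
# scipy's minimum spanning tree then yields the maximum spanning tree
# over attention. Zero entries (the diagonal) are treated as missing edges.
costs = sym.max() + 1e-6 - sym
np.fill_diagonal(costs, 0.0)

tree = minimum_spanning_tree(costs).toarray()
edges = sorted((min(i, j), max(i, j)) for i, j in zip(*np.nonzero(tree)))
print(edges)   # candidate undirected edges, to be scored with UAS/UUAS
```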
Table 1: Summary of syntactic properties observed in Transformer's self-attention heads.

Research | Transformer model | Type of tree | Syntactic evaluation | Evaluation data | Percentage of syntactic heads
Raganato and Tiedemann 2019 [31] | NMT encoder (6 layers, 8 heads) | Dependency | Tree induction | PUD [27] | 0%–8%^5
Vig and Belinkov 2019 [34] | LM (GPT-2) | Dependency | Dependency Alignment | Wikipedia (automatically annotated) | –
Clark et al. 2019 [7] | LM (BERT) | Dependency | Dependency Accuracy, Tree induction | WSJ Penn Treebank [20] | –
Voita et al. 2019 [35] | NMT encoder (6 layers, 8 heads) | Dependency | Dependency Accuracy | WMT, OpenSubtitles [16] (both automatically annotated) | 15%–19%
Limisiewicz et al. 2020 [15] | LMs (BERT, mBERT) | Dependency | Dependency Accuracy, Tree induction | PUD [27], EuroParl [14] (automatically annotated) | 46%
Mareček and Rosa 2019 [21] | NMT encoder (6 layers, 16 heads) | Constituency | Tree induction | EuroParl [14] (automatically annotated) | 19%–33%
Kim et al. 2019 [13] | LMs (BERT, GPT-2, RoBERTa, XLNet) | Constituency | Tree induction | WSJ Penn Treebank [20], MNLI [37] | –

^5 A head is syntactic when the tree extracted from it surpasses the right-branching chain in terms of UAS. This is a strong baseline for syntactic trees in English; thus only a few heads are recognized as syntactic.

Recent articles have focused on the analysis of features and classification of the Transformer's self-attention heads. Vig and Belinkov [34] apply multiple metrics to examine the properties of attention matrices computed in a unidirectional language model (GPT-2 [30]). They show that in some heads the attention concentrates on tokens representing specific POS tags, and that pairs of tokens attend more often to one another if an edge in the dependency tree connects them, i.e., dependency alignment is high. They observe that the strongest dependency alignment occurs in the middle layers of the model – the 4th and 5th. They also point out that different dependency types (labels) are captured in different places of the model. Attention in the upper layers aligns more with subject relations, whereas in the lower layers it aligns with modifying relations, such as auxiliaries, determiners, conjunctions, and expletives.

Voita et al. [35] also observed alignment with dependency relations in the encoders of neural machine translation systems from English to Russian, German, or French. They evaluated dependency accuracy for four dependency labels: noun subject, direct object, adjective modifier, and adverbial modifier. They separately address the cases where a verb attends to a dependent subject and where a subject attends to its governing verb. Heads with more than a 10% improvement over a positional baseline are identified as syntactic.^6 Such heads are found in all encoder layers except the first one. In further experiments, the authors propose an algorithm to prune heads from the model with a minimal decrease in translation performance. During pruning, the share of syntactic heads rises from 17% in the original model to 40% when 75% of the heads are cut out, while the change in translation score is negligible. These results support the claim that the model's ability to capture syntax is essential to its performance in non-syntactic tasks.

^6 In the positional baseline, the most frequent offset is added to the index of the relation's dependent/governor to find its governor/dependent, e.g., for adjective-to-noun relations the most frequent offset is +1 in English.

A similar evaluation of dependency accuracy for the BERT language model was conducted by Clark et al. [7]. They identify syntactic heads that significantly outperform a positional baseline for the following labels: prepositional object, determiner, direct object, possession modifier, auxiliary passive, clausal complement, marker, phrasal verb particle. The syntactic heads are found in the middle layers (4th to 8th). However, there is no single head that would capture the information for all the relations.

In another experiment, Clark et al. [7] induce a dependency tree from attentions. Instead of extracting a structure from each head [31], they use probing to find a weighted average of all heads. The maximum spanning tree algorithm is used to induce the dependency structure from the average. This approach produces trees with 61% UAS and can be improved to 77% by making the weights dependent on static word representations (fixed GloVe vectors). Both numbers are significantly higher than the right-branching baseline of 27%.

A related analysis for English (BERT) and the multilingual variant (mBERT) was conducted by Limisiewicz et al. [15]. We observed that the information about one dependency type is split across many self-attention heads, and in other cases the opposite happens – many heads have the same syntactic function. We extract labeled dependency trees from the averaged heads, achieving 52% UAS, and show that in the multilingual model (mBERT) specific relations (noun subject, determiner) are found in the same heads across typologically similar languages.
Figure 6: Self-attentions in particular heads of a language model (BERT) align with the dependency relations adjective modifier (AMOD) and object (OBJ). The gold relations are marked with Xs.

5.2 Constituency Trees

There are fewer papers devoted to deriving constituency syntax tree structures.

Mareček and Rosa [21] examined the encoders of machine translation systems for translation between English, French, and German. We observed that in some heads, stretches of words attend to the same token, forming shapes similar to balustrades (Figure 7). Furthermore, those stretches usually overlap with syntactic phrases. This notion is employed in a new method for constituency tree induction. In the algorithm, the weights for each stretch of tokens are computed by summing the attention focused on the balustrades, and a constituency tree is then induced with the CKY algorithm [26]. As a result, we produce trees that achieve up to a 32.8% F1 score for English sentences, 43.6% for German, and 44.2% for French.^7 The results can be improved by selecting syntactic heads and using only them in the algorithm. This approach requires a sample of 100 annotated sentences for head selection and raises F1 by up to 8.10 percentage points in English.

^7 The evaluation was done on 1000 sentences for each language, parsed with the supervised Stanford Parser.

Figure 7: Balustrades observed in the NMT encoder tend to overlap with syntactic phrases.

The extraction of constituency trees from language models was described by Kim et al. [13]. They present a comprehensive study that covers nine types of pre-trained networks: BERT (base, large), GPT-2 [30] (original, medium), RoBERTa [19] (base, large), and XLNet [38] (base, large). Their approach is based on computing a distance between each pair of subsequent words. In each step, they branch the tree at the place where the distance is the highest. The authors try three distance measures on the vector outputs of the encoder layers (cosine, L1, and L2 distances for pairs of vectors) and two distance measures on the distributions of a token's attention (Jensen-Shannon and Hellinger distances for pairs of distributions). In the former case, distances are computed only per layer, and in the latter case for each head and for the average of heads in one layer. The best setting achieves a 40.1% F1 score on the WSJ Penn Treebank; it uses XLNet-base and the Hellinger distance on averaged attentions in the 7th layer. Generally, attention distribution distances perform better than vector ones. The authors also observe that models trained with a regular language modeling objective (i.e., next word prediction in GPT-2, XLNet) capture syntax better than masked language models (BERT, RoBERTa). In line with the previous research, the middle layers tend to be more syntactic.
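A small sketch of this top-down splitting procedure; the adjacent-word distances are invented for illustration, whereas in [13] they come from hidden states or attention distributions of a pre-trained model.

```python
# Sketch of top-down constituency splitting: the sentence is recursively
# split at the adjacent word pair with the largest distance.
def split_tree(words, gaps):
    """Build a nested-list binary tree over `words` from adjacent-pair distances."""
    if len(words) <= 1:
        return words[0]
    k = max(range(len(gaps)), key=gaps.__getitem__)   # largest distance = split point
    left = split_tree(words[:k + 1], gaps[:k])
    right = split_tree(words[k + 1:], gaps[k + 1:])
    return [left, right]

words = ["the", "old", "dog", "chased", "a", "cat"]
gaps = [0.2, 0.3, 0.9, 0.4, 0.1]   # one (invented) distance per adjacent word pair
print(split_tree(words, gaps))
# [[['the', 'old'], 'dog'], ['chased', ['a', 'cat']]]
```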
5.3 Syntactic Information across Layers

Figure 8: Relative syntactic information across attention models and layers. The values are normalized so that the best layer for each method has the value 1.0. The methods A), B), C), and G) show undirected UAS of trees extracted by probing the n-th layer [12, 6]. The method D) shows the dependency alignment averaged across all heads in each layer [34]. The methods E) and F) show the UAS of trees induced from attention heads by the maximum spanning tree algorithm [31, 15]. The results for the best layer (corresponding to the value 1.0 in the plot) are: A) 82.5; B) 79.8; C) 80.1; D) 22.3; E) 24.3; F) en2cs: 23.9, en2de: 20.9, en2et: 22.1, en2fi: 24.0, en2ru: 22.4, en2tr: 17.5, en2zh: 21.6; G) 77.0.
Figure 8 summarizes the evaluation of syntactic information across layers for different approaches. In the Transformer-based language models BERT, mBERT, and GPT-2, the middle layers are the most syntactic. In neural machine translation models, the top layers of the encoder are the most syntactic. However, it is important to note that the NMT Transformer encoder is only the first half of the whole translation architecture, and therefore the most syntactic layers are, in fact, in the middle of the process. In the RNN language model (ELMo), the first layer is more syntactic than the second one.

We conjecture that the initial Transformer layers capture simple relations (e.g., attending to the next or previous token) and the last layers mostly capture task-specific information. Therefore, they are less syntactic.

We also observe that in supervised probing [12, 6], better results are obtained from the initial and top layers than in unsupervised structure induction [31, 15], i.e., the distribution across layers is smoother.

6 Conclusion

In this overview, we surveyed how syntactic structures are latently learned by neural models trained for natural language processing tasks. We have compared multiple approaches from the literature and described the features that affect the ability to capture syntax. The following aspects tend to improve the performance on syntactic tasks such as POS tagging:

1. Using contextual embeddings from RNNs or Transformers outperforms static word embeddings (Word2Vec, GloVe).

2. Pretraining on tasks with masked input (language modeling or machine translation) produces better syntactic representations than auto-encoding.

3. The advantage of language modeling over machine translation is the fact that larger corpora are available for pretraining.

Our meta-analysis of latent states showed that the most syntactic representations can be found in the middle layers of the models. They tend to capture more complex relations than the initial layers, and the representations are less dependent on the pretraining objectives than in the top layers.

We have shown to what extent systems trained for a non-syntactic task can learn grammatical structures. The question we leave for further research is whether providing explicit syntactic information to the model can improve its performance on other NLP tasks.

Acknowledgments

This work has been supported by grant 18-02196S of the Czech Science Foundation. It has been using language resources and tools developed, stored, and distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018101).
References

[1] Felipe Almeida and Geraldo Xexéo. Word embeddings: A survey. CoRR, abs/1901.09069, 2019.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.

[3] Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada, July 2017. Association for Computational Linguistics.

[4] Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.

[5] Terra Blevins, Omer Levy, and Luke Zettlemoyer. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[6] Ethan A. Chi, John Hewitt, and Christopher D. Manning. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online, July 2020. Association for Computational Linguistics.

[7] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention, 2019.

[8] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

[10] Timothy Dozat and Christopher D. Manning. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[11] Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.

[12] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In NAACL-HLT, 2019.

[13] Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In International Conference on Learning Representations, January 2020.

[14] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. 2004.

[15] Tomasz Limisiewicz, Rudolf Rosa, and David Mareček. Universal Dependencies according to BERT: both more specific and more general. ArXiv, abs/2004.14620, 2020.

[16] Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA).

[17] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In NAACL-HLT, 2019.

[18] Qi Liu, Matt J. Kusner, and Phil Blunsom. A survey on contextual embeddings. ArXiv, abs/2003.07278, 2020.

[19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[20] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[21] David Mareček and Rudolf Rosa. From balustrades to Pierre Vinken: Looking for syntax in transformer self-attentions. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 263–275, Florence, Italy, August 2019. Association for Computational Linguistics.

[22] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308, 2017.

[23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, July 2013.

[24] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

[25] Tomáš Musil. Examining structure of word embeddings with PCA. In Text, Speech, and Dialogue, pages 211–223. Springer International Publishing, 2019.

[26] H. Ney. Dynamic programming parsing for context-free grammars in continuous speech recognition. IEEE Transactions on Signal Processing, 39(2):336–340, 1991.

[27] Joakim Nivre, Željko Agić, Lars Ahrenberg, et al. Universal Dependencies 2.0 – CoNLL 2017 shared task development and test data, 2017. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

[28] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[29] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

[30] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

[31] Alessandro Raganato and Jörg Tiedemann. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium, November 2018. Association for Computational Linguistics.

[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008, 2017.

[33] Jesse Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 3: System Demonstrations, pages 37–42. Association for Computational Linguistics, 2019.

[34] Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy, August 2019. Association for Computational Linguistics.

[35] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy, July 2019. Association for Computational Linguistics.

[36] Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. Cross-lingual BERT transformation for zero-shot dependency parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

[37] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

[38] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.

[39] Kelly W. Zhang and Samuel R. Bowman. Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, November 2018.