=Paper=
{{Paper
|id=Vol-2718/paper16
|storemode=property
|title=Syntax Representation in Word Embeddings and Neural Networks – A Survey
|pdfUrl=https://ceur-ws.org/Vol-2718/paper16.pdf
|volume=Vol-2718
|authors=Tomasz Limisiewicz,David Mareček
|dblpUrl=https://dblp.org/rec/conf/itat/LimisiewiczM20
}}
==Syntax Representation in Word Embeddings and Neural Networks – A Survey==
Tomasz Limisiewicz and David Mareček
Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University
{limisiewicz,marecek}@ufal.mff.cuni.cz
Abstract: Neural networks trained on natural language processing tasks capture syntax even though it is not provided as a supervision signal. This indicates that syntactic analysis is essential to the understanding of language in artificial intelligence systems. This overview paper covers approaches to evaluating the amount of syntactic information included in the representations of words for different neural network architectures. We mainly summarize research on English monolingual data for language modeling tasks and on multilingual data for neural machine translation systems and multilingual language models. We describe which pre-trained models and representations of language are best suited for transfer to syntactic tasks.

1 Introduction

Modern methods of natural language processing (NLP) are based on complex neural network architectures, where language units are represented in a metric space [23, 28, 29, 9, 30]. Such a phenomenon allows us to express linguistic features (i.e., morphological, lexical, syntactic) mathematically.

The methods of obtaining such representations and their interpretations were described in multiple overview works. Almeida and Xexéo surveyed different types of static word embeddings [1], and Liu et al. [18] focused on contextual representations found in the most recent neural models. Belinkov and Glass [4] surveyed the strategies of interpreting latent representations. To the best of our knowledge, we are the first to focus on the syntactic and morphological abilities of word representations. We also cover the latest approaches, which go beyond the interpretation of latent vectors and analyze the attentions present in state-of-the-art Transformer models.

2 Vector Representations of Words

This section introduces several types of architectures that we will analyze in this work.

2.1 Static Word Embeddings

In the classical methods of language representation, each word is assigned a vector regardless of its current context. In Latent Semantic Analysis [8], the representation was obtained by counting word frequencies across documents on distinct subjects.

In more recent approaches, a shallow neural network is used to predict each word based on its context (Word2Vec [23]) or to approximate the frequency of co-occurrence for a pair of words (GloVe [28]). One explanation of the effectiveness of these algorithms is the distributional hypothesis [11]: "words that occur in the same contexts tend to have similar meanings".

2.2 Contextual Word Vectors in Recurrent Networks

The main disadvantage of static word embeddings is that they do not take into account the context of words. This is especially an issue for languages rich in words that have multiple meanings.

The contextual embeddings introduced in [29] and [22] are able to encode both words and their contexts. They are based on recurrent neural networks (RNNs) and are typically trained on language modeling or machine translation tasks using large text corpora. The outputs of the RNN layers are context-dependent representations that are proven to perform well when used as inputs for other NLP tasks with much less training data available.

Another improvement of context modeling was made possible by the attention mechanism [2]. It allows passing information from the most relevant part of the RNN encoder, instead of using only the contextual representation of the last token.

2.3 Contextual Representation in Transformers

The most recent and widely used architecture is the Transformer [32]. It consists of several (6 to 24) layers, and each token position in each layer has the ability to attend to any position in the previous layer using a self-attention mechanism. Training such an architecture can be easily parallelized, since individual tokens can be processed independently; their positions are encoded within the input embeddings. An example visualization of the attention distribution computed in a Transformer trained for language modeling (BERT [9]) is presented in Figure 1.

In addition to vectors, the Transformer includes a latent representation in the form of self-attention weights, which are two-dimensional matrices. We summarize the research on the syntactic properties of attention weights in Section 5.
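For concreteness, the following is a minimal sketch of how the two kinds of representations discussed above (layer-wise hidden states and per-head self-attention matrices) can be extracted from a pre-trained BERT model. It assumes the HuggingFace transformers and PyTorch packages; the model name and the example sentence are illustrative and not taken from the surveyed experiments.

```python
# Minimal sketch: extracting contextual vectors and self-attention matrices
# from a pre-trained BERT model (assumes the HuggingFace transformers package).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained(
    "bert-base-uncased",
    output_hidden_states=True,
    output_attentions=True,
)
model.eval()

sentence = "The old man the boat ."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Tuple with one tensor per layer (plus the embedding layer),
# each of shape (batch, seq_len, hidden_size):
hidden_states = outputs.hidden_states
# Tuple with one tensor per layer, each of shape (batch, heads, seq_len, seq_len):
attentions = outputs.attentions

print(len(hidden_states), hidden_states[-1].shape)
print(len(attentions), attentions[0].shape)
```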
Figure 1: Visualization of the attention mechanism in the Transformer architecture. It shows which parts of the text are important to compute the representation for the word "to". Created in the BertViz framework [33].

Figure 2: Spatial distribution of word embeddings depends on the syntactic roles of words (visualization created by Ashutosh Singh).

3 Measures of Syntactic Information

This section describes the metrics used to evaluate the syntactic information captured by word embeddings and latent representations.

3.1 Syntactic Analogies

In the recent revival of word embeddings [23, 28], a strong focus was put on examining the phenomenon of encoding analogies in multidimensional space. That is to say, the shift vector between pairs of analogous words is approximately constant, e.g., the pairs drinking – drank, swimming – swam in Figure 2.

Syntactic analogies of this type are particularly relevant for this overview. They include the following relations: adjective – adverb; singular – plural; adjective – comparative – superlative; verb – present participle – past participle. The syntactic analogy is usually evaluated on the Google Analogy Test Set [23].^1

^1 The test set is called syntactic by the authors; nevertheless, it mostly focuses on morphological features.

An evaluation example consists of two word pairs represented by the embeddings (v_1, v_2), (u_1, u_2). We compute the analogy shift vector as the difference between the embeddings of the first pair, s = v_2 − v_1. The result is positive if the nearest word embedding to the vector u_1 + s is u_2.

WA = |{(v_1, v_2, u_1, u_2) : u_2 ≈ u_1 + v_2 − v_1}| / |{(v_1, v_2, u_1, u_2)}|    (1)

3.2 Sequence Tagging

Sequence tagging is a multiclass classification problem. The aim is to predict the correct tag for each token of a sequence. A typical example is part-of-speech (POS) tagging. The accuracy evaluation is straightforward: the number of correctly assigned tags is divided by the number of tokens.

3.3 Syntactic Structure Prediction

The inference of reasonable syntactic structures from word representations is the most challenging task covered in our survey. There are attempts to predict both dependency [12, 31, 15, 7] and constituency trees [21, 13].

Dependency trees are evaluated using the unlabeled attachment score (UAS) or its undirected variant (UUAS):

UAS = #correctly_attached_words / #all_words    (2)

The equation for the Labeled Attachment Score (LAS) is the same, but it requires predicting a dependency label for each edge. For constituency trees, we define precision (P) and recall (R) for correctly predicted phrases:

P = #correct_phrases / #predicted_phrases,    R = #correct_phrases / #gold_phrases    (3)

Usually, the F1 score is reported, which is the harmonic mean of precision and recall.

3.4 Attention's Dependency Alignment

In Section 5 we describe the examination of the syntactic properties of self-attention matrices. They can be evaluated using Dependency Alignment [34], which sums the attention weights at the positions corresponding to the pairs of tokens forming a dependency edge in the tree:

DepAl_A = Σ_{(i,j)∈E} A_{i,j} / Σ_{i=1}^{N} Σ_{j=1}^{N} A_{i,j}    (4)

Dependency Accuracy [35, 7, 15] is an alternative metric; for each dependency label it measures how often the relation's governor/dependent is the most attended token by the dependent/governor:

DepAcc_{l,d,A} = |{(i, j) ∈ E_{l,d} : j = argmax A_{i,·}}| / |E_{l,d}|    (5)

Notation: E is the set of all dependency tree edges and E_{l,d} is the subset of edges with label l and direction d; i.e., in the dependent-to-governor direction the first element of the tuple, i, is the dependent of the relation and the second element, j, is the governor. A is a self-attention matrix and A_{i,·} denotes the i-th row of the matrix; N is the sequence length.
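A minimal sketch of the attention-based metrics defined in Equations (4) and (5), assuming NumPy; the attention matrix and gold edges are toy values, and the labels and directions of Equation (5) are collapsed into a single dependent-to-governor case.

```python
# Toy sketch of Dependency Alignment (Eq. 4) and Dependency Accuracy (Eq. 5);
# the attention matrix and gold edges below are invented for illustration.
import numpy as np

A = np.array([[0.1, 0.7, 0.1, 0.1],   # self-attention matrix, rows sum to 1
              [0.2, 0.1, 0.6, 0.1],
              [0.3, 0.3, 0.2, 0.2],
              [0.1, 0.1, 0.7, 0.1]])
# Gold dependency edges as (dependent, governor) index pairs.
gold_edges = [(0, 1), (1, 2), (3, 2)]

def dependency_alignment(A, edges):
    """Eq. (4): share of total attention mass lying on gold dependency edges."""
    return sum(A[i, j] for i, j in edges) / A.sum()

def dependency_accuracy(A, edges):
    """Eq. (5), dependent-to-governor direction, all labels pooled:
    how often the governor is the most attended token of its dependent."""
    hits = sum(1 for i, j in edges if np.argmax(A[i]) == j)
    return hits / len(edges)

print(dependency_alignment(A, gold_edges))   # 0.5 for this toy matrix
print(dependency_accuracy(A, gold_edges))    # 1.0 for this toy matrix
```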
4 Morphology and Syntax in Word Embeddings and Latent Vectors

In this section, we summarize the research on the syntactic information captured by vector representations of words. We devote significant attention to POS tagging, which is a popular evaluation objective. Even though it is a morphological task, it is highly relevant to syntactic analysis.

4.1 Syntactic Analogies

The first wave of research on the vector representation of words focused on the statistical distribution of words across distinct topics – Latent Semantic Analysis [8]. It captured statistical properties of words, yet there were no positive results in syntactic analogy retrieval nor in encoding syntax.

The Google Analogy Test Set was released together with a popular word embedding algorithm, Word2Vec [23]. One of the exceptional properties of this method was its high accuracy on the analogy tasks. In particular, the best configuration found the correct syntactic analogy in 68.9% of cases.

The GloVe embeddings improved the results on syntactic analogies to 69.3% [28]. A much more significant improvement was reported for semantic analogies. They also outperform a variety of other vectorization methods.

In [24], a simple recurrent neural network was trained with a language modeling objective. The word representation is taken from the input layer. The evaluation in [23] shows that Word2Vec performs better in the syntactic analogy task. This observation is surprising, because representations from RNNs were proven effective in transfer to other syntactic tasks (we elaborate on that in Sections 4.2 and 4.3). We think that possible explanations could be: 1. the techniques of RNN training have crucially improved in recent years; 2. the syntactic analogy focuses on particular words, while for other syntactic tasks the context is more important.

4.2 Part of Speech Tagging

Measuring to what extent a linguistic feature such as POS is captured in word representations is usually performed by a method called probing. In probing, the parameters of the pretrained network are fixed, the output word representations are computed as in inference mode and then fed to a simple neural layer. Only this simple layer is optimized for the new task.

The number of probing experiments rose with the advent of multilayer^2 RNNs trained for language modeling and machine translation.

^2 Layer numbering in this work: we number layers starting from one for the layer closest to the input. Please note that original papers may use different numbering.

Belinkov et al. [3] probe a recurrent neural machine translation (NMT) system with four layers to predict part of speech tags (along with morphological features). They use Arabic, Hebrew, French, German, and Czech to English pairs. They observe that adding a character-based representation computed by a convolutional neural network in addition to the word-embedding input is beneficial, especially for morphologically rich languages.

In a subsequent study [4], the source language of translation is English and the experiments are conducted solely for this language. It is noted that the most morphosyntactic representation is usually obtained in the middle layers of the network.

The influence of using a particular objective for pre-training an RNN model is comprehensively analyzed by Blevins et al. [5]. They pre-train models on four objectives: syntactic parsing, semantic role labeling, machine translation, and language modeling. The former two objectives may reveal morphosyntactic information to a larger extent than the other settings mentioned here. In particular, the probe of the RNN syntactic parser achieves near-perfect accuracy in part of speech tagging.

The introduction of ELMo [29] brought a remarkable advancement in transfer learning from the RNN language model to a variety of other NLP tasks. The authors examined the POS capabilities of the representations and compared the results with the neural machine translation system CoVe [22], which also uses an RNN architecture.

Zhang et al. [39] perform further experiments with CoVe and ELMo. They demonstrate that language modeling systems are better suited to capture morphology and syntax in the hidden states than machine translation, if comparable amounts of data are used to train both systems. Moreover, the corpora for language modeling are typically more extensive than for machine translation, which can further improve the results.

Another comprehensive evaluation of the morphological and syntactic capabilities of language models was conducted by Liu et al. [17]. Probing was applied to a language model based on the Transformer architecture (BERT) and compared with ELMo and static word embeddings (Word2Vec). They observe that the hidden states of the Transformer do not demonstrate a major increase in probed POS accuracy over the RNN model, even though it is more complex and has a larger number of parameters.

POS tag probing was also performed for languages other than English. For instance, Musil [25] trains translation systems (with RNN and Transformer architectures) from Czech to English, examines the learned input embeddings of the models, and compares them to a Word2Vec model trained on Czech.
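The probing setup described in this section can be sketched as follows, assuming scikit-learn; the vectors and tags are random stand-ins for representations extracted from a frozen pre-trained model and a POS-annotated corpus.

```python
# Sketch of POS-tag probing: a simple classifier is trained on frozen word
# vectors; the pre-trained network itself is never updated.
# Assumes scikit-learn; the arrays below stand in for pre-computed data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for contextual vectors taken from a frozen model (inference mode)
# and their gold POS tags; in a real probe these come from an annotated corpus.
train_vectors = rng.normal(size=(1000, 768))
train_tags = rng.integers(0, 17, size=1000)       # e.g. 17 UPOS classes
test_vectors = rng.normal(size=(200, 768))
test_tags = rng.integers(0, 17, size=200)

probe = LogisticRegression(max_iter=1000)          # the only trained component
probe.fit(train_vectors, train_tags)
accuracy = probe.score(test_vectors, test_tags)    # correct tags / all tokens
print(f"POS probing accuracy: {accuracy:.3f}")
```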
Figure 3: Accuracy of POS tag probing from RNN representations by the pre-training objective. (a) Neural machine translation compared with language modeling pre-training; (b) neural machine translation compared with auto-encoder pre-training. Data points are taken from Blevins et al. 2018 [5], Peters et al. 2018 [29], Zhang and Bowman 2018 [39], Belinkov et al. 2017a [3], and Belinkov et al. 2017b [4].
Figure 4: Accuracy of POS tag probing from RNN latent vectors compared with static word embeddings. Data points are taken from Belinkov et al. 2017b [4], Blevins et al. 2018 [5], Musil 2019 [25], and Liu et al. 2019 [17].

In Figures 3 and 4, we present a comparison of different settings for POS tag probing. Each point denotes a pair of results obtained in the same paper and on the same dataset, but with different types of embeddings or pretraining objectives. Therefore, we can observe that the setting plotted on the y-axis is better than the x-axis setting if the points lie above the identity function (red dashed line). We cannot say whether a method represented by another point performs better, as the evaluation settings differ.

Figure 4 clearly shows that RNN contextualization helps in part of speech tagging. As expected, the information about neighboring tokens is essential to predict the morphosyntactic functions of words correctly. This is especially true for homographs, which can have different parts of speech in different places in the text.

The influence of the RNN's pre-training task is presented in Figure 3. Machine translation captures much better POS information than auto-encoders, which can be interpreted as translation from and to the same language. It is likely that the latter task is straightforward and therefore does not require encoding morphosyntax in the latent space. The difference between the results of machine translation and language modeling is small. Zhang et al. [39] show that using a larger corpus for pre-training improves the POS accuracy. The main advantage of language models is that monolingual data is much easier to obtain than the parallel sentences necessary to train a machine translation system.

4.3 Syntactic Structure Induction

Extraction of dependency structure is more demanding because, instead of a prediction for single tokens, every pair of words needs to be evaluated.

Blevins et al. [5] propose a feed-forward layer on top of a frozen RNN representation to predict whether a dependency tree edge connects a pair of tokens. They concatenate the vector representations of the two words and their element-wise product. Such a representation is fed as an input to the binary classifier. It only looks at a pair of tokens at a time; therefore, the predicted edges may not form a valid tree.

Another approach, induction of whole syntactic structures from latent representations, was proposed by Hewitt and Manning [12]. Their syntactic probing is based on training a matrix which is used to transform the output of the network's layers (they use BERT and ELMo). The objective of the probing is to approximate dependency tree distances between tokens^3 by the L2 norm of the difference of the transformed vectors. Probing thus produces approximate syntactic pairwise distances for each pair of tokens. The minimum spanning tree algorithm is then used on the distance matrix to find the undirected dependency tree. The best configuration employs the 15th layer of BERT large and induces a treebank with 82.5% UAS on the Penn Treebank with Stanford Dependency annotation (relation directions and punctuation were disregarded in the experiments). The result for BERT is significantly higher than for ELMo, which gave 77.0% when the first layer was probed.

^3 Tree distance is the length of the tree path between two tokens.
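A simplified sketch of this distance-probe idea follows, assuming NumPy and SciPy; the probe matrix here is random rather than trained, so the decoded tree is only illustrative.

```python
# Simplified sketch of a structural distance probe: a matrix B maps frozen
# hidden states to a space where squared L2 distances approximate tree
# distances; an undirected tree is then decoded with a minimum spanning tree.
# B is random here; in the actual probe it is trained on a treebank.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 768))        # stand-in for one sentence's vectors
B = rng.normal(size=(128, 768)) * 0.01    # probe matrix (trained in practice)

transformed = hidden @ B.T                            # (tokens, probe_dim)
diff = transformed[:, None, :] - transformed[None, :, :]
distances = (diff ** 2).sum(-1)                       # predicted squared tree distances

# Decode the undirected tree: keep the edges with the smallest predicted distances.
tree = minimum_spanning_tree(distances).toarray()
n = len(hidden)
edges = [(i, j) for i in range(n) for j in range(n) if tree[i, j] > 0]
print(edges)   # undirected edges; compare against gold edges to compute UUAS
```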
The paper also describes an alternative method of approximating the syntactic depth of a word by the L2 norm of its latent vector multiplied by a trainable matrix. The estimated depths allow prediction of the root of a sentence with 90.1% accuracy when the representation from the 16th layer of BERT large is probed.

4.4 Multilingual Representations

The subsequent paper by Chi et al. [6] applies the setting from [12] to the multilingual language model mBERT. They train syntactic distance probes on 11 languages and compare the UAS of induced trees in four scenarios: 1. training and evaluating on the same language; 2. training on a single language, evaluating on a different one; 3. training on all languages except the evaluation one; 4. training on all languages, including the evaluation one. They demonstrate that the transfer is effective, as the results in all the configurations outperform the baselines.^4 Even in the hardest case – zero-shot transfer from just one language – the result is at least 6.9 percentage points above the baselines (for Chinese). Nevertheless, for all the languages, no transfer-learning setting can beat training and evaluating a probe on the same language.

^4 There are two baselines: a right-branching tree and probing on randomly initialized mBERT without pretraining.

The paper also includes an analysis of intrinsic features of BERT's vectors transformed by the probe. Noticeably, the vector differences between the representations of words connected by a dependency relation are clustered by relation labels, see Figure 5.

Figure 5: Two-dimensional t-SNE visualization of probed mBERT embeddings from [6]. Analysis of the clusters shows that embeddings encode information about the type of dependency relations and, to a lesser extent, language.

Multilingual BERT embeddings are also analyzed by Wang et al. [36]. They show that even for the multilingual vectors, the results can be improved by projecting vector spaces across languages. They use the Biaffine Graph-based Parser by Dozat and Manning [10], which consists of multiple RNN layers. Therefore, the experiment is not strictly comparable with probing, as most of the syntactic information is captured by the parser and not by the embeddings. The article compares different types of vector representations fed as an input to the parser. It is demonstrated that a cross-lingual transformation of mBERT embeddings improves the results significantly in the LAS of a parser trained on English and evaluated on 14 languages (including English); on average, from 60.53% to 63.54%. In comparison to other cross-lingual representations, the proposed method outperforms transformed static embeddings (FastText with SVD) and also slightly outperforms contextual embeddings (XLM).

5 Syntax in Transformer's Attention Matrices

Besides the vector representations of individual tokens, the Transformer architecture offers another representation with a possible syntactic interpretation – the weights of the self-attention heads. In each head, information can flow from each token to any other one. These connections may be easily analyzed and compared to syntactic relations proposed by linguists. In this section, we summarize different approaches to extracting syntax from attention. We present methods for both dependency and constituency structures.

5.1 Dependency Trees

Raganato and Tiedemann [31] induce dependency trees from the self-attention matrices of a neural machine translation encoder. They use the maximum spanning tree algorithm to connect pairs of tokens with high attention. Gold root information is used to determine the direction of the edges. Trees extracted in this way are generally worse than the right-branching baseline (35.08% UAS on PUD) and outperform it slightly only in a few heads. The maximum UAS is obtained when a dependency structure is induced from one head of the 5th layer of the English-to-Chinese encoder – 38.87% UAS. Nevertheless, their approach assumes that the whole syntactic tree can be induced from just one attention head.
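The following sketch illustrates tree induction from a single attention head with a (maximum) spanning tree, in the spirit of this approach, assuming NumPy and SciPy; the attention matrix is a toy example and the gold-root direction step is omitted, so the result is an undirected tree.

```python
# Sketch: induce an undirected dependency tree from one self-attention head
# by a maximum spanning tree over (symmetrized) attention weights.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

A = np.array([[0.05, 0.80, 0.10, 0.05],   # toy attention matrix for 4 tokens
              [0.30, 0.05, 0.60, 0.05],
              [0.10, 0.20, 0.05, 0.65],
              [0.10, 0.15, 0.70, 0.05]])

sym = A + A.T
# Convert to positive "costs" so that larger attention means smaller cost;
# scipy's minimum spanning tree then yields the maximum spanning tree
# over attention. Zero entries (the diagonal) are treated as missing edges.
costs = sym.max() + 1e-6 - sym
np.fill_diagonal(costs, 0.0)

tree = minimum_spanning_tree(costs).toarray()
edges = sorted((min(i, j), max(i, j)) for i, j in zip(*np.nonzero(tree)))
print(edges)   # candidate undirected edges, to be scored with UAS/UUAS
```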
Table 1: Summary of syntactic properties observed in Transformer's self-attention heads.

Research | Transformer model | Type of tree | Syntactic evaluation | Evaluation data | Percentage of syntactic heads
Raganato and Tiedemann 2019 [31] | NMT encoder (6 layers, 8 heads) | Dependency | Tree induction | PUD [27] | 0%–8%^5
Vig and Belinkov 2019 [34] | LM (GPT-2) | Dependency | Dependency Alignment | Wikipedia (automatically annotated) | –
Clark et al. 2019 [7] | LM (BERT) | Dependency | Dependency Accuracy, Tree induction | WSJ Penn Treebank [20] | –
Voita et al. 2019 [35] | NMT encoder (6 layers, 8 heads) | Dependency | Dependency Accuracy | WMT, OpenSubtitles [16] (both automatically annotated) | 15%–19%
Limisiewicz et al. 2020 [15] | LMs (BERT, mBERT) | Dependency | Dependency Accuracy, Tree induction | PUD [27], EuroParl [14] (automatically annotated) | 46%
Mareček and Rosa 2019 [21] | NMT encoder (6 layers, 16 heads) | Constituency | Tree induction | EuroParl [14] (automatically annotated) | 19%–33%
Kim et al. 2019 [13] | LMs (BERT, GPT-2, RoBERTa, XLNet) | Constituency | Tree induction | WSJ Penn Treebank [20], MNLI [37] | –

^5 A head is syntactic when the tree extracted from it surpasses the right-branching chain in terms of UAS. This is a strong baseline for syntactic trees in English; thus only a few heads are recognized as syntactic.

Recent articles have focused on the analysis of features and classification of the Transformer's self-attention heads. Vig and Belinkov [34] apply multiple metrics to examine the properties of attention matrices computed in a unidirectional language model (GPT-2 [30]). They show that in some heads the attention concentrates on tokens representing specific POS tags, and that pairs of tokens attend more often to one another if an edge in the dependency tree connects them, i.e., dependency alignment is high. They observe that the strongest dependency alignment occurs in the middle layers of the model – the 4th and 5th. They also point out that different dependency types (labels) are captured in different places of the model. Attention in the upper layers aligns more with subject relations, whereas in the lower layers it aligns with modifying relations, such as auxiliaries, determiners, conjunctions, and expletives.

Voita et al. [35] also observed alignment with dependency relations in the encoders of neural machine translation systems from English to Russian, German, or French. They evaluated dependency accuracy for four dependency labels: noun subject, direct object, adjective modifier, and adverbial modifier. They separately address the cases where a verb attends to a dependent subject and where a subject attends to its governing verb. Heads with more than a 10% improvement over a positional baseline are identified as syntactic.^6 Such heads are found in all encoder layers except the first one. In further experiments, the authors propose an algorithm to prune heads from the model with a minimal decrease in translation performance. During pruning, the share of syntactic heads rises from 17% in the original model to 40% when 75% of the heads are cut out, while the change in translation score is negligible. These results support the claim that the model's ability to capture syntax is essential to its performance in non-syntactic tasks.

^6 In the positional baseline, the most frequent offset is added to the index of the relation's dependent/governor to find its governor/dependent, e.g., for adjective-to-noun relations the most frequent offset is +1 in English.

A similar evaluation of dependency accuracy for the BERT language model was conducted by Clark et al. [7]. They identify syntactic heads that significantly outperform a positional baseline for the following labels: prepositional object, determiner, direct object, possession modifier, auxiliary passive, clausal complement, marker, phrasal verb particle. The syntactic heads are found in the middle layers (4th to 8th). However, there is no single head that would capture the information for all the relations.

In another experiment, Clark et al. [7] induce a dependency tree from attentions. Instead of extracting a structure from each head [31], they use probing to find a weighted average of all heads. The maximum spanning tree algorithm is used to induce the dependency structure from the average. This approach produces trees with 61% UAS and can be improved to 77% by making the weights dependent on static word representations (fixed GloVe vectors). Both numbers are significantly higher than the right-branching baseline of 27%.

A related analysis for English (BERT) and the multilingual variant (mBERT) was conducted by Limisiewicz et al. [15]. We observed that the information about one dependency type is split across many self-attention heads, and in other cases the opposite happens – many heads have the same syntactic function. We extract labeled dependency trees from the averaged heads, achieving 52% UAS, and show that in the multilingual model (mBERT) specific relations (noun subject, determiner) are found in the same heads across typologically similar languages.
Figure 6: Self-attentions in particular heads of a language model (BERT) align with the dependency relations adjective modifier (AMOD) and object (OBJ). The gold relations are marked with Xs.

5.2 Constituency Trees

There are fewer papers devoted to deriving constituency syntax tree structures.

Mareček and Rosa [21] examined the encoders of machine translation systems for translation between English, French, and German. We observed that in some heads, stretches of words attend to the same token, forming shapes similar to balustrades (Figure 7). Furthermore, those stretches usually overlap with syntactic phrases. This notion is employed in a new method for constituency tree induction. In the algorithm, the weights for each stretch of tokens are computed by summing the attention focused on the balustrades, and a constituency tree is then induced with the CKY algorithm [26]. As a result, we produce trees that achieve up to a 32.8% F1 score for English sentences, 43.6% for German, and 44.2% for French.^7 The results can be improved by selecting syntactic heads and using only them in the algorithm. This approach requires a sample of 100 annotated sentences for head selection and raises F1 by up to 8.10 percentage points in English.

^7 The evaluation was done on 1000 sentences for each language, parsed with the supervised Stanford Parser.

Figure 7: Balustrades observed in the NMT encoder tend to overlap with syntactic phrases.

The extraction of constituency trees from language models was described by Kim et al. [13]. They present a comprehensive study that covers nine types of pre-trained networks: BERT (base, large), GPT-2 [30] (original, medium), RoBERTa [19] (base, large), and XLNet [38] (base, large). Their approach is based on computing a distance between each pair of subsequent words. In each step, they branch the tree at the place where the distance is the highest. The authors try three distance measures on the vector outputs of the encoder layers (cosine, L1, and L2 distances for pairs of vectors) and two distance measures on the distributions of a token's attention (Jensen-Shannon and Hellinger distances for pairs of distributions). In the former case, distances are computed only per layer, and in the latter case for each head and for the average of heads in one layer. The best setting achieves a 40.1% F1 score on the WSJ Penn Treebank; it uses XLNet-base and the Hellinger distance on averaged attentions in the 7th layer. Generally, attention distribution distances perform better than vector ones. The authors also observe that models trained with a regular language modeling objective (i.e., next word prediction in GPT-2, XLNet) capture syntax better than masked language models (BERT, RoBERTa). In line with the previous research, the middle layers tend to be more syntactic.
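A small sketch of this top-down splitting procedure; the adjacent-word distances are invented for illustration, whereas in [13] they come from hidden states or attention distributions of a pre-trained model.

```python
# Sketch of top-down constituency splitting: the sentence is recursively
# split at the adjacent word pair with the largest distance.
def split_tree(words, gaps):
    """Build a nested-list binary tree over `words` from adjacent-pair distances."""
    if len(words) <= 1:
        return words[0]
    k = max(range(len(gaps)), key=gaps.__getitem__)   # largest distance = split point
    left = split_tree(words[:k + 1], gaps[:k])
    right = split_tree(words[k + 1:], gaps[k + 1:])
    return [left, right]

words = ["the", "old", "dog", "chased", "a", "cat"]
gaps = [0.2, 0.3, 0.9, 0.4, 0.1]   # one (invented) distance per adjacent word pair
print(split_tree(words, gaps))
# [[['the', 'old'], 'dog'], ['chased', ['a', 'cat']]]
```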
5.3 Syntactic Information across Layers

Figure 8: Relative syntactic information across attention models and layers. The values are normalized so that the best layer for each method has the value 1.0. The methods A), B), C), and G) show undirected UAS of trees extracted by probing the n-th layer [12, 6]. The method D) shows the dependency alignment averaged across all heads in each layer [34]. The methods E) and F) show the UAS of trees induced from attention heads by the maximum spanning tree algorithm [31, 15]. The results for the best layer (corresponding to the value 1.0 in the plot) are: A) 82.5; B) 79.8; C) 80.1; D) 22.3; E) 24.3; F) en2cs: 23.9, en2de: 20.9, en2et: 22.1, en2fi: 24.0, en2ru: 22.4, en2tr: 17.5, en2zh: 21.6; G) 77.0.
Figure 8 summarizes the evaluation of syntactic information across layers for different approaches. In the Transformer-based language models BERT, mBERT, and GPT-2, the middle layers are the most syntactic. In neural machine translation models, the top layers of the encoder are the most syntactic. However, it is important to note that the NMT Transformer encoder is only the first half of the whole translation architecture, and therefore the most syntactic layers are, in fact, in the middle of the process. In the RNN language model (ELMo), the first layer is more syntactic than the second one.

We conjecture that the initial Transformer layers capture simple relations (e.g., attending to the next or previous token) and the last layers mostly capture task-specific information. Therefore, they are less syntactic.

We also observe that in supervised probing [12, 6], better results are obtained from the initial and top layers than in unsupervised structure induction [31, 15], i.e., the distribution across layers is smoother.

6 Conclusion

In this overview, we surveyed how syntactic structures are latently learned by neural models trained for natural language processing tasks. We have compared multiple approaches from the literature and described the features that affect the ability to capture syntax. The following aspects tend to improve the performance on syntactic tasks such as POS tagging:

1. Using contextual embeddings from RNNs or Transformers outperforms static word embeddings (Word2Vec, GloVe).

2. Pretraining on tasks with masked input (language modeling or machine translation) produces better syntactic representations than auto-encoding.

3. The advantage of language modeling over machine translation is the fact that larger corpora are available for pretraining.

Our meta-analysis of latent states showed that the most syntactic representations can be found in the middle layers of the models. They tend to capture more complex relations than the initial layers, and the representations are less dependent on the pretraining objectives than in the top layers.

We have shown to what extent systems trained for a non-syntactic task can learn grammatical structures. The question we leave for further research is whether providing explicit syntactic information to the model can improve its performance on other NLP tasks.

Acknowledgments

This work has been supported by grant 18-02196S of the Czech Science Foundation. It has been using language resources and tools developed, stored, and distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018101).
References

[1] Felipe Almeida and Geraldo Xexéo. Word embeddings: A survey. CoRR, abs/1901.09069, 2019.

[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2015.

[3] Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Vancouver, Canada, July 2017. Association for Computational Linguistics.

[4] Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Taipei, Taiwan, November 2017. Asian Federation of Natural Language Processing.

[5] Terra Blevins, Omer Levy, and Luke Zettlemoyer. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[6] Ethan A. Chi, John Hewitt, and Christopher D. Manning. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online, July 2020. Association for Computational Linguistics.

[7] Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. What does BERT look at? An analysis of BERT's attention, 2019.

[8] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

[10] Timothy Dozat and Christopher D. Manning. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.

[11] Zellig Harris. Distributional structure. Word, 10(23):146–162, 1954.

[12] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In NAACL-HLT, 2019.

[13] Taeuk Kim, Jihun Choi, Daniel Edmiston, and Sang-goo Lee. Are pre-trained language models aware of phrases? Simple but strong baselines for grammar induction. In International Conference on Learning Representations, January 2020.

[14] Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. 2004.

[15] Tomasz Limisiewicz, Rudolf Rosa, and David Mareček. Universal Dependencies according to BERT: both more specific and more general. ArXiv, abs/2004.14620, 2020.

[16] Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 2018. European Language Resources Association (ELRA).

[17] Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In NAACL-HLT, 2019.

[18] Qi Liu, Matt J. Kusner, and Phil Blunsom. A survey on contextual embeddings. ArXiv, abs/2003.07278, 2020.

[19] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[20] Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[21] David Mareček and Rudolf Rosa. From balustrades to Pierre Vinken: Looking for syntax in transformer self-attentions. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 263–275, Florence, Italy, August 2019. Association for Computational Linguistics.

[22] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6297–6308, 2017.

[23] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, July 2013.

[24] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics.

[25] Tomáš Musil. Examining structure of word embeddings with PCA. In Text, Speech, and Dialogue, pages 211–223. Springer International Publishing, 2019.

[26] H. Ney. Dynamic programming parsing for context-free grammars in continuous speech recognition. IEEE Transactions on Signal Processing, 39(2):336–340, 1991.

[27] Joakim Nivre, Željko Agić, Lars Ahrenberg, et al. Universal Dependencies 2.0 – CoNLL 2017 shared task development and test data, 2017. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

[28] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[29] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

[30] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019.

[31] Alessandro Raganato and Jörg Tiedemann. An analysis of encoder representations in transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 287–297, Brussels, Belgium, November 2018. Association for Computational Linguistics.

[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5998–6008, 2017.

[33] Jesse Vig. A multiscale visualization of attention in the transformer model. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 3: System Demonstrations, pages 37–42. Association for Computational Linguistics, 2019.

[34] Jesse Vig and Yonatan Belinkov. Analyzing the structure of attention in a transformer language model. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 63–76, Florence, Italy, August 2019. Association for Computational Linguistics.

[35] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5797–5808, Florence, Italy, July 2019. Association for Computational Linguistics.

[36] Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, and Ting Liu. Cross-lingual BERT transformation for zero-shot dependency parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.

[37] Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana, June 2018. Association for Computational Linguistics.

[38] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.

[39] Kelly W. Zhang and Samuel R. Bowman. Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, November 2018.