<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Syntax Representation in Word Embeddings and Neural Networks - A Survey</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tomasz</forename><surname>Limisiewicz</surname></persName>
							<email>limisiewicz@ufal.mff.cuni.cz</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Institute of Formal and Applied Linguistics</orgName>
								<orgName type="department" key="dep2">Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">David</forename><surname>Mareček</surname></persName>
							<email>marecek@ufal.mff.cuni.cz</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Institute of Formal and Applied Linguistics</orgName>
								<orgName type="department" key="dep2">Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Syntax Representation in Word Embeddings and Neural Networks - A Survey</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CCF46114AC4375288A95666A8F50FCDC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Neural networks trained on natural language processing tasks capture syntax even though it is not provided as a supervision signal. This indicates that syntactic analysis is essential to the understanding of language in artificial intelligence systems. This overview paper covers approaches to evaluating the amount of syntactic information included in the representations of words for different neural network architectures. We mainly summarize research on English monolingual data for language modeling tasks and on multilingual data for neural machine translation systems and multilingual language models. We describe which pre-trained models and representations of language are best suited for transfer to syntactic tasks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Modern methods of natural language processing (NLP) are based on complex neural network architectures, where language units are represented in a metric space <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b27">28,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b29">30]</ref>. Such a phenomenon allows us to express linguistic features (i.e., morphological, lexical, syntactic) mathematically.</p><p>The methods of obtaining such representations and their interpretations have been described in multiple overview works. Almeida and Xexéo surveyed different types of static word embeddings <ref type="bibr" target="#b0">[1]</ref>, and Liu et al. <ref type="bibr" target="#b17">[18]</ref> focused on contextual representations found in the most recent neural models. Belinkov and Glass <ref type="bibr" target="#b3">[4]</ref> surveyed the strategies of interpreting latent representations. To the best of our knowledge, we are the first to focus on the syntactic and morphological abilities of word representations. We also cover the latest approaches, which go beyond the interpretation of latent vectors and analyze the attentions present in state-of-the-art Transformer models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Vector Representations of Words</head><p>This section introduces several types of architectures that we will analyze in this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Static Word Embeddings</head><p>In the classical methods of language representation, each word is assigned a vector regardless of its current context. In Latent Semantic Analysis <ref type="bibr" target="#b7">[8]</ref>, the representation was obtained by counting word frequencies across documents on distinct subjects.</p><p>In more recent approaches, a shallow neural network is used to predict each word based on its context (Word2Vec <ref type="bibr" target="#b22">[23]</ref>) or to approximate the frequency of co-occurrence for a pair of words (GloVe <ref type="bibr" target="#b27">[28]</ref>). One explanation of the effectiveness of these algorithms is the distributional hypothesis <ref type="bibr" target="#b10">[11]</ref>: "words that occur in the same contexts tend to have similar meanings".</p></div>
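To make the contrast with contextual models concrete, the following minimal sketch trains static skip-gram embeddings. The use of the gensim library and the toy sentences are assumptions of this illustration, not part of the surveyed work; in practice, these models are trained on very large corpora.

```python
# Minimal sketch: training static word embeddings (Word2Vec skip-gram).
# The toy corpus stands in for the large text collections used in practice.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
]

# sg=1 selects the skip-gram variant: predict context words from the center word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]                      # one static vector per word type
print(model.wv.most_similar("cat", topn=2))   # neighbors in the embedding space
```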
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Contextual Word Vectors in Recurrent Networks</head><p>The main disadvantage of static word embeddings is that they do not take the context of words into account. This is especially an issue for languages rich in words that have multiple meanings.</p><p>The contextual embeddings introduced in <ref type="bibr" target="#b28">[29]</ref> and <ref type="bibr" target="#b21">[22]</ref> are able to encode both words and their contexts. They are based on recurrent neural networks (RNNs) and are typically trained on language modeling or machine translation tasks using large text corpora. The outputs of the RNN layers are context-dependent representations that have been shown to perform well when used as inputs for other NLP tasks with much less training data available.</p><p>Another improvement in context modeling was made possible by the attention mechanism <ref type="bibr" target="#b1">[2]</ref>. It allows passing information from the most relevant parts of the RNN encoder, instead of using only the contextual representation of the last token.</p></div>
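As an illustration of how such context-dependent vectors arise, the sketch below runs tokens through a two-layer bidirectional LSTM; all dimensions and the random inputs are placeholder assumptions of this example.

```python
# Sketch: context-dependent token vectors from a bidirectional RNN.
# The same word type receives different vectors in different sentences.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.LSTM(emb_dim, hidden, num_layers=2, bidirectional=True, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 7))  # one sentence of 7 token ids
states, _ = rnn(embed(token_ids))                 # shape: (1, 7, 2 * hidden)
# states[0, i] is the contextual representation of the i-th token; such
# vectors are the features transferred to other NLP tasks.
```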
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Contextual Representation in Transformers</head><p>The most recent and widely used architecture is the Transformer <ref type="bibr" target="#b31">[32]</ref>. It consists of several (6 to 24) layers, and each token position in each layer can attend to any position in the previous layer using a self-attention mechanism. Training such an architecture can be easily parallelized since individual tokens can be processed independently; their positions are encoded within the input embeddings. An example visualization of the attention distribution computed in a Transformer trained for language modeling (BERT <ref type="bibr" target="#b8">[9]</ref>) is presented in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>In addition to vectors, the Transformer includes latent representations in the form of self-attention weights, which are two-dimensional matrices. We summarize the research on the syntactic properties of attention weights in Section 5.</p></div>
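The self-attention weights discussed throughout Section 5 can be reproduced in a few lines. The sketch below computes scaled dot-product attention for a single head; the random projection matrices are placeholder assumptions standing in for learned parameters.

```python
# Sketch: the self-attention matrix of one Transformer head.
# Every token attends to every token, giving an N x N matrix of weights.
import torch
import torch.nn.functional as F

N, d = 7, 64                                 # sequence length, head dimension
x = torch.randn(N, d)                        # token representations (toy values)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v
A = F.softmax(q @ k.T / d ** 0.5, dim=-1)    # attention weights; rows sum to 1
contextualized = A @ v                       # weighted mixture of value vectors
```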
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Measures of Syntactic Information</head><p>This section describes the metrics used to evaluate the syntactic information captured by word embeddings and latent representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Syntactic Analogies</head><p>In the recent revival of word embeddings <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b27">28]</ref>, a strong focus was put on examining the phenomenon of encoding analogies in multidimensional space. That is to say, the shift vector between pairs of analogous words is approximately constant, e.g., the pairs drinking - drank, swimming - swam in Figure <ref type="figure" target="#fig_1">2</ref>.</p><p>Syntactic analogies of this type are particularly relevant for this overview. They include the following relations: adjective - adverb; singular - plural; adjective - comparative - superlative; verb - present participle - past participle. Syntactic analogy is usually evaluated on the Google Analogy Test Set <ref type="bibr" target="#b22">[23]</ref>.<ref type="foot" target="#foot_0">1</ref> An evaluation example consists of two word pairs represented by the embeddings: (v 1 , v 2 ), (u 1 , u 2 ). We compute the analogy shift vector as the difference between the embeddings of the first pair, s = v 2 − v 1 . The result is positive if the nearest word embedding to the vector u 1 + s is u 2 .</p><formula xml:id="formula_0">WA = |{(v 1 , v 2 , u 1 , u 2 ) : u 2 ≈ u 1 + v 2 − v 1 }| / |{(v 1 , v 2 , u 1 , u 2 )}| (1)</formula></div>
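A minimal sketch of the evaluation in Eq. (1) follows; the dictionary-based embedding lookup and the helper names are assumptions of this example. As is standard for the Google Analogy Test Set, the three query words are excluded from the nearest-neighbor search.

```python
# Sketch of the syntactic analogy test (Eq. 1): u2 should be the nearest
# neighbor of u1 + (v2 - v1) in the embedding space.
import numpy as np

def nearest(emb, query, exclude):
    # cosine similarity between the query vector and every candidate word
    cos = {w: query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
           for w, v in emb.items() if w not in exclude}
    return max(cos, key=cos.get)

def analogy_accuracy(emb, quadruples):
    correct = 0
    for w1, w2, w3, w4 in quadruples:   # e.g. drinking : drank = swimming : swam
        shift = emb[w2] - emb[w1]       # the analogy shift vector s
        if nearest(emb, emb[w3] + shift, exclude={w1, w2, w3}) == w4:
            correct += 1
    return correct / len(quadruples)
```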
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Sequence Tagging</head><p>Sequence tagging is a multiclass classification problem. The aim is to predict the correct tag for each token of a sequence. A typical example is part-of-speech (POS) tagging. The accuracy evaluation is straightforward: the number of correctly assigned tags is divided by the number of tokens.</p></div>
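The accuracy computation is a one-liner; the sketch below is given only to fix the convention that every token counts exactly once.

```python
# Tagging accuracy: correctly assigned tags divided by the number of tokens.
def tagging_accuracy(gold_tags, predicted_tags):
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

print(tagging_accuracy(["DET", "NOUN", "VERB"], ["DET", "NOUN", "ADJ"]))  # 2/3
```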
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Syntactic Structure Prediction</head><p>The inference of reasonable syntactic structures from word representations is the most challenging task covered in our survey. There are attempts to predict both dependency <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b30">31,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b6">7]</ref> and constituency trees <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b12">13]</ref>. Dependency trees are evaluated using the unlabeled attachment score (UAS) or its undirected variant (UUAS):</p><formula xml:id="formula_1">UAS = #correctly_attached_words / #all_words<label>(2)</label></formula><p>The equation for the labeled attachment score (LAS) is the same, but it additionally requires predicting the correct dependency label for each edge.</p><p>For constituency trees, we define precision (P) and recall (R) over correctly predicted phrases:</p><formula xml:id="formula_2">P = #correct_phrases / #predicted_phrases , R = #correct_phrases / #gold_phrases<label>(3)</label></formula><p>Usually, the F1 score is reported, which is the harmonic mean of precision and recall.</p></div>
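The sketch below implements Eqs. (2) and (3) directly; representing a dependency tree as a list of head indices and a constituency tree as a set of phrase spans are assumptions of this illustration.

```python
# Sketch of the parsing metrics: UAS over head indices (Eq. 2) and
# precision/recall/F1 over constituency phrase spans (Eq. 3).
def uas(gold_heads, predicted_heads):
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)

def phrase_f1(gold_spans, predicted_spans):
    correct = len(set(gold_spans) & set(predicted_spans))
    precision = correct / len(predicted_spans)   # correct among predicted
    recall = correct / len(gold_spans)           # correct among gold
    return 2 * precision * recall / (precision + recall)
```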
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Attention's Dependency Alignment</head><p>In Section 5 we describe the examination of syntactic properties of self-attention matrices. It can be evaluated using Dependency Alignment <ref type="bibr" target="#b33">[34]</ref>, which sums the attention weights at the positions corresponding to the pairs of tokens forming a dependency edge in the tree.</p><formula xml:id="formula_3">DepAl A = ∑ (i,j)∈E A i,j / ∑ N i=1 ∑ N j=1 A i,j<label>(4)</label></formula><p>Dependency Accuracy <ref type="bibr" target="#b34">[35,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b14">15]</ref> is an alternative metric; for each dependency label, it measures how often the relation's governor/dependent is the token most attended to by the dependent/governor.</p><formula xml:id="formula_4">DepAcc l,d,A = |{(i, j) ∈ E l,d : j = arg max A i,• }| / |E l,d |<label>(5)</label></formula><p>Notation: E is the set of all dependency tree edges, and E l,d is the subset of edges with label l and direction d; i.e., in the dependent-to-governor direction, the first element of the tuple, i, is the dependent of the relation and the second element, j, is the governor. A is a self-attention matrix, and A i,• denotes the i-th row of the matrix; N is the sequence length.</p></div>
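Both metrics reduce to a few array operations. In the sketch below, edges is assumed to be a list of (dependent, governor) index pairs, already filtered to one label and direction for Eq. (5).

```python
# Sketch of Dependency Alignment (Eq. 4) and Dependency Accuracy (Eq. 5)
# for a single self-attention matrix A of shape (N, N).
import numpy as np

def dependency_alignment(A, edges):
    # share of the total attention mass that falls on dependency edges
    return sum(A[i, j] for i, j in edges) / A.sum()

def dependency_accuracy(A, edges):
    # fraction of edges whose governor is the dependent's most attended token
    hits = sum(np.argmax(A[i]) == j for i, j in edges)
    return hits / len(edges)
```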
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Morphology and Syntax in Word Embeddings and Latent Vectors</head><p>In this section, we summarize the research on the syntactic information captured by vector representations of words.</p><p>We devote significant attention to POS tagging, which is a popular evaluation objective. Even though it is a morphological task, it is highly relevant to syntactic analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Syntactic Analogies</head><p>The first wave of research on the vector representation of words focused on the statistical distribution of words across distinct topics - Latent Semantic Analysis <ref type="bibr" target="#b7">[8]</ref>. It captured statistical properties of words, yet there were no positive results in retrieving syntactic analogies or encoding syntax. The Google Analogy Test Set was released together with the popular word embedding algorithm Word2Vec <ref type="bibr" target="#b22">[23]</ref>. One of the exceptional properties of this method was its high accuracy in the analogy tasks. In particular, the best configuration found the correct syntactic analogy in 68.9% of cases.</p><p>The GloVe embeddings improved the results on syntactic analogies to 69.3% <ref type="bibr" target="#b27">[28]</ref>. A much more significant improvement was reported for semantic analogies. They also outperform a variety of other vectorization methods.</p><p>In <ref type="bibr" target="#b23">[24]</ref>, a simple recurrent neural network was trained with a language modeling objective. The word representation is taken from the input layer. The evaluation in <ref type="bibr" target="#b22">[23]</ref> shows that Word2Vec performs better in the syntactic analogy task. This observation is surprising because representations from RNNs have proven effective in transfer to other syntactic tasks (we elaborate on that in Sections 4.2 and 4.3). We think that possible explanations are: 1. the techniques of RNN training have substantially improved in recent years; 2. syntactic analogies focus on particular words, while for other syntactic tasks, the context is more important.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Part of Speech Tagging</head><p>Measuring to what extent a linguistic feature such as POS is captured in word representations is usually performed by a method called probing. In probing, the parameters of the pre-trained network are fixed, the output word representations are computed as in inference mode and then fed to a simple neural layer. Only this simple layer is optimized for the new task.</p><p>The number of probing experiments rose with the advent of multilayer<ref type="foot" target="#foot_1">2</ref> RNNs trained for language modeling and machine translation.</p><p>Belinkov et al. <ref type="bibr" target="#b2">[3]</ref> probe a recurrent neural machine translation (NMT) system with four layers to predict part-of-speech tags (along with morphological features). They use Arabic, Hebrew, French, German, and Czech to English pairs. They observe that adding a character-based representation computed by a convolutional neural network in addition to the word-embedding input is beneficial, especially for morphologically rich languages.</p><p>In a subsequent study <ref type="bibr" target="#b3">[4]</ref>, the source language of translation is English and the experiments are conducted solely for this language. It is noted that the most morphosyntactic representation is usually obtained in the middle layers of the network.</p><p>The influence of using a particular objective in pre-training an RNN model is comprehensively analyzed by Blevins et al. <ref type="bibr" target="#b4">[5]</ref>. They pre-train models on four objectives: syntactic parsing, semantic role labeling, machine translation, and language modeling. The former two objectives may reveal morphosyntactic information to a larger extent than the other settings mentioned here. In particular, the probe of the RNN syntactic parser achieves near-perfect accuracy in part-of-speech tagging.</p><p>The introduction of ELMo <ref type="bibr" target="#b28">[29]</ref> brought a remarkable advancement in transfer learning from the RNN language model to a variety of other NLP tasks. The authors examined the POS capabilities of the representations and compared the results with the neural machine translation system CoVe <ref type="bibr" target="#b21">[22]</ref>, which also uses an RNN architecture.</p><p>Zhang et al. <ref type="bibr" target="#b38">[39]</ref> perform further experiments with CoVe and ELMo. They demonstrate that language modeling systems are better suited to capture morphology and syntax in the hidden states than machine translation, if comparable amounts of data are used to train both systems. Moreover, the corpora for language modeling are typically more extensive than those for machine translation, which can further improve the results.</p><p>Another comprehensive evaluation of the morphological and syntactic capabilities of language models was conducted by Liu et al. <ref type="bibr" target="#b16">[17]</ref>. Probing was applied to a language model based on the Transformer architecture (BERT) and compared with ELMo and static word embeddings (Word2Vec). They observe that the hidden states of the Transformer do not demonstrate a major increase in probed POS accuracy over the RNN model, even though it is more complex and has a larger number of parameters.</p><p>POS tag probing was also performed for languages other than English.
For instance, Musil <ref type="bibr" target="#b24">[25]</ref> trains translation systems (with RNN and Transformer architectures) from Czech to English, examines the learned input embeddings of the models, and compares them to a Word2Vec model trained on Czech.</p><p>In Figures <ref type="figure" target="#fig_4">3 and 4</ref>, we present a comparison of different settings for POS tag probing. Each point denotes a pair of results obtained in the same paper and on the same dataset, but with different types of embeddings or pre-training objectives. We can therefore observe that the setting plotted on the y-axis is better than the x-axis setting if the points lie above the identity function (red dashed line). We cannot say whether a method represented by another point performs better, as the evaluation settings differ.</p><p>Figure <ref type="figure" target="#fig_4">4</ref> clearly shows that RNN contextualization helps in part-of-speech tagging. As expected, the information about neighboring tokens is essential to predict the morphosyntactic functions of words correctly. This is especially true for homographs, which can have different parts of speech in different places in the text.</p><p>The influence of the RNN's pre-training task is presented in Figure <ref type="figure" target="#fig_2">3</ref>. Machine translation captures POS information much better than auto-encoding, which can be interpreted as translation from and to the same language. It is likely that the latter task is straightforward and therefore does not require encoding morphosyntax in the latent space. The difference between the results of machine translation and language modeling is small. Zhang et al. <ref type="bibr" target="#b38">[39]</ref> show that using a larger corpus for pre-training improves the POS accuracy. The main advantage of language models is that monolingual data is much easier to obtain than the parallel sentences necessary to train a machine translation system.</p></div>
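The probing recipe described at the beginning of this section amounts to training one linear layer on frozen activations. In the sketch below, random tensors stand in for the pre-trained network's hidden states and the gold POS tags; all sizes are placeholder assumptions.

```python
# Sketch of POS probing: the pre-trained encoder is frozen, so its token
# representations are treated as fixed inputs; only a linear layer is trained.
import torch
import torch.nn as nn

n_tokens, dim, n_tags = 500, 768, 17
reps = torch.randn(n_tokens, dim)             # frozen hidden states (stand-in)
tags = torch.randint(0, n_tags, (n_tokens,))  # gold POS tags (stand-in)

probe = nn.Linear(dim, n_tags)                # the only trainable parameters
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(probe(reps), tags)
    loss.backward()
    optimizer.step()

accuracy = (probe(reps).argmax(-1) == tags).float().mean()
```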
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Syntactic Structure Induction</head><p>Extraction of dependency structure is more demanding because instead of a prediction for single tokens, every pair of words needs to be evaluated.</p><p>Blevins et al. <ref type="bibr" target="#b4">[5]</ref> propose a feed-forward layer on top of a frozen RNN representation to predict whether a dependency tree edge connects a pair of tokens. They concatenate the vector representations of the two words and their element-wise product. Such a representation is fed as input to the binary classifier. It only looks at a pair of tokens at a time; therefore, the predicted edges may not form a valid tree.</p><p>Another approach, induction of whole syntactic structures from latent representations, was proposed by Hewitt and Manning <ref type="bibr" target="#b11">[12]</ref>. Their syntactic probing is based on training a matrix which is used to transform the output of the network's layers (they use BERT and ELMo). The objective of the probing is to approximate dependency tree distances between tokens 3 by the L2 norm of the difference of the transformed vectors. Probing produces the approximate syntactic pairwise distances for each pair of tokens. The minimum spanning tree algorithm is then used on the distance matrix to find the undirected dependency tree. The best configuration employs the 15th layer of BERT large and induces trees with 82.5% UAS on the Penn Treebank with Stanford Dependency annotation (relation directions and punctuation were disregarded in the experiments). The result for BERT is significantly higher than for ELMo, which gave 77.0% when the first layer was probed.</p><p>The paper also describes an alternative method of approximating syntactic depth by the L2 norm of the latent vector multiplied by a trainable matrix. The estimated depths allow prediction of the root of a sentence with 90.1% accuracy when the representation from the 16th layer of BERT large is probed.</p></div>
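A condensed sketch of the structural probe idea follows: a matrix B maps frozen hidden states so that squared L2 distances between transformed vectors approximate tree distances, and a minimum spanning tree recovers the undirected structure. Here B is random; in [12] it is trained to minimize the absolute difference between predicted and gold tree distances.

```python
# Sketch of a structural probe: squared distances between transformed vectors
# approximate dependency tree distances; an MST yields the undirected tree.
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def probed_distances(H, B):
    T = H @ B.T                             # (N, rank) transformed vectors
    diff = T.unsqueeze(0) - T.unsqueeze(1)  # (N, N, rank) pairwise differences
    return (diff ** 2).sum(-1)              # predicted squared tree distances

H = torch.randn(6, 768)                       # frozen hidden states (stand-in)
B = torch.randn(64, 768, requires_grad=True)  # would be trained on |d_pred - d_tree|
D = probed_distances(H, B).detach().numpy()
tree = minimum_spanning_tree(D)               # undirected dependency tree (sparse)
```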
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Multilingual Representations</head><p>The subsequent paper by Chi et al. <ref type="bibr" target="#b5">[6]</ref> applies the setting from <ref type="bibr" target="#b11">[12]</ref> to the multilingual language model mBERT. They train syntactic distance probes on 11 languages and compare the UAS of induced trees in four scenarios: 1. training and evaluating on the same language; 2. training on a single language, evaluating on a different one; 3. training on all languages except the evaluation one; 4. training on all languages, including the evaluation one. They demonstrate that the transfer is effective, as the results in all the configurations outperform the baselines (a right-branching tree and probing on a randomly initialized mBERT without pre-training). Even in the hardest case - zero-shot transfer from just one language - the result is at least 6.9 percentage points above the baselines (for Chinese). Nevertheless, for all the languages, no transfer-learning setting can beat training and evaluating the probe on the same language.</p><p>The paper includes an analysis of intrinsic features of BERT's vectors transformed by the probe. Noticeably, the vector differences between the representations of words connected by a dependency relation are clustered by relation labels, see Figure <ref type="figure">5</ref>.</p><p>Multilingual BERT embeddings are also analyzed by Wang et al. <ref type="bibr" target="#b35">[36]</ref>. They show that even for the multilingual vectors, the results can be improved by projecting vector spaces across languages. They use the Biaffine Graph-based Parser by Dozat and Manning <ref type="bibr" target="#b9">[10]</ref>, which consists of multiple RNN layers. Therefore, the experiment is not strictly comparable with probing, as most of the syntactic information is captured by the parser, and not by the embeddings. The article compares different types of vector representations fed as input to the parser. It is demonstrated that a cross-lingual transformation of the mBERT embeddings significantly improves the LAS of a parser trained on English and evaluated on 14 languages (including English); on average, from 60.53% to 63.54%. In comparison to other cross-lingual representations, the proposed method outperforms transformed static embeddings (FastText with SVD) and also slightly outperforms contextual embeddings (XLM).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Syntax in Transformer's Attention Matrices</head><p>Besides the vector representations of individual tokens, the Transformer architecture offers another representation with a possible syntactic interpretation - the weights of the self-attention heads. In each head, information can flow from each token to any other one. These connections may be easily analyzed and compared to syntactic relations proposed by linguists. In this section, we summarize different approaches to extracting syntax from attention. We present methods both for dependency and constituency structures.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Two-dimensional t-SNE visualization of probed mBERT embeddings from <ref type="bibr" target="#b5">[6]</ref>. Analysis of the clusters shows that the embeddings encode information about the type of dependency relations and, to a lesser extent, the language.</figDesc></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Dependency Trees</head><p>Raganato and Tiedemann <ref type="bibr" target="#b30">[31]</ref> induce dependency trees from the self-attention matrices of a neural machine translation encoder. They use the maximum spanning tree algorithm to connect pairs of tokens with high attention. Gold root information is used to find the direction of the edges. Trees extracted in this way are generally worse than the right-branching baseline (35.08% UAS on PUD) and outperform it slightly only in a few heads. The maximum UAS is obtained when a dependency structure is induced from one head of the 5th layer of the English to Chinese encoder - 38.87% UAS. Nevertheless, their approach assumes that the whole syntactic tree may be induced from just one attention head.</p><p>Recent articles have focused on the analysis of features and classification of the Transformer's self-attention heads. Vig and Belinkov <ref type="bibr" target="#b33">[34]</ref> apply multiple metrics to examine the properties of attention matrices computed in a unidirectional language model (GPT-2 <ref type="bibr" target="#b29">[30]</ref>). They show that in some heads, the attentions concentrate on tokens representing specific POS tags, and that pairs of tokens attend to each other more often if an edge in the dependency tree connects them, i.e., the dependency alignment is high. They observe that the strongest dependency alignment occurs in the middle layers of the model - the 4th and 5th. They also point out that different dependency types (labels) are captured in different places in the model: attention in the upper layers aligns more with subject relations, whereas in the lower layers it aligns with modifying relations, such as auxiliaries, determiners, conjunctions, and expletives.</p><p>Voita et al. <ref type="bibr" target="#b34">[35]</ref> also observed alignment with dependency relations in the encoders of neural machine translation systems from English to Russian, German, or French. They evaluated dependency accuracy for four dependency labels: noun subject, direct object, adjective modifier, and adverbial modifier. They separately address the cases where a verb attends to a dependent subject and where a subject attends to its governing verb. The heads with more than 10% improvement over a positional baseline are identified as syntactic<ref type="foot" target="#foot_3">6</ref>. Such heads are found in all encoder layers except the first one. In further experiments, the authors propose an algorithm to prune heads from the model with a minimal decrease in translation performance. During pruning, the share of syntactic heads rises from 17% in the original model to 40% when 75% of the heads are cut out, while the change in translation score is negligible. These results support the claim that the model's ability to capture syntax is essential to its performance in non-syntactic tasks.</p><p>A similar evaluation of dependency accuracy for the BERT language model was conducted by Clark et al. <ref type="bibr" target="#b6">[7]</ref>. They identify syntactic heads that significantly outperform the positional baseline for the following labels: prepositional object, determiner, direct object, possession modifier, auxiliary passive, clausal component, marker, phrasal verb particle. The syntactic heads are found in the middle layers (4th to 8th). However, there is no single head that would capture the information for all the relations.</p><p>In another experiment, Clark et al. <ref type="bibr" target="#b6">[7]</ref> induce a dependency tree from attentions. Instead of extracting a structure from each head separately <ref type="bibr" target="#b30">[31]</ref>, they use probing to find a weighted average of all heads. The maximum spanning tree algorithm is used to induce the dependency structure from the average. This approach produces trees with 61% UAS and can be improved to 77% by making the weights dependent on static word representations (fixed GloVe vectors). 
Both numbers are significantly higher than the right-branching baseline of 27%.</p><p>A related analysis for English (BERT) and the multilingual variant (mBERT) was conducted by Limisiewicz et al. <ref type="bibr" target="#b14">[15]</ref>. We observed that the information about one dependency type is often split across many self-attention heads, and in other cases the opposite happens - many heads have the same syntactic function. We extract labeled dependency trees from the averaged heads, achieving 52% UAS, and show that in the multilingual model (mBERT), specific relations (noun subject, determiner) are found in the same heads across typologically similar languages.</p></div>
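A compact sketch of this family of methods follows: the attention matrix (from a single head, or a weighted average of heads) is symmetrized, and a maximum spanning tree connects the strongly attending token pairs. The Dirichlet toy matrix is a stand-in assumption for real attention weights.

```python
# Sketch of unsupervised tree induction from attention [31, 7]: connect token
# pairs with high attention using a maximum spanning tree (implemented as an
# MST over negated weights).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def tree_from_attention(A):
    sym = A + A.T                      # attention is not symmetric; combine both
    np.fill_diagonal(sym, 0.0)         # no self-loops
    mst = minimum_spanning_tree(-sym)  # minimize negated weight = maximize weight
    return list(zip(*mst.nonzero()))   # undirected edges as index pairs

A = np.random.dirichlet(np.ones(6), size=6)  # toy attention: rows sum to 1
print(tree_from_attention(A))
```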
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Constituency Trees</head><p>There are fewer papers devoted to deriving constituency syntactic tree structures.</p><p>Mareček and Rosa <ref type="bibr" target="#b20">[21]</ref> examined the encoder of a machine translation system for translation between English, French, and German. We observed that in some heads, stretches of words attend to the same token, forming shapes similar to balustrades (Figure <ref type="figure" target="#fig_6">7</ref>). Furthermore, those stretches usually overlap with syntactic phrases. This notion is employed in a new method for constituency tree induction. In the algorithm, the weights for each stretch of tokens are computed by summing the attention focused on the balustrades, and a constituency tree is then induced with the CKY algorithm <ref type="bibr" target="#b25">[26]</ref>. As a result, we produce trees that achieve up to a 32.8% F1 score for English sentences, 43.6% for German, and 44.2% for French. 7 The results can be improved by selecting syntactic heads and using only them in the algorithm. This approach requires a sample of 100 annotated sentences for head selection and raises the F1 score.</p><p>The extraction of constituency trees from language models was described by Kim et al. <ref type="bibr" target="#b12">[13]</ref>. They present a comprehensive study that covers several pre-trained networks: BERT (base, large), GPT-2 <ref type="bibr" target="#b29">[30]</ref> (original, medium), RoBERTa <ref type="bibr" target="#b18">[19]</ref> (base, large), and XLNet <ref type="bibr" target="#b37">[38]</ref> (base, large). Their approach is based on computing a distance between each pair of subsequent words. At each step, they branch the tree at the position where the distance is the highest. The authors try three distance measures on the vector outputs of the encoder layers (cosine, L1, and L2 distances for pairs of vectors) and two distance measures on the distributions of a token's attention (Jensen-Shannon and Hellinger distances for pairs of distributions). In the former case, distances are computed only per layer, and in the latter case for each head and for the average of heads in one layer. The best setting achieves a 40.1% F1 score on the WSJ Penn Treebank. It uses XLNet-base and the Hellinger distance on averaged attentions in the 7th layer. Generally, attention distribution distances perform better than vector ones. The authors also observe that models trained on a regular language modeling objective (i.e., next word prediction in GPT-2, XLNet) capture syntax better than masked language models (BERT, RoBERTa). In line with previous research, the middle layers tend to be more syntactic.</p></div>
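The top-down procedure of Kim et al. can be stated in a few lines. The sketch below assumes precomputed distances between adjacent words (one number per gap, e.g., a Hellinger distance between attention distributions) and recursively splits at the largest one; the toy gap values are placeholders.

```python
# Sketch of top-down constituency induction: recursively split the sentence
# at the adjacent-word pair with the largest representation distance.
import numpy as np

def induce_tree(gaps, lo, hi):
    # gaps[i] is the distance between word i and word i + 1
    if hi - lo < 1:
        return lo                                  # single-word span
    split = lo + int(np.argmax(gaps[lo:hi]))       # largest syntactic "gap"
    return (induce_tree(gaps, lo, split), induce_tree(gaps, split + 1, hi))

words = ["the", "cat", "sat", "on", "the", "mat"]
gaps = np.array([0.1, 0.9, 0.3, 0.2, 0.1])         # toy pairwise distances
print(induce_tree(gaps, 0, len(words) - 1))        # ((0, 1), (2, (3, (4, 5))))
```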
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Syntactic Information across Layers</head><p>Figure <ref type="figure" target="#fig_7">8</ref> summarizes the evaluation of syntactic information across layers for different approaches. In Transformer-based language models (BERT, mBERT, and GPT-2), the middle layers are the most syntactic. In neural machine translation models, the top layers of the encoder are the most syntactic. However, it is important to note that the NMT Transformer encoder is only the first half of the whole translation architecture, and therefore the most syntactic layers are, in fact, in the middle of the process. In the RNN language model (ELMo), the first layer is more syntactic than the second one.</p><p>We conjecture that the initial Transformer layers capture simple relations (e.g., attending to the next or previous tokens) and the last layers mostly capture task-specific information. Therefore, they are less syntactic.</p><p>We also observe that in supervised probing <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b5">6]</ref>, better results are obtained from the initial and top layers than in unsupervised structure induction <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b14">15]</ref>, i.e., the distribution across layers is smoother.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this overview, we have surveyed evidence that syntactic structures are latently learned by neural models trained for natural language processing tasks. We have compared multiple approaches and described the features that affect the ability to capture syntax. The following aspects tend to improve the performance on syntactic tasks such as POS tagging:</p><p>1. Using contextual embeddings from RNNs or Transformers outperforms static word embeddings (Word2Vec, GloVe).</p><p>2. Pre-training on tasks with masked input (language modeling or machine translation) produces better syntactic representations than auto-encoding.</p><p>3. The advantage of language modeling over machine translation is the fact that larger corpora are available for pre-training.</p><p>Our meta-analysis of latent states showed that the most syntactic representations can be found in the middle layers of the model. They tend to capture more complex relations than the initial layers, and their representations are less dependent on the pre-training objectives than those of the top layers.</p><p>We have shown to what extent systems trained for a non-syntactic task can learn grammatical structures. The question we leave for further research is whether providing explicit syntactic information to the model can improve its performance on other NLP tasks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Visualization of the attention mechanism in the Transformer architecture. It shows which parts of the text are important for computing the representation of the word "to". Created in the BertViz framework [33].</figDesc><graphic coords="2,56.69,80.50,231.02,138.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Spatial distribution of word embeddings depends on syntactic roles of words (visualization created by Ashutosh Singh).</figDesc><graphic coords="2,79.80,288.92,184.82,127.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Accuracy of POS tag probing from RNN representation by the pre-training objective.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Legend: Belinkov et al. 2017b <ref type="bibr" target="#b3">[4]</ref>, Blevins et al. 2018 <ref type="bibr" target="#b4">[5]</ref>, Musil 2019 <ref type="bibr" target="#b24">[25]</ref>, Liu et al. 2019 <ref type="bibr" target="#b16">[17]</ref>.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Accuracy of POS tag probing from RNN latent vectors compared with static word embeddings</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Self-attention in particular heads of a language model (BERT) aligns with the dependency relations adjective modifier and object. The gold relations are marked with Xs.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Balustrades observed in NMT's encoder tend to overlap with syntactic phrases.</figDesc><graphic coords="7,134.88,274.71,152.88,152.88" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Relative syntactic information across models and layers. The values are normalized so that the best layer for each method has the value 1.0. The methods A), B), C), and G) show the undirected UAS of trees extracted by probing the n-th layer <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b5">6]</ref>. The method D) shows the dependency alignment averaged across all heads in each layer <ref type="bibr" target="#b33">[34]</ref>. The methods E) and F) show the UAS of trees induced from attention heads by the maximum spanning tree algorithm <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b14">15]</ref>. The results for the best layer (corresponding to the value 1.0 in the plot) are: A) 82.5; B) 79.8; C) 80.1; D) 22.3; E) 24.3; F) en2cs: 23.9, en2de: 20.9, en2et: 22.1, en2fi: 24.0, en2ru: 22.4, en2tr: 17.5, en2zh: 21.6; G) 77.0.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="5,307.56,80.51,231.01,223.12" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Summary of syntactic properties observed in Transformer's self-attention heads.</figDesc><table><row><cell>Research</cell><cell>Transformer Model</cell><cell>Type of tree</cell><cell>Syntactic evaluation</cell><cell>Evaluation data</cell><cell>Percentage of syntactic heads</cell></row><row><cell>Raganato and Tiedemann 2019 [31]</cell><cell>NMT Encoder (6 layers, 8 heads)</cell><cell>Dependency</cell><cell>Tree induction</cell><cell>PUD [27]</cell><cell>0% - 8% 5</cell></row><row><cell>Vig and Belinkov 2019 [34]</cell><cell>LM (GPT-2)</cell><cell>Dependency</cell><cell>Dependency Alignment</cell><cell>Wikipedia (automatically annotated)</cell><cell>-</cell></row><row><cell>Clark et al. 2019 [7]</cell><cell>LM (BERT)</cell><cell>Dependency</cell><cell>Dependency Accuracy, Tree induction</cell><cell>WSJ Penn Treebank [20]</cell><cell>-</cell></row><row><cell>Voita et al. 2019 [35]</cell><cell>NMT Encoder (6 layers, 8 heads)</cell><cell>Dependency</cell><cell>Dependency Accuracy</cell><cell>WMT, OpenSubtitles [16] (both automatically annotated)</cell><cell>15% - 19%</cell></row><row><cell>Limisiewicz et al. 2020 [15]</cell><cell>LMs (BERT, mBERT)</cell><cell>Dependency</cell><cell>Dependency Accuracy, Tree induction</cell><cell>PUD [27], EuroParl [14] (automatically annotated)</cell><cell>46%</cell></row><row><cell>Mareček and Rosa 2019 [21]</cell><cell>NMT Encoder (6 layers, 16 heads)</cell><cell>Constituency</cell><cell>Tree induction</cell><cell>EuroParl [14] (automatically annotated)</cell><cell>19% - 33%</cell></row><row><cell>Kim et al. 2019 [13]</cell><cell>LMs (BERT, GPT-2, RoBERTa, XLNet)</cell><cell>Constituency</cell><cell>Tree induction</cell><cell>WSJ Penn Treebank [20], MNLI [37]</cell><cell>-</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The test set is called syntactic by its authors; nevertheless, it mostly focuses on morphological features.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Layer numbering in this work: We are numbering layers starting from one for the layer closest to the input. Please note that original papers may use different numbering.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">A head is syntactic when the tree extracted from it surpasses the right-branching chain in terms of UAS. It is a strong baseline for syntactic trees in English. Thus only a few heads are recognized as syntactic.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">In the positional baseline, the most frequent offset is added to the index of the relation's dependent/governor to find its governor/dependent; e.g., for adjective to noun relations, the most frequent offset is +1 in English.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been supported by the grant 18-02196S of the Czech Science Foundation. It has been using language resources and tools developed, stored and distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018101).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Word embeddings: A survey</title>
		<author>
			<persName><forename type="first">Felipe</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geraldo</forename><surname>Xexéo</surname></persName>
		</author>
		<idno>CoRR, abs/1901.09069</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Neural machine translation by jointly learning to align and translate</title>
		<author>
			<persName><forename type="first">Dzmitry</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<idno>CoRR, abs/1409.0473</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">What do neural machine translation models learn about morphology</title>
		<author>
			<persName><forename type="first">Yonatan</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nadir</forename><surname>Durrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fahim</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hassan</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Glass</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 55th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Vancouver, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2017-07">July 2017</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="861" to="872" />
		</imprint>
	</monogr>
	<note>: Long Papers)</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks</title>
		<author>
			<persName><forename type="first">Yonatan</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lluís</forename><surname>Màrquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hassan</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nadir</forename><surname>Durrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fahim</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Glass</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth International Joint Conference on Natural Language Processing</title>
		<title level="s">Long Papers</title>
		<meeting>the Eighth International Joint Conference on Natural Language Processing<address><addrLine>Taipei, Taiwan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-11">November 2017</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
	<note>Asian Federation of Natural Language Processing</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep RNNs encode soft hierarchical syntax</title>
		<author>
			<persName><forename type="first">Terra</forename><surname>Blevins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 56th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018-07">July 2018</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="14" to="19" />
		</imprint>
	</monogr>
	<note>: Short Papers)</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Finding universal grammatical relations in multilingual BERT</title>
		<author>
			<persName><forename type="first">Ethan</forename><forename type="middle">A</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020-07">July 2020</date>
			<biblScope unit="page" from="5564" to="5577" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">What does BERT look at? An analysis of BERT&apos;s attention</title>
		<author>
			<persName><forename type="first">Kevin</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Urvashi</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Indexing by latent semantic analysis</title>
		<author>
			<persName><forename type="first">Scott</forename><surname>Deerwester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Susan</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><forename type="middle">W</forename><surname>Furnas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><forename type="middle">K</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Harshman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="391" to="407" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming-Wei</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL-HLT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Deep biaffine attention for neural dependency parsing</title>
		<author>
			<persName><forename type="first">Timothy</forename><surname>Dozat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">5th International Conference on Learning Representations, ICLR 2017</title>
				<meeting><address><addrLine>Toulon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">April 24-26, 2017. 2017</date>
		</imprint>
	</monogr>
	<note>Conference Track Proceedings</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Distributional structure</title>
		<author>
			<persName><forename type="first">Zellig</forename><surname>Harris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Word</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">23</biblScope>
			<biblScope unit="page" from="146" to="162" />
			<date type="published" when="1954">1954</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A structural probe for finding syntax in word representations</title>
		<author>
			<persName><forename type="first">John</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL-HLT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction</title>
		<author>
			<persName><forename type="first">Taeuk</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jihun</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Edmiston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanggoo</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2020-01">January 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Europarl: A parallel corpus for statistical machine translation</title>
		<author>
			<persName><forename type="first">Philipp</forename><surname>Koehn</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">11</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Universal dependencies according to BERT: both more specific and more general</title>
		<author>
			<persName><forename type="first">Tomasz</forename><surname>Limisiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rudolf</forename><surname>Rosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Mareček</surname></persName>
		</author>
		<idno>ArXiv, abs/2004.14620</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora</title>
		<author>
			<persName><forename type="first">Pierre</forename><surname>Lison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jörg</forename><surname>Tiedemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Milen</forename><surname>Kouylekov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</title>
				<meeting>the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)<address><addrLine>Miyazaki, Japan</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA</publisher>
			<date type="published" when="2018-05">May 2018</date>
		</imprint>
	</monogr>
	<note>European Language Resources Association</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Linguistic knowledge and transferability of contextual representations</title>
		<author>
			<persName><forename type="first">Nelson</forename><forename type="middle">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonatan</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthew</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noah</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL-HLT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">A survey on contextual embeddings</title>
		<author>
			<persName><forename type="first">Qi</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><forename type="middle">J</forename><surname>Kusner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Phil</forename><surname>Blunsom</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2003.07278</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Roberta: A robustly optimized bert pretraining approach</title>
		<author>
			<persName><forename type="first">Yinhan</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Myle</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Naman</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jingfei</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mandar</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danqi</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mike</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Veselin</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Building a large annotated corpus of English: The Penn Treebank</title>
		<author>
			<persName><forename type="first">Mitchell</forename><forename type="middle">P</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Beatrice</forename><surname>Santorini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mary</forename><forename type="middle">Ann</forename><surname>Marcinkiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="313" to="330" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">From balustrades to pierre vinken: Looking for syntax in transformer self-attentions</title>
		<author>
			<persName><forename type="first">David</forename><surname>Mareček</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rudolf</forename><surname>Rosa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-08">August 2019</date>
			<biblScope unit="page" from="263" to="275" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Learned in translation: Contextualized word vectors</title>
		<author>
			<persName><forename type="first">Bryan</forename><surname>Mccann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Bradbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Caiming</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Socher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="6297" to="6308" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Greg</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013-07">July 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Linguistic regularities in continuous space word representations</title>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wen-Tau</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Zweig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Atlanta, Georgia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2013-06">June 2013</date>
			<biblScope unit="page" from="746" to="751" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Examining Structure of Word Embeddings with PCA</title>
		<author>
			<persName><forename type="first">Tomáš</forename><surname>Musil</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text, Speech, and Dialogue</title>
				<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="211" to="223" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Dynamic programming parsing for contextfree grammars in continuous speech recognition</title>
		<author>
			<persName><forename type="first">H</forename><surname>Ney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Signal Processing</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="336" to="340" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Universal dependencies 2.0 -CoNLL 2017 shared task development and test data</title>
		<author><persName><forename type="first">Joakim</forename><surname>Nivre</surname></persName></author>
		<author><persName><forename type="first">Željko</forename><surname>Agić</surname></persName></author>
		<author><persName><forename type="first">Lars</forename><surname>Ahrenberg</surname></persName></author>
		<author><persName><forename type="first">Lene</forename><surname>Antonsen</surname></persName></author>
		<author><persName><forename type="first">Maria</forename><surname>Jesus Aranzabe</surname></persName></author>
		<author><persName><forename type="first">Masayuki</forename><surname>Asahara</surname></persName></author>
		<author><persName><forename type="first">Luma</forename><surname>Ateyah</surname></persName></author>
		<author><persName><forename type="first">Mohammed</forename><surname>Attia</surname></persName></author>
		<author><persName><forename type="first">Aitziber</forename><surname>Atutxa</surname></persName></author>
		<author><persName><forename type="first">Elena</forename><surname>Badmaeva</surname></persName></author>
		<author><persName><forename type="first">Miguel</forename><surname>Ballesteros</surname></persName></author>
		<author><persName><forename type="first">Esha</forename><surname>Banerjee</surname></persName></author>
		<author><persName><forename type="first">Sebastian</forename><surname>Bank</surname></persName></author>
		<author><persName><forename type="first">John</forename><surname>Bauer</surname></persName></author>
		<author><persName><forename type="first">Kepa</forename><surname>Bengoetxea</surname></persName></author>
		<author><persName><forename type="first">Riyaz</forename><forename type="middle">Ahmad</forename><surname>Bhat</surname></persName></author>
		<author><persName><forename type="first">Eckhard</forename><surname>Bick</surname></persName></author>
		<author><persName><forename type="first">Cristina</forename><surname>Bosco</surname></persName></author>
		<author><persName><forename type="first">Gosse</forename><surname>Bouma</surname></persName></author>
		<author><persName><forename type="first">Sam</forename><surname>Bowman</surname></persName></author>
		<author><persName><forename type="first">Aljoscha</forename><surname>Burchardt</surname></persName></author>
		<author><persName><forename type="first">Marie</forename><surname>Candito</surname></persName></author>
		<author><persName><forename type="first">Gauthier</forename><surname>Caron</surname></persName></author>
		<author><persName><forename type="first">Gülşen</forename><surname>Cebiroğlu Eryiğit</surname></persName></author>
		<author><persName><forename type="first">Giuseppe</forename><forename type="middle">G A</forename><surname>Celano</surname></persName></author>
		<author><persName><forename type="first">Savas</forename><surname>Cetin</surname></persName></author>
		<author><persName><forename type="first">Fabricio</forename><surname>Chalub</surname></persName></author>
		<author><persName><forename type="first">Jinho</forename><surname>Choi</surname></persName></author>
		<author><persName><forename type="first">Yongseok</forename><surname>Cho</surname></persName></author>
		<author><persName><forename type="first">Silvie</forename><surname>Cinková</surname></persName></author>
		<author><persName><forename type="first">Çağrı</forename><surname>Çöltekin</surname></persName></author>
		<author><persName><forename type="first">Miriam</forename><surname>Connor</surname></persName></author>
		<author><persName><forename type="first">Marie-Catherine</forename><surname>de Marneffe</surname></persName></author>
		<author><persName><forename type="first">Valeria</forename><surname>de Paiva</surname></persName></author>
		<author><persName><forename type="first">Arantza</forename><surname>Diaz de Ilarraza</surname></persName></author>
		<author><persName><forename type="first">Kaja</forename><surname>Dobrovoljc</surname></persName></author>
		<author><persName><forename type="first">Timothy</forename><surname>Dozat</surname></persName></author>
		<author><persName><forename type="first">Kira</forename><surname>Droganova</surname></persName></author>
		<author><persName><forename type="first">Marhaba</forename><surname>Eli</surname></persName></author>
		<author><persName><forename type="first">Ali</forename><surname>Elkahky</surname></persName></author>
		<author><persName><forename type="first">Tomaž</forename><surname>Erjavec</surname></persName></author>
		<author><persName><forename type="first">Richárd</forename><surname>Farkas</surname></persName></author>
		<author><persName><forename type="first">Hector</forename><surname>Fernandez Alcalde</surname></persName></author>
		<author><persName><forename type="first">Jennifer</forename><surname>Foster</surname></persName></author>
		<author><persName><forename type="first">Cláudia</forename><surname>Freitas</surname></persName></author>
		<author><persName><forename type="first">Katarína</forename><surname>Gajdošová</surname></persName></author>
		<author><persName><forename type="first">Daniel</forename><surname>Galbraith</surname></persName></author>
		<author><persName><forename type="first">Marcos</forename><surname>Garcia</surname></persName></author>
		<author><persName><forename type="first">Filip</forename><surname>Ginter</surname></persName></author>
		<author><persName><forename type="first">Iakes</forename><surname>Goenaga</surname></persName></author>
		<author><persName><forename type="first">Koldo</forename><surname>Gojenola</surname></persName></author>
		<author><persName><forename type="first">Memduh</forename><surname>Gökırmak</surname></persName></author>
		<author><persName><forename type="first">Yoav</forename><surname>Goldberg</surname></persName></author>
		<author><persName><forename type="first">Xavier</forename><surname>Gómez Guinovart</surname></persName></author>
		<author><persName><forename type="first">Berta</forename><forename type="middle">Gonzáles</forename><surname>Saavedra</surname></persName></author>
		<author><persName><forename type="first">Matias</forename><surname>Grioni</surname></persName></author>
		<author><persName><forename type="first">Normunds</forename><surname>Grūzītis</surname></persName></author>
		<author><persName><forename type="first">Bruno</forename><surname>Guillaume</surname></persName></author>
		<author><persName><forename type="first">Nizar</forename><surname>Habash</surname></persName></author>
		<author><persName><forename type="first">Jan</forename><surname>Hajič</surname></persName></author>
		<author><persName><forename type="first">Jan</forename><surname>Hajič jr.</surname></persName></author>
		<author><persName><forename type="first">Linh</forename><surname>Hà Mỹ</surname></persName></author>
		<author><persName><forename type="first">Kim</forename><surname>Harris</surname></persName></author>
		<author><persName><forename type="first">Dag</forename><surname>Haug</surname></persName></author>
		<author><persName><forename type="first">Barbora</forename><surname>Hladká</surname></persName></author>
		<author><persName><forename type="first">Jaroslava</forename><surname>Hlaváčová</surname></persName></author>
		<author><persName><forename type="first">Petter</forename><surname>Hohle</surname></persName></author>
		<author><persName><forename type="first">Radu</forename><surname>Ion</surname></persName></author>
		<author><persName><forename type="first">Elena</forename><surname>Irimia</surname></persName></author>
		<author><persName><forename type="first">Anders</forename><surname>Johannsen</surname></persName></author>
		<author><persName><forename type="first">Fredrik</forename><surname>Jørgensen</surname></persName></author>
		<author><persName><forename type="first">Hüner</forename><surname>Kaşıkara</surname></persName></author>
		<author><persName><forename type="first">Hiroshi</forename><surname>Kanayama</surname></persName></author>
		<author><persName><forename type="first">Jenna</forename><surname>Kanerva</surname></persName></author>
		<author><persName><forename type="first">Tolga</forename><surname>Kayadelen</surname></persName></author>
		<author><persName><forename type="first">Václava</forename><surname>Kettnerová</surname></persName></author>
		<author><persName><forename type="first">Jesse</forename><surname>Kirchner</surname></persName></author>
		<author><persName><forename type="first">Natalia</forename><surname>Kotsyba</surname></persName></author>
		<author><persName><forename type="first">Simon</forename><surname>Krek</surname></persName></author>
		<author><persName><forename type="first">Sookyoung</forename><surname>Kwak</surname></persName></author>
		<author><persName><forename type="first">Veronika</forename><surname>Laippala</surname></persName></author>
		<author><persName><forename type="first">Lorenzo</forename><surname>Lambertino</surname></persName></author>
		<author><persName><forename type="first">Tatiana</forename><surname>Lando</surname></persName></author>
		<author><persName><forename type="first">Phương</forename><surname>Lê Hồng</surname></persName></author>
		<author><persName><forename type="first">Alessandro</forename><surname>Lenci</surname></persName></author>
		<author><persName><forename type="first">Saran</forename><surname>Lertpradit</surname></persName></author>
		<author><persName><forename type="first">Herman</forename><surname>Leung</surname></persName></author>
		<author><persName><forename type="first">Cheuk</forename><forename type="middle">Ying</forename><surname>Li</surname></persName></author>
		<author><persName><forename type="first">Josie</forename><surname>Li</surname></persName></author>
		<author><persName><forename type="first">Nikola</forename><surname>Ljubešić</surname></persName></author>
		<author><persName><forename type="first">Olga</forename><surname>Loginova</surname></persName></author>
		<author><persName><forename type="first">Olga</forename><surname>Lyashevskaya</surname></persName></author>
		<author><persName><forename type="first">Teresa</forename><surname>Lynn</surname></persName></author>
		<author><persName><forename type="first">Vivien</forename><surname>Macketanz</surname></persName></author>
		<author><persName><forename type="first">Aibek</forename><surname>Makazhanov</surname></persName></author>
		<author><persName><forename type="first">Michael</forename><surname>Mandl</surname></persName></author>
		<author><persName><forename type="first">Christopher</forename><surname>Manning</surname></persName></author>
		<author><persName><forename type="first">Ruli</forename><surname>Manurung</surname></persName></author>
		<author><persName><forename type="first">Cătălina</forename><surname>Mărănduc</surname></persName></author>
		<author><persName><forename type="first">David</forename><surname>Mareček</surname></persName></author>
		<author><persName><forename type="first">Katrin</forename><surname>Marheinecke</surname></persName></author>
		<author><persName><forename type="first">Héctor</forename><surname>Martínez Alonso</surname></persName></author>
		<author><persName><forename type="first">André</forename><surname>Martins</surname></persName></author>
		<author><persName><forename type="first">Jan</forename><surname>Mašek</surname></persName></author>
		<author><persName><forename type="first">Yuji</forename><surname>Matsumoto</surname></persName></author>
		<author><persName><forename type="first">Ryan</forename><surname>McDonald</surname></persName></author>
		<author><persName><forename type="first">Gustavo</forename><surname>Mendonça</surname></persName></author>
		<author><persName><forename type="first">Anna</forename><surname>Missilä</surname></persName></author>
		<author><persName><forename type="first">Verginica</forename><surname>Mititelu</surname></persName></author>
		<author><persName><forename type="first">Yusuke</forename><surname>Miyao</surname></persName></author>
		<author><persName><forename type="first">Simonetta</forename><surname>Montemagni</surname></persName></author>
		<author><persName><forename type="first">Amir</forename><surname>More</surname></persName></author>
		<author><persName><forename type="first">Laura</forename><surname>Moreno Romero</surname></persName></author>
		<author><persName><forename type="first">Shunsuke</forename><surname>Mori</surname></persName></author>
		<author><persName><forename type="first">Bohdan</forename><surname>Moskalevskyi</surname></persName></author>
		<author><persName><forename type="first">Kadri</forename><surname>Muischnek</surname></persName></author>
		<author><persName><forename type="first">Nina</forename><surname>Mustafina</surname></persName></author>
		<author><persName><forename type="first">Kaili</forename><surname>Müürisep</surname></persName></author>
		<author><persName><forename type="first">Pinkey</forename><surname>Nainwani</surname></persName></author>
		<author><persName><forename type="first">Anna</forename><surname>Nedoluzhko</surname></persName></author>
		<author><persName><forename type="first">Lương</forename><surname>Nguyễn Thị</surname></persName></author>
		<author><persName><forename type="first">Huyền</forename><surname>Nguyễn Thị Minh</surname></persName></author>
		<author><persName><forename type="first">Vitaly</forename><surname>Nikolaev</surname></persName></author>
		<author><persName><forename type="first">Rattima</forename><surname>Nitisaroj</surname></persName></author>
		<author><persName><forename type="first">Hanna</forename><surname>Nurmi</surname></persName></author>
		<author><persName><forename type="first">Stina</forename><surname>Ojala</surname></persName></author>
		<author><persName><forename type="first">Petya</forename><surname>Osenova</surname></persName></author>
		<author><persName><forename type="first">Lilja</forename><surname>Øvrelid</surname></persName></author>
		<author><persName><forename type="first">Elena</forename><surname>Pascual</surname></persName></author>
		<author><persName><forename type="first">Marco</forename><surname>Passarotti</surname></persName></author>
		<author><persName><forename type="first">Cenel-Augusto</forename><surname>Perez</surname></persName></author>
		<author><persName><forename type="first">Guy</forename><surname>Perrier</surname></persName></author>
		<author><persName><forename type="first">Slav</forename><surname>Petrov</surname></persName></author>
		<author><persName><forename type="first">Jussi</forename><surname>Piitulainen</surname></persName></author>
		<author><persName><forename type="first">Emily</forename><surname>Pitler</surname></persName></author>
		<author><persName><forename type="first">Barbara</forename><surname>Plank</surname></persName></author>
		<author><persName><forename type="first">Martin</forename><surname>Popel</surname></persName></author>
		<author><persName><forename type="first">Lauma</forename><surname>Pretkalniņa</surname></persName></author>
		<author><persName><forename type="first">Prokopis</forename><surname>Prokopidis</surname></persName></author>
		<author><persName><forename type="first">Tiina</forename><surname>Puolakainen</surname></persName></author>
		<author><persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName></author>
		<author><persName><forename type="first">Alexandre</forename><surname>Rademaker</surname></persName></author>
		<author><persName><forename type="first">Livy</forename><surname>Real</surname></persName></author>
		<author><persName><forename type="first">Siva</forename><surname>Reddy</surname></persName></author>
		<author><persName><forename type="first">Georg</forename><surname>Rehm</surname></persName></author>
		<author><persName><forename type="first">Larissa</forename><surname>Rinaldi</surname></persName></author>
		<author><persName><forename type="first">Laura</forename><surname>Rituma</surname></persName></author>
		<author><persName><forename type="first">Rudolf</forename><surname>Rosa</surname></persName></author>
		<author><persName><forename type="first">Davide</forename><surname>Rovati</surname></persName></author>
		<author><persName><forename type="first">Shadi</forename><surname>Saleh</surname></persName></author>
		<author><persName><forename type="first">Manuela</forename><surname>Sanguinetti</surname></persName></author>
		<author><persName><forename type="first">Baiba</forename><surname>Saulīte</surname></persName></author>
		<author><persName><forename type="first">Yanin</forename><surname>Sawanakunanon</surname></persName></author>
		<author><persName><forename type="first">Sebastian</forename><surname>Schuster</surname></persName></author>
		<author><persName><forename type="first">Djamé</forename><surname>Seddah</surname></persName></author>
		<author><persName><forename type="first">Wolfgang</forename><surname>Seeker</surname></persName></author>
		<author><persName><forename type="first">Mojgan</forename><surname>Seraji</surname></persName></author>
		<author><persName><forename type="first">Lena</forename><surname>Shakurova</surname></persName></author>
		<author><persName><forename type="first">Mo</forename><surname>Shen</surname></persName></author>
		<author><persName><forename type="first">Atsuko</forename><surname>Shimada</surname></persName></author>
		<author><persName><forename type="first">Muh</forename><surname>Shohibussirri</surname></persName></author>
		<author><persName><forename type="first">Natalia</forename><surname>Silveira</surname></persName></author>
		<author><persName><forename type="first">Maria</forename><surname>Simi</surname></persName></author>
		<author><persName><forename type="first">Radu</forename><surname>Simionescu</surname></persName></author>
		<author><persName><forename type="first">Katalin</forename><surname>Simkó</surname></persName></author>
		<author><persName><forename type="first">Mária</forename><surname>Šimková</surname></persName></author>
		<author><persName><forename type="first">Kiril</forename><surname>Simov</surname></persName></author>
		<author><persName><forename type="first">Aaron</forename><surname>Smith</surname></persName></author>
		<author><persName><forename type="first">Antonio</forename><surname>Stella</surname></persName></author>
		<author><persName><forename type="first">Jana</forename><surname>Strnadová</surname></persName></author>
		<author><persName><forename type="first">Alane</forename><surname>Suhr</surname></persName></author>
		<author><persName><forename type="first">Umut</forename><surname>Sulubacak</surname></persName></author>
		<author><persName><forename type="first">Zsolt</forename><surname>Szántó</surname></persName></author>
		<author><persName><forename type="first">Dima</forename><surname>Taji</surname></persName></author>
		<author><persName><forename type="first">Takaaki</forename><surname>Tanaka</surname></persName></author>
		<author><persName><forename type="first">Trond</forename><surname>Trosterud</surname></persName></author>
		<author><persName><forename type="first">Anna</forename><surname>Trukhina</surname></persName></author>
		<author><persName><forename type="first">Reut</forename><surname>Tsarfaty</surname></persName></author>
		<author><persName><forename type="first">Francis</forename><surname>Tyers</surname></persName></author>
		<author><persName><forename type="first">Sumire</forename><surname>Uematsu</surname></persName></author>
		<author><persName><forename type="first">Zdeňka</forename><surname>Urešová</surname></persName></author>
		<author><persName><forename type="first">Larraitz</forename><surname>Uria</surname></persName></author>
		<author><persName><forename type="first">Hans</forename><surname>Uszkoreit</surname></persName></author>
		<author><persName><forename type="first">Gertjan</forename><surname>van Noord</surname></persName></author>
		<author><persName><forename type="first">Viktor</forename><surname>Varga</surname></persName></author>
		<author><persName><forename type="first">Veronika</forename><surname>Vincze</surname></persName></author>
		<author><persName><forename type="first">Jonathan</forename><forename type="middle">North</forename><surname>Washington</surname></persName></author>
		<author><persName><forename type="first">Zhuoran</forename><surname>Yu</surname></persName></author>
		<author><persName><forename type="first">Zdeněk</forename><surname>Žabokrtský</surname></persName></author>
		<author><persName><forename type="first">Daniel</forename><surname>Zeman</surname></persName></author>
		<author><persName><forename type="first">Hanzhi</forename><surname>Zhu</surname></persName></author>
	</analytic>
	<monogr>
		<title level="m">LIN-DAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
		<respStmt>
			<orgName>Faculty of Mathematics and Physics, Charles University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Glove: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Empirical Methods in Natural Language Processing (EMNLP)</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Deep contextualized word representations</title>
		<author>
			<persName><forename type="first">Matthew</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohit</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018-06">June 2018</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
	<note>Long Papers</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">Alec</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeff</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rewon</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">An analysis of encoder representations in transformer-based machine translation</title>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Raganato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jörg</forename><surname>Tiedemann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-11">November 2018</date>
			<biblScope unit="page" from="287" to="297" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">Ashish</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noam</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Niki</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jakob</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Llion</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aidan</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lukasz</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Illia</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</title>
				<meeting><address><addrLine>Long Beach, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-12">December 2017. 2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">A multiscale visualization of attention in the transformer model</title>
		<author>
			<persName><forename type="first">Jesse</forename><surname>Vig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019</title>
				<meeting>the 57th Conference of the Association for Computational Linguistics, ACL 2019<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019-08-02">July 28 -August 2, 2019. 2019</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="37" to="42" />
		</imprint>
	</monogr>
	<note>System Demonstrations</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Analyzing the Structure of Attention in a Transformer Language Model</title>
		<author>
			<persName><forename type="first">Jesse</forename><surname>Vig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonatan</forename><surname>Belinkov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019-08">August 2019</date>
			<biblScope unit="page" from="63" to="76" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Analyzing multi-head selfattention: Specialized heads do the heavy lifting, the rest can be pruned</title>
		<author>
			<persName><forename type="first">Elena</forename><surname>Voita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Talbot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fedor</forename><surname>Moiseev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rico</forename><surname>Sennrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Titov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-07">July 2019</date>
			<biblScope unit="page" from="5797" to="5808" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Cross-lingual bert transformation for zero-shot dependency parsing</title>
		<author>
			<persName><forename type="first">Yuxuan</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wanxiang</forename><surname>Che</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jiang</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yijia</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ting</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">A broad-coverage challenge corpus for sentence understanding through inference</title>
		<author>
			<persName><forename type="first">Adina</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikita</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><surname>Bowman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-06">June 2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1112" to="1122" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">Xlnet: Generalized autoregressive pretraining for language understanding</title>
		<author>
			<persName><forename type="first">Zhilin</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zihang</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yiming</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jaime</forename><forename type="middle">G</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruslan</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis</title>
		<author>
			<persName><forename type="first">Kelly</forename><forename type="middle">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</meeting>
		<imprint>
			<date type="published" when="2018-11">November 2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
