<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Syntax Representation in Word Embeddings and Neural Networks - A Survey</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Tomasz</forename><surname>Limisiewicz</surname></persName>
							<email>limisiewicz@ufal.mff.cuni.cz</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Institute of Formal and Applied Linguistics</orgName>
								<orgName type="department" key="dep2">Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">David</forename><surname>Mareček</surname></persName>
							<email>marecek@ufal.mff.cuni.cz</email>
							<affiliation key="aff0">
								<orgName type="department" key="dep1">Institute of Formal and Applied Linguistics</orgName>
								<orgName type="department" key="dep2">Faculty of Mathematics and Physics</orgName>
								<orgName type="institution">Charles University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Syntax Representation in Word Embeddings and Neural Networks - A Survey</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CCF46114AC4375288A95666A8F50FCDC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T08:53+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Neural networks trained on natural language processing tasks capture syntax even though it is not provided as a supervision signal. This indicates that syntactic analysis is essential to the understanding of language in artificial intelligence systems. This overview paper covers approaches to evaluating the amount of syntactic information included in the representations of words for different neural network architectures. We mainly summarize research on English monolingual data for language modeling tasks and on multilingual data for neural machine translation systems and multilingual language models. We describe which pre-trained models and representations of language are best suited for transfer to syntactic tasks.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Modern methods of natural language processing (NLP) are based on complex neural network architectures, where language units are represented in a metric space <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b27">28,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b8">9,</ref><ref type="bibr" target="#b29">30]</ref>. Such a phenomenon allows us to express linguistic features (i.e., morphological, lexical, syntactic) mathematically.</p><p>The methods of obtaining such representations and their interpretations have been described in multiple overview works. Almeida and Xexéo surveyed different types of static word embeddings <ref type="bibr" target="#b0">[1]</ref>, and Liu et al. <ref type="bibr" target="#b17">[18]</ref> focused on contextual representations found in the most recent neural models. Belinkov and Glass <ref type="bibr" target="#b3">[4]</ref> surveyed the strategies of interpreting latent representations. To the best of our knowledge, we are the first to focus on the syntactic and morphological abilities of word representations. We also cover the latest approaches, which go beyond the interpretation of latent vectors and analyze the attentions present in state-of-the-art Transformer models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Vector Representations of Words</head><p>This section introduces several types of architectures that we will analyze in this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Static Word Embeddings</head><p>In the classical methods of language representation, each word is assigned a vector regardless of its current context. In Latent Semantic Analysis <ref type="bibr" target="#b7">[8]</ref>, the representation was obtained by counting word frequencies across documents on distinct subjects.</p><p>In more recent approaches, a shallow neural network is used to predict each word based on its context (Word2Vec <ref type="bibr" target="#b22">[23]</ref>) or to approximate the frequency of co-occurrence for a pair of words (GloVe <ref type="bibr" target="#b27">[28]</ref>). One explanation of the effectiveness of these algorithms is the distributional hypothesis <ref type="bibr" target="#b10">[11]</ref>: "words that occur in the same contexts tend to have similar meanings".</p></div>
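To make the contrast with contextual models concrete, the following minimal sketch trains static skip-gram embeddings. The use of the gensim library and the toy sentences are assumptions of this illustration, not part of the surveyed work; in practice, these models are trained on very large corpora.

```python
# Minimal sketch: training static word embeddings (Word2Vec skip-gram).
# The toy corpus stands in for the large text collections used in practice.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "drinks", "milk"],
    ["the", "dog", "drinks", "water"],
]

# sg=1 selects the skip-gram variant: predict context words from the center word
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

vector = model.wv["cat"]                      # one static vector per word type
print(model.wv.most_similar("cat", topn=2))   # neighbors in the embedding space
```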
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Contextual Word Vectors in Recurrent Networks</head><p>The main disadvantage of static word embeddings is that they do not take the context of words into account. This is especially an issue for languages rich in words that have multiple meanings.</p><p>The contextual embeddings introduced in <ref type="bibr" target="#b28">[29]</ref> and <ref type="bibr" target="#b21">[22]</ref> are able to encode both words and their contexts. They are based on recurrent neural networks (RNNs) and are typically trained on language modeling or machine translation tasks using large text corpora. The outputs of the RNN layers are context-dependent representations that have been shown to perform well when used as inputs for other NLP tasks with much less training data available.</p><p>Another improvement in context modeling was made possible by the attention mechanism <ref type="bibr" target="#b1">[2]</ref>. It allows passing information from the most relevant parts of the RNN encoder, instead of using only the contextual representation of the last token.</p></div>
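As an illustration of how such context-dependent vectors arise, the sketch below runs tokens through a two-layer bidirectional LSTM; all dimensions and the random inputs are placeholder assumptions of this example.

```python
# Sketch: context-dependent token vectors from a bidirectional RNN.
# The same word type receives different vectors in different sentences.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
rnn = nn.LSTM(emb_dim, hidden, num_layers=2, bidirectional=True, batch_first=True)

token_ids = torch.randint(0, vocab_size, (1, 7))  # one sentence of 7 token ids
states, _ = rnn(embed(token_ids))                 # shape: (1, 7, 2 * hidden)
# states[0, i] is the contextual representation of the i-th token; such
# vectors are the features transferred to other NLP tasks.
```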
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Contextual Representation in Transformers</head><p>The most recent and widely used architecture is the Transformer <ref type="bibr" target="#b31">[32]</ref>. It consists of several (6 to 24) layers, and each token position in each layer can attend to any position in the previous layer using a self-attention mechanism. Training such an architecture can be easily parallelized since individual tokens can be processed independently; their positions are encoded within the input embeddings. An example visualization of the attention distribution computed in a Transformer trained for language modeling (BERT <ref type="bibr" target="#b8">[9]</ref>) is presented in Figure <ref type="figure" target="#fig_0">1</ref>.</p><p>In addition to vectors, the Transformer includes latent representations in the form of self-attention weights, which are two-dimensional matrices. We summarize the research on the syntactic properties of attention weights in Section 5.</p></div>
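The self-attention weights discussed throughout Section 5 can be reproduced in a few lines. The sketch below computes scaled dot-product attention for a single head; the random projection matrices are placeholder assumptions standing in for learned parameters.

```python
# Sketch: the self-attention matrix of one Transformer head.
# Every token attends to every token, giving an N x N matrix of weights.
import torch
import torch.nn.functional as F

N, d = 7, 64                                 # sequence length, head dimension
x = torch.randn(N, d)                        # token representations (toy values)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v
A = F.softmax(q @ k.T / d ** 0.5, dim=-1)    # attention weights; rows sum to 1
contextualized = A @ v                       # weighted mixture of value vectors
```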
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Measures of Syntactic Information</head><p>This section describes the metrics used to evaluate the syntactic information captured by word embeddings and latent representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Syntactic Analogies</head><p>In the recent revival of word embeddings <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b27">28]</ref>, a strong focus was put on examining the phenomenon of encoding analogies in multidimensional space. That is to say, the shift vector between pairs of analogous words is approximately constant, e.g., the pairs drinking - drank, swimming - swam in Figure <ref type="figure" target="#fig_1">2</ref>.</p><p>Syntactic analogies of this type are particularly relevant for this overview. They include the following relations: adjective - adverb; singular - plural; adjective - comparative - superlative; verb - present participle - past participle. Syntactic analogy is usually evaluated on the Google Analogy Test Set <ref type="bibr" target="#b22">[23]</ref>.<ref type="foot" target="#foot_0">1</ref> An evaluation example consists of two word pairs represented by the embeddings: (v 1 , v 2 ), (u 1 , u 2 ). We compute the analogy shift vector as the difference between the embeddings of the first pair, s = v 2 − v 1 . The result is positive if the nearest word embedding to the vector u 1 + s is u 2 .</p><formula xml:id="formula_0">WA = |{(v 1 , v 2 , u 1 , u 2 ) : u 2 ≈ u 1 + v 2 − v 1 }| / |{(v 1 , v 2 , u 1 , u 2 )}| (1)</formula></div>
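A minimal sketch of the evaluation in Eq. (1) follows; the dictionary-based embedding lookup and the helper names are assumptions of this example. As is standard for the Google Analogy Test Set, the three query words are excluded from the nearest-neighbor search.

```python
# Sketch of the syntactic analogy test (Eq. 1): u2 should be the nearest
# neighbor of u1 + (v2 - v1) in the embedding space.
import numpy as np

def nearest(emb, query, exclude):
    # cosine similarity between the query vector and every candidate word
    cos = {w: query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
           for w, v in emb.items() if w not in exclude}
    return max(cos, key=cos.get)

def analogy_accuracy(emb, quadruples):
    correct = 0
    for w1, w2, w3, w4 in quadruples:   # e.g. drinking : drank = swimming : swam
        shift = emb[w2] - emb[w1]       # the analogy shift vector s
        if nearest(emb, emb[w3] + shift, exclude={w1, w2, w3}) == w4:
            correct += 1
    return correct / len(quadruples)
```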
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Sequence Tagging</head><p>Sequence tagging is a multiclass classification problem. The aim is to predict the correct tag for each token of a sequence. A typical example is part-of-speech (POS) tagging. The accuracy evaluation is straightforward: the number of correctly assigned tags is divided by the number of tokens.</p></div>
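The accuracy computation is a one-liner; the sketch below is given only to fix the convention that every token counts exactly once.

```python
# Tagging accuracy: correctly assigned tags divided by the number of tokens.
def tagging_accuracy(gold_tags, predicted_tags):
    assert len(gold_tags) == len(predicted_tags)
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)

print(tagging_accuracy(["DET", "NOUN", "VERB"], ["DET", "NOUN", "ADJ"]))  # 2/3
```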
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Syntactic Structure Prediction</head><p>The inference of reasonable syntactic structures from word representations is the most challenging task covered in our survey. There are attempts to predict both dependency <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b30">31,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b6">7]</ref> and constituency trees <ref type="bibr" target="#b20">[21,</ref><ref type="bibr" target="#b12">13]</ref>. Dependency trees are evaluated using the unlabeled attachment score (UAS) or its undirected variant (UUAS):</p><formula xml:id="formula_1">UAS = #correctly_attached_words / #all_words<label>(2)</label></formula><p>The equation for the labeled attachment score (LAS) is the same, but it additionally requires predicting the correct dependency label for each edge.</p><p>For constituency trees, we define precision (P) and recall (R) over correctly predicted phrases:</p><formula xml:id="formula_2">P = #correct_phrases / #predicted_phrases , R = #correct_phrases / #gold_phrases<label>(3)</label></formula><p>Usually, the F1 score is reported, which is the harmonic mean of precision and recall.</p></div>
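The sketch below implements Eqs. (2) and (3) directly; representing a dependency tree as a list of head indices and a constituency tree as a set of phrase spans are assumptions of this illustration.

```python
# Sketch of the parsing metrics: UAS over head indices (Eq. 2) and
# precision/recall/F1 over constituency phrase spans (Eq. 3).
def uas(gold_heads, predicted_heads):
    correct = sum(g == p for g, p in zip(gold_heads, predicted_heads))
    return correct / len(gold_heads)

def phrase_f1(gold_spans, predicted_spans):
    correct = len(set(gold_spans) & set(predicted_spans))
    precision = correct / len(predicted_spans)   # correct among predicted
    recall = correct / len(gold_spans)           # correct among gold
    return 2 * precision * recall / (precision + recall)
```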
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Attention's Dependency Alignment</head><p>In Section 5 we describe the examination of syntactic properties of self-attention matrices. It can be evaluated using Dependency Alignment <ref type="bibr" target="#b33">[34]</ref>, which sums the attention weights at the positions corresponding to the pairs of tokens forming a dependency edge in the tree.</p><formula xml:id="formula_3">DepAl A = ∑ (i,j)∈E A i,j / ∑ N i=1 ∑ N j=1 A i,j<label>(4)</label></formula><p>Dependency Accuracy <ref type="bibr" target="#b34">[35,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b14">15]</ref> is an alternative metric; for each dependency label, it measures how often the relation's governor/dependent is the token most attended to by the dependent/governor.</p><formula xml:id="formula_4">DepAcc l,d,A = |{(i, j) ∈ E l,d : j = arg max A i,• }| / |E l,d |<label>(5)</label></formula><p>Notation: E is the set of all dependency tree edges, and E l,d is the subset of edges with label l and direction d; i.e., in the dependent-to-governor direction, the first element of the tuple, i, is the dependent of the relation and the second element, j, is the governor. A is a self-attention matrix, and A i,• denotes the i-th row of the matrix; N is the sequence length.</p></div>
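Both metrics reduce to a few array operations. In the sketch below, edges is assumed to be a list of (dependent, governor) index pairs, already filtered to one label and direction for Eq. (5).

```python
# Sketch of Dependency Alignment (Eq. 4) and Dependency Accuracy (Eq. 5)
# for a single self-attention matrix A of shape (N, N).
import numpy as np

def dependency_alignment(A, edges):
    # share of the total attention mass that falls on dependency edges
    return sum(A[i, j] for i, j in edges) / A.sum()

def dependency_accuracy(A, edges):
    # fraction of edges whose governor is the dependent's most attended token
    hits = sum(np.argmax(A[i]) == j for i, j in edges)
    return hits / len(edges)
```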
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Morphology and Syntax in Word Embeddings and Latent Vectors</head><p>In this section, we summarize the research on the syntactic information captured by vector representations of words.</p><p>We devote significant attention to POS tagging, which is a popular evaluation objective. Even though it is a morphological task, it is highly relevant to syntactic analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Syntactic Analogies</head><p>The first wave of research on the vector representation of words focused on the statistical distribution of words across distinct topics - Latent Semantic Analysis <ref type="bibr" target="#b7">[8]</ref>. It captured statistical properties of words, yet there were no positive results in retrieving syntactic analogies or encoding syntax. The Google Analogy Test Set was released together with the popular word embedding algorithm Word2Vec <ref type="bibr" target="#b22">[23]</ref>. One of the exceptional properties of this method was its high accuracy in the analogy tasks. In particular, the best configuration found the correct syntactic analogy in 68.9% of cases.</p><p>The GloVe embeddings improved the results on syntactic analogies to 69.3% <ref type="bibr" target="#b27">[28]</ref>. A much more significant improvement was reported for semantic analogies. They also outperform a variety of other vectorization methods.</p><p>In <ref type="bibr" target="#b23">[24]</ref>, a simple recurrent neural network was trained with a language modeling objective. The word representation is taken from the input layer. The evaluation in <ref type="bibr" target="#b22">[23]</ref> shows that Word2Vec performs better in the syntactic analogy task. This observation is surprising because representations from RNNs have proven effective in transfer to other syntactic tasks (we elaborate on that in Sections 4.2 and 4.3). We think that possible explanations are: 1. the techniques of RNN training have substantially improved in recent years; 2. syntactic analogies focus on particular words, while for other syntactic tasks, the context is more important.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Part of Speech Tagging</head><p>Measuring to what extent a linguistic feature such as POS is captured in word representations is usually performed by a method called probing. In probing, the parameters of the pre-trained network are fixed, the output word representations are computed as in inference mode and then fed to a simple neural layer. Only this simple layer is optimized for the new task.</p><p>The number of probing experiments rose with the advent of multilayer<ref type="foot" target="#foot_1">2</ref> RNNs trained for language modeling and machine translation.</p><p>Belinkov et al. <ref type="bibr" target="#b2">[3]</ref> probe a recurrent neural machine translation (NMT) system with four layers to predict part-of-speech tags (along with morphological features). They use Arabic, Hebrew, French, German, and Czech to English pairs. They observe that adding a character-based representation computed by a convolutional neural network in addition to the word-embedding input is beneficial, especially for morphologically rich languages.</p><p>In a subsequent study <ref type="bibr" target="#b3">[4]</ref>, the source language of translation is English and the experiments are conducted solely for this language. It is noted that the most morphosyntactic representation is usually obtained in the middle layers of the network.</p><p>The influence of using a particular objective in pre-training an RNN model is comprehensively analyzed by Blevins et al. <ref type="bibr" target="#b4">[5]</ref>. They pre-train models on four objectives: syntactic parsing, semantic role labeling, machine translation, and language modeling. The former two objectives may reveal morphosyntactic information to a larger extent than the other settings mentioned here. In particular, the probe of the RNN syntactic parser achieves near-perfect accuracy in part-of-speech tagging.</p><p>The introduction of ELMo <ref type="bibr" target="#b28">[29]</ref> brought a remarkable advancement in transfer learning from the RNN language model to a variety of other NLP tasks. The authors examined the POS capabilities of the representations and compared the results with the neural machine translation system CoVe <ref type="bibr" target="#b21">[22]</ref>, which also uses an RNN architecture.</p><p>Zhang et al. <ref type="bibr" target="#b38">[39]</ref> perform further experiments with CoVe and ELMo. They demonstrate that language modeling systems are better suited to capture morphology and syntax in the hidden states than machine translation, if comparable amounts of data are used to train both systems. Moreover, the corpora for language modeling are typically more extensive than those for machine translation, which can further improve the results.</p><p>Another comprehensive evaluation of the morphological and syntactic capabilities of language models was conducted by Liu et al. <ref type="bibr" target="#b16">[17]</ref>. Probing was applied to a language model based on the Transformer architecture (BERT) and compared with ELMo and static word embeddings (Word2Vec). They observe that the hidden states of the Transformer do not demonstrate a major increase in probed POS accuracy over the RNN model, even though it is more complex and has a larger number of parameters.</p><p>POS tag probing was also performed for languages other than English.
For instance, Musil <ref type="bibr" target="#b24">[25]</ref> trains translation systems (with RNN and Transformer architectures) from Czech to English, examines the learned input embeddings of the models, and compares them to a Word2Vec model trained on Czech.</p><p>In Figures <ref type="figure" target="#fig_4">3 and 4</ref>, we present a comparison of different settings for POS tag probing. Each point denotes a pair of results obtained in the same paper and on the same dataset, but with different types of embeddings or pre-training objectives. We can therefore observe that the setting plotted on the y-axis is better than the x-axis setting if the points lie above the identity function (red dashed line). We cannot say whether a method represented by another point performs better, as the evaluation settings differ.</p><p>Figure <ref type="figure" target="#fig_4">4</ref> clearly shows that RNN contextualization helps in part-of-speech tagging. As expected, the information about neighboring tokens is essential to predict the morphosyntactic functions of words correctly. This is especially true for homographs, which can have different parts of speech in different places in the text.</p><p>The influence of the RNN's pre-training task is presented in Figure <ref type="figure" target="#fig_2">3</ref>. Machine translation captures POS information much better than auto-encoding, which can be interpreted as translation from and to the same language. It is likely that the latter task is straightforward and therefore does not require encoding morphosyntax in the latent space. The difference between the results of machine translation and language modeling is small. Zhang et al. <ref type="bibr" target="#b38">[39]</ref> show that using a larger corpus for pre-training improves the POS accuracy. The main advantage of language models is that monolingual data is much easier to obtain than the parallel sentences necessary to train a machine translation system.</p></div>
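The probing recipe described at the beginning of this section amounts to training one linear layer on frozen activations. In the sketch below, random tensors stand in for the pre-trained network's hidden states and the gold POS tags; all sizes are placeholder assumptions.

```python
# Sketch of POS probing: the pre-trained encoder is frozen, so its token
# representations are treated as fixed inputs; only a linear layer is trained.
import torch
import torch.nn as nn

n_tokens, dim, n_tags = 500, 768, 17
reps = torch.randn(n_tokens, dim)             # frozen hidden states (stand-in)
tags = torch.randint(0, n_tags, (n_tokens,))  # gold POS tags (stand-in)

probe = nn.Linear(dim, n_tags)                # the only trainable parameters
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(probe(reps), tags)
    loss.backward()
    optimizer.step()

accuracy = (probe(reps).argmax(-1) == tags).float().mean()
```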
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Syntactic Structure Induction</head><p>Extraction of dependency structure is more demanding because instead of a prediction for single tokens, every pair of words needs to be evaluated.</p><p>Blevins et al. <ref type="bibr" target="#b4">[5]</ref> propose a feed-forward layer on top of a frozen RNN representation to predict whether a dependency tree edge connects a pair of tokens. They concatenate the vector representations of the two words and their element-wise product. Such a representation is fed as input to the binary classifier. It only looks at a pair of tokens at a time; therefore, the predicted edges may not form a valid tree.</p><p>Another approach, induction of whole syntactic structures from latent representations, was proposed by Hewitt and Manning <ref type="bibr" target="#b11">[12]</ref>. Their syntactic probing is based on training a matrix which is used to transform the output of the network's layers (they use BERT and ELMo). The objective of the probing is to approximate dependency tree distances between tokens 3 by the L2 norm of the difference of the transformed vectors. Probing produces the approximate syntactic pairwise distances for each pair of tokens. The minimum spanning tree algorithm is then used on the distance matrix to find the undirected dependency tree. The best configuration employs the 15th layer of BERT large and induces trees with 82.5% UAS on the Penn Treebank with Stanford Dependency annotation (relation directions and punctuation were disregarded in the experiments). The result for BERT is significantly higher than for ELMo, which gave 77.0% when the first layer was probed.</p><p>The paper also describes an alternative method of approximating syntactic depth by the L2 norm of the latent vector multiplied by a trainable matrix. The estimated depths allow prediction of the root of a sentence with 90.1% accuracy when the representation from the 16th layer of BERT large is probed.</p></div>
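A condensed sketch of the structural probe idea follows: a matrix B maps frozen hidden states so that squared L2 distances between transformed vectors approximate tree distances, and a minimum spanning tree recovers the undirected structure. Here B is random; in [12] it is trained to minimize the absolute difference between predicted and gold tree distances.

```python
# Sketch of a structural probe: squared distances between transformed vectors
# approximate dependency tree distances; an MST yields the undirected tree.
import torch
from scipy.sparse.csgraph import minimum_spanning_tree

def probed_distances(H, B):
    T = H @ B.T                             # (N, rank) transformed vectors
    diff = T.unsqueeze(0) - T.unsqueeze(1)  # (N, N, rank) pairwise differences
    return (diff ** 2).sum(-1)              # predicted squared tree distances

H = torch.randn(6, 768)                       # frozen hidden states (stand-in)
B = torch.randn(64, 768, requires_grad=True)  # would be trained on |d_pred - d_tree|
D = probed_distances(H, B).detach().numpy()
tree = minimum_spanning_tree(D)               # undirected dependency tree (sparse)
```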
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Multilingual Representations</head><p>The subsequent paper by Chi et al. <ref type="bibr" target="#b5">[6]</ref> applies the setting from <ref type="bibr" target="#b11">[12]</ref> to the multilingual language model mBERT. They train syntactic distance probes on 11 languages and compare the UAS of induced trees in four scenarios: 1. training and evaluating on the same language; 2. training on a single language, evaluating on a different one; 3. training on all languages except the evaluation one; 4. training on all languages, including the evaluation one. They demonstrate that the transfer is effective, as the results in all the configurations outperform the baselines (a right-branching tree and probing on a randomly initialized mBERT without pre-training). Even in the hardest case - zero-shot transfer from just one language - the result is at least 6.9 percentage points above the baselines (for Chinese). Nevertheless, for all the languages, no transfer-learning setting can beat training and evaluating the probe on the same language.</p><p>The paper includes an analysis of intrinsic features of BERT's vectors transformed by the probe. Noticeably, the vector differences between the representations of words connected by a dependency relation are clustered by relation labels, see Figure <ref type="figure">5</ref>.</p><p>Multilingual BERT embeddings are also analyzed by Wang et al. <ref type="bibr" target="#b35">[36]</ref>. They show that even for the multilingual vectors, the results can be improved by projecting vector spaces across languages. They use the Biaffine Graph-based Parser by Dozat and Manning <ref type="bibr" target="#b9">[10]</ref>, which consists of multiple RNN layers. Therefore, the experiment is not strictly comparable with probing, as most of the syntactic information is captured by the parser, and not by the embeddings. The article compares different types of vector representations fed as input to the parser. It is demonstrated that a cross-lingual transformation of the mBERT embeddings significantly improves the LAS of a parser trained on English and evaluated on 14 languages (including English); on average, from 60.53% to 63.54%. In comparison to other cross-lingual representations, the proposed method outperforms transformed static embeddings (FastText with SVD) and also slightly outperforms contextual embeddings (XLM).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Syntax in Transformer's Attention Matrices</head><p>Besides the vector representations of individual tokens, the Transformer architecture offers another representation with a possible syntactic interpretation - the weights of the self-attention heads. In each head, information can flow from each token to any other one. These connections may be easily analyzed and compared to syntactic relations proposed by linguists. In this section, we summarize different approaches to extracting syntax from attention. We present methods both for dependency and constituency structures.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Two-dimensional t-SNE visualization of probed mBERT embeddings from <ref type="bibr" target="#b5">[6]</ref>. Analysis of the clusters shows that the embeddings encode information about the type of dependency relations and, to a lesser extent, the language.</figDesc></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Dependency Trees</head><p>Raganato and Tiedemann <ref type="bibr" target="#b30">[31]</ref> induce dependency trees from the self-attention matrices of a neural machine translation encoder. They use the maximum spanning tree algorithm to connect pairs of tokens with high attention. Gold root information is used to find the direction of the edges. Trees extracted in this way are generally worse than the right-branching baseline (35.08% UAS on PUD) and outperform it slightly only in a few heads. The maximum UAS is obtained when a dependency structure is induced from one head of the 5th layer of the English to Chinese encoder - 38.87% UAS. Nevertheless, their approach assumes that the whole syntactic tree may be induced from just one attention head.</p><p>Recent articles have focused on the analysis of features and classification of the Transformer's self-attention heads. Vig and Belinkov <ref type="bibr" target="#b33">[34]</ref> apply multiple metrics to examine the properties of attention matrices computed in a unidirectional language model (GPT-2 <ref type="bibr" target="#b29">[30]</ref>). They show that in some heads, the attentions concentrate on tokens representing specific POS tags, and that pairs of tokens attend to each other more often if an edge in the dependency tree connects them, i.e., the dependency alignment is high. They observe that the strongest dependency alignment occurs in the middle layers of the model - the 4th and 5th. They also point out that different dependency types (labels) are captured in different places in the model: attention in the upper layers aligns more with subject relations, whereas in the lower layers it aligns with modifying relations, such as auxiliaries, determiners, conjunctions, and expletives.</p><p>Voita et al. <ref type="bibr" target="#b34">[35]</ref> also observed alignment with dependency relations in the encoders of neural machine translation systems from English to Russian, German, or French. They evaluated dependency accuracy for four dependency labels: noun subject, direct object, adjective modifier, and adverbial modifier. They separately address the cases where a verb attends to a dependent subject and where a subject attends to its governing verb. The heads with more than 10% improvement over a positional baseline are identified as syntactic<ref type="foot" target="#foot_3">6</ref>. Such heads are found in all encoder layers except the first one. In further experiments, the authors propose an algorithm to prune heads from the model with a minimal decrease in translation performance. During pruning, the share of syntactic heads rises from 17% in the original model to 40% when 75% of the heads are cut out, while the change in translation score is negligible. These results support the claim that the model's ability to capture syntax is essential to its performance in non-syntactic tasks.</p><p>A similar evaluation of dependency accuracy for the BERT language model was conducted by Clark et al. <ref type="bibr" target="#b6">[7]</ref>. They identify syntactic heads that significantly outperform the positional baseline for the following labels: prepositional object, determiner, direct object, possession modifier, auxiliary passive, clausal component, marker, phrasal verb particle. The syntactic heads are found in the middle layers (4th to 8th). However, there is no single head that would capture the information for all the relations.</p><p>In another experiment, Clark et al. <ref type="bibr" target="#b6">[7]</ref> induce a dependency tree from attentions. Instead of extracting a structure from each head separately <ref type="bibr" target="#b30">[31]</ref>, they use probing to find a weighted average of all heads. The maximum spanning tree algorithm is used to induce the dependency structure from the average. This approach produces trees with 61% UAS and can be improved to 77% by making the weights dependent on static word representations (fixed GloVe vectors). 
Both numbers are significantly higher than the right-branching baseline of 27%.</p><p>A related analysis for English (BERT) and the multilingual variant (mBERT) was conducted by Limisiewicz et al. <ref type="bibr" target="#b14">[15]</ref>. We observed that the information about one dependency type is often split across many self-attention heads, and in other cases the opposite happens - many heads have the same syntactic function. We extract labeled dependency trees from the averaged heads, achieving 52% UAS, and show that in the multilingual model (mBERT), specific relations (noun subject, determiner) are found in the same heads across typologically similar languages.</p></div>
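A compact sketch of this family of methods follows: the attention matrix (from a single head, or a weighted average of heads) is symmetrized, and a maximum spanning tree connects the strongly attending token pairs. The Dirichlet toy matrix is a stand-in assumption for real attention weights.

```python
# Sketch of unsupervised tree induction from attention [31, 7]: connect token
# pairs with high attention using a maximum spanning tree (implemented as an
# MST over negated weights).
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def tree_from_attention(A):
    sym = A + A.T                      # attention is not symmetric; combine both
    np.fill_diagonal(sym, 0.0)         # no self-loops
    mst = minimum_spanning_tree(-sym)  # minimize negated weight = maximize weight
    return list(zip(*mst.nonzero()))   # undirected edges as index pairs

A = np.random.dirichlet(np.ones(6), size=6)  # toy attention: rows sum to 1
print(tree_from_attention(A))
```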
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Constituency Trees</head><p>There are fewer papers devoted to deriving constituency syntactic tree structures.</p><p>Mareček and Rosa <ref type="bibr" target="#b20">[21]</ref> examined the encoder of a machine translation system for translation between English, French, and German. We observed that in some heads, stretches of words attend to the same token, forming shapes similar to balustrades (Figure <ref type="figure" target="#fig_6">7</ref>). Furthermore, those stretches usually overlap with syntactic phrases. This notion is employed in a new method for constituency tree induction. In the algorithm, the weights for each stretch of tokens are computed by summing the attention focused on the balustrades, and a constituency tree is then induced with the CKY algorithm <ref type="bibr" target="#b25">[26]</ref>. As a result, we produce trees that achieve up to a 32.8% F1 score for English sentences, 43.6% for German, and 44.2% for French. 7 The results can be improved by selecting syntactic heads and using only them in the algorithm. This approach requires a sample of 100 annotated sentences for head selection and raises the F1 score.</p><p>The extraction of constituency trees from language models was described by Kim et al. <ref type="bibr" target="#b12">[13]</ref>. They present a comprehensive study that covers several pre-trained networks: BERT (base, large), GPT-2 <ref type="bibr" target="#b29">[30]</ref> (original, medium), RoBERTa <ref type="bibr" target="#b18">[19]</ref> (base, large), and XLNet <ref type="bibr" target="#b37">[38]</ref> (base, large). Their approach is based on computing a distance between each pair of subsequent words. At each step, they branch the tree at the position where the distance is the highest. The authors try three distance measures on the vector outputs of the encoder layers (cosine, L1, and L2 distances for pairs of vectors) and two distance measures on the distributions of a token's attention (Jensen-Shannon and Hellinger distances for pairs of distributions). In the former case, distances are computed only per layer, and in the latter case for each head and for the average of heads in one layer. The best setting achieves a 40.1% F1 score on the WSJ Penn Treebank. It uses XLNet-base and the Hellinger distance on averaged attentions in the 7th layer. Generally, attention distribution distances perform better than vector ones. The authors also observe that models trained on a regular language modeling objective (i.e., next word prediction in GPT-2, XLNet) capture syntax better than masked language models (BERT, RoBERTa). In line with previous research, the middle layers tend to be more syntactic.</p></div>
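The top-down procedure of Kim et al. can be stated in a few lines. The sketch below assumes precomputed distances between adjacent words (one number per gap, e.g., a Hellinger distance between attention distributions) and recursively splits at the largest one; the toy gap values are placeholders.

```python
# Sketch of top-down constituency induction: recursively split the sentence
# at the adjacent-word pair with the largest representation distance.
import numpy as np

def induce_tree(gaps, lo, hi):
    # gaps[i] is the distance between word i and word i + 1
    if hi - lo < 1:
        return lo                                  # single-word span
    split = lo + int(np.argmax(gaps[lo:hi]))       # largest syntactic "gap"
    return (induce_tree(gaps, lo, split), induce_tree(gaps, split + 1, hi))

words = ["the", "cat", "sat", "on", "the", "mat"]
gaps = np.array([0.1, 0.9, 0.3, 0.2, 0.1])         # toy pairwise distances
print(induce_tree(gaps, 0, len(words) - 1))        # ((0, 1), (2, (3, (4, 5))))
```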
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Syntactic Information across Layers</head><p>Figure <ref type="figure" target="#fig_7">8</ref> summarizes the evaluation of syntactic information across layers for different approaches. In Transformer-based language models (BERT, mBERT, and GPT-2), the middle layers are the most syntactic. In neural machine translation models, the top layers of the encoder are the most syntactic. However, it is important to note that the NMT Transformer encoder is only the first half of the whole translation architecture, and therefore the most syntactic layers are, in fact, in the middle of the process. In the RNN language model (ELMo), the first layer is more syntactic than the second one.</p><p>We conjecture that the initial Transformer layers capture simple relations (e.g., attending to the next or previous tokens) and the last layers mostly capture task-specific information. Therefore, they are less syntactic.</p><p>We also observe that in supervised probing <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b5">6]</ref>, better results are obtained from the initial and top layers than in unsupervised structure induction <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b14">15]</ref>, i.e., the distribution across layers is smoother.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this overview, we have surveyed evidence that syntactic structures are latently learned by neural models trained for natural language processing tasks. We have compared multiple approaches and described the features that affect the ability to capture syntax. The following aspects tend to improve the performance on syntactic tasks such as POS tagging:</p><p>1. Using contextual embeddings from RNNs or Transformers outperforms static word embeddings (Word2Vec, GloVe).</p><p>2. Pre-training on tasks with masked input (language modeling or machine translation) produces better syntactic representations than auto-encoding.</p><p>3. The advantage of language modeling over machine translation is the fact that larger corpora are available for pre-training.</p><p>Our meta-analysis of latent states showed that the most syntactic representations can be found in the middle layers of the model. They tend to capture more complex relations than the initial layers, and their representations are less dependent on the pre-training objectives than those of the top layers.</p><p>We have shown to what extent systems trained for a non-syntactic task can learn grammatical structures. The question we leave for further research is whether providing explicit syntactic information to the model can improve its performance on other NLP tasks.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Visualization of the attention mechanism in the Transformer architecture. It shows which parts of the text are important for computing the representation of the word "to". Created in the BertViz framework [33].</figDesc><graphic coords="2,56.69,80.50,231.02,138.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Spatial distribution of word embeddings depends on syntactic roles of words (visualization created by Ashutosh Singh).</figDesc><graphic coords="2,79.80,288.92,184.82,127.16" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Accuracy of POS tag probing from RNN representation by the pre-training objective.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head></head><label></label><figDesc>Legend: Belinkov et al. 2017b <ref type="bibr" target="#b3">[4]</ref>, Blevins et al. 2018 <ref type="bibr" target="#b4">[5]</ref>, Musil 2019 <ref type="bibr" target="#b24">[25]</ref>, Liu et al. 2019 <ref type="bibr" target="#b16">[17]</ref>.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Accuracy of POS tag probing from RNN latent vectors compared with static word embeddings</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Self-attention in particular heads of a language model (BERT) aligns with the dependency relations adjective modifier and object. The gold relations are marked with Xs.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Balustrades observed in NMT's encoder tend to overlap with syntactic phrases.</figDesc><graphic coords="7,134.88,274.71,152.88,152.88" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Relative syntactic information across models and layers. The values are normalized so that the best layer for each method has the value 1.0. The methods A), B), C), and G) show the undirected UAS of trees extracted by probing the n-th layer <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b5">6]</ref>. The method D) shows the dependency alignment averaged across all heads in each layer <ref type="bibr" target="#b33">[34]</ref>. The methods E) and F) show the UAS of trees induced from attention heads by the maximum spanning tree algorithm <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b14">15]</ref>. The results for the best layer (corresponding to the value 1.0 in the plot) are: A) 82.5; B) 79.8; C) 80.1; D) 22.3; E) 24.3; F) en2cs: 23.9, en2de: 20.9, en2et: 22.1, en2fi: 24.0, en2ru: 22.4, en2tr: 17.5, en2zh: 21.6; G) 77.0.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="5,307.56,80.51,231.01,223.12" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Summary of syntactic properties observed in Transformer's self-attention heads.</figDesc><table><row><cell>Research</cell><cell>Transformer Model</cell><cell>Type of tree</cell><cell>Syntactic evaluation</cell><cell>Evaluation data</cell><cell>Percentage of syntactic heads</cell></row><row><cell>Raganato and Tiedemann 2019 [31]</cell><cell>NMT Encoder (6 layers, 8 heads)</cell><cell>Dependency</cell><cell>Tree induction</cell><cell>PUD [27]</cell><cell>0% - 8% 5</cell></row><row><cell>Vig and Belinkov 2019 [34]</cell><cell>LM (GPT-2)</cell><cell>Dependency</cell><cell>Dependency Alignment</cell><cell>Wikipedia (automatically annotated)</cell><cell>-</cell></row><row><cell>Clark et al. 2019 [7]</cell><cell>LM (BERT)</cell><cell>Dependency</cell><cell>Dependency Accuracy, Tree induction</cell><cell>WSJ Penn Treebank [20]</cell><cell>-</cell></row><row><cell>Voita et al. 2019 [35]</cell><cell>NMT Encoder (6 layers, 8 heads)</cell><cell>Dependency</cell><cell>Dependency Accuracy</cell><cell>WMT, OpenSubtitles [16] (both automatically annotated)</cell><cell>15% - 19%</cell></row><row><cell>Limisiewicz et al. 2020 [15]</cell><cell>LMs (BERT, mBERT)</cell><cell>Dependency</cell><cell>Dependency Accuracy, Tree induction</cell><cell>PUD [27], EuroParl [14] (automatically annotated)</cell><cell>46%</cell></row><row><cell>Mareček and Rosa 2019 [21]</cell><cell>NMT Encoder (6 layers, 16 heads)</cell><cell>Constituency</cell><cell>Tree induction</cell><cell>EuroParl [14] (automatically annotated)</cell><cell>19% - 33%</cell></row><row><cell>Kim et al. 2019 [13]</cell><cell>LMs (BERT, GPT-2, RoBERTa, XLNet)</cell><cell>Constituency</cell><cell>Tree induction</cell><cell>WSJ Penn Treebank [20], MNLI [37]</cell><cell>-</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The test set is called syntactic by its authors; nevertheless, it mostly focuses on morphological features.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">Layer numbering in this work: We are numbering layers starting from one for the layer closest to the input. Please note that original papers may use different numbering.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">A head is syntactic when the tree extracted from it surpasses the right-branching chain in terms of UAS. It is a strong baseline for syntactic trees in English. Thus only a few heads are recognized as syntactic.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">In the positional baseline, the most frequent offset is added to the index of the relation's dependent/governor to find its governor/dependent; e.g., for adjective to noun relations, the most frequent offset is +1 in English.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been supported by the grant 18-02196S of the Czech Science Foundation. It has been using language resources and tools developed, stored and distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018101).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">Word embeddings: A survey</title>
		<author>
			<persName><forename type="first">Felipe</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geraldo</forename><surname>Xexéo</surname></persName>
		</author>
		<idno>CoRR, abs/1901.09069</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Neural machine translation by jointly learning to align and translate</title>
		<author>
			<persName><forename type="first">Dzmitry</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshua</forename><surname>Bengio</surname></persName>
		</author>
		<idno>CoRR, abs/1409.0473</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">What do neural machine translation models learn about morphology</title>
		<author>
			<persName><forename type="first">Yonatan</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nadir</forename><surname>Durrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fahim</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hassan</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Glass</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 55th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Vancouver, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2017-07">July 2017</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="861" to="872" />
		</imprint>
	</monogr>
	<note>: Long Papers)</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks</title>
		<author>
			<persName><forename type="first">Yonatan</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lluís</forename><surname>Màrquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hassan</forename><surname>Sajjad</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nadir</forename><surname>Durrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fahim</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Glass</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighth International Joint Conference on Natural Language Processing</title>
		<title level="s">Long Papers</title>
		<meeting>the Eighth International Joint Conference on Natural Language Processing<address><addrLine>Taipei, Taiwan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-11">November 2017</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1" to="10" />
		</imprint>
	</monogr>
	<note>Asian Federation of Natural Language Processing</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep RNNs encode soft hierarchical syntax</title>
		<author>
			<persName><forename type="first">Terra</forename><surname>Blevins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 56th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Melbourne, Australia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018-07">July 2018</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="14" to="19" />
		</imprint>
	</monogr>
	<note>: Short Papers)</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Finding universal grammatical relations in multilingual BERT</title>
		<author>
			<persName><forename type="first">Ethan</forename><forename type="middle">A</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">John</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 58th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020-07">July 2020</date>
			<biblScope unit="page" from="5564" to="5577" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">What does BERT look at? An analysis of BERT&apos;s attention</title>
		<author>
			<persName><forename type="first">Kevin</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Urvashi</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Indexing by latent semantic analysis</title>
		<author>
			<persName><forename type="first">Scott</forename><surname>Deerwester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Susan</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">George</forename><forename type="middle">W</forename><surname>Furnas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><forename type="middle">K</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Harshman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Society for Information Science</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="391" to="407" />
			<date type="published" when="1990">1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ming-Wei</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL-HLT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Deep biaffine attention for neural dependency parsing</title>
		<author>
			<persName><forename type="first">Timothy</forename><surname>Dozat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">5th International Conference on Learning Representations, ICLR 2017</title>
				<meeting><address><addrLine>Toulon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">April 24-26, 2017. 2017</date>
		</imprint>
	</monogr>
	<note>Conference Track Proceedings</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Distributional structure</title>
		<author>
			<persName><forename type="first">Zellig</forename><surname>Harris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Word</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">23</biblScope>
			<biblScope unit="page" from="146" to="162" />
			<date type="published" when="1954">1954</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A structural probe for finding syntax in word representations</title>
		<author>
			<persName><forename type="first">John</forename><surname>Hewitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL-HLT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction</title>
		<author>
			<persName><forename type="first">Taeuk</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jihun</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Edmiston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanggoo</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2020-01">January 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Europarl: A parallel corpus for statistical machine translation</title>
		<author>
			<persName><forename type="first">Philipp</forename><surname>Koehn</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page">11</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Universal dependencies according to BERT: both more specific and more general</title>
		<author>
			<persName><forename type="first">Tomasz</forename><surname>Limisiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rudolf</forename><surname>Rosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Mareček</surname></persName>
		</author>
		<idno>ArXiv, abs/2004.14620</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora</title>
		<author>
			<persName><forename type="first">Pierre</forename><surname>Lison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jörg</forename><surname>Tiedemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Milen</forename><surname>Kouylekov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</title>
				<meeting>the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)<address><addrLine>Miyazaki, Japan</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA</publisher>
			<date type="published" when="2018-05">May 2018</date>
		</imprint>
	</monogr>
	<note>European Language Resources Association</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Linguistic knowledge and transferability of contextual representations</title>
		<author>
			<persName><forename type="first">Nelson</forename><forename type="middle">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonatan</forename><surname>Belinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthew</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noah</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL-HLT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">A survey on contextual embeddings</title>
		<author>
			<persName><forename type="first">Qi</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><forename type="middle">J</forename><surname>Kusner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Phil</forename><surname>Blunsom</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2003.07278</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Roberta: A robustly optimized bert pretraining approach</title>
		<author>
			<persName><forename type="first">Yinhan</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Myle</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Naman</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jingfei</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mandar</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Danqi</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Omer</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mike</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Veselin</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Building a large annotated corpus of English: The Penn Treebank</title>
		<author>
			<persName><forename type="first">Mitchell</forename><forename type="middle">P</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Beatrice</forename><surname>Santorini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mary</forename><forename type="middle">Ann</forename><surname>Marcinkiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="313" to="330" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">From balustrades to pierre vinken: Looking for syntax in transformer self-attentions</title>
		<author>
			<persName><forename type="first">David</forename><surname>Mareček</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rudolf</forename><surname>Rosa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-08">August 2019</date>
			<biblScope unit="page" from="263" to="275" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Learned in translation: Contextualized word vectors</title>
		<author>
			<persName><forename type="first">Bryan</forename><surname>Mccann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Bradbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Caiming</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Socher</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="6297" to="6308" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Efficient estimation of word representations in vector space</title>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kai</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Greg</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013-07">July 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Linguistic regularities in continuous space word representations</title>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wen-Tau</forename><surname>Yih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Zweig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Atlanta, Georgia</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2013-06">June 2013</date>
			<biblScope unit="page" from="746" to="751" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Examining Structure of Word Embeddings with PCA</title>
		<author>
			<persName><forename type="first">Tomáš</forename><surname>Musil</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text, Speech, and Dialogue</title>
				<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="211" to="223" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Dynamic programming parsing for contextfree grammars in continuous speech recognition</title>
		<author>
			<persName><forename type="first">H</forename><surname>Ney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Signal Processing</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="336" to="340" />
			<date type="published" when="1991">1991</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Universal dependencies 2.0 -CoNLL 2017 shared task development and test data</title>
		<author><persName><forename type="first">Joakim</forename><surname>Nivre</surname></persName></author>
		<author><persName><forename type="first">Željko</forename><surname>Agić</surname></persName></author>
		<author><persName><forename type="first">Lars</forename><surname>Ahrenberg</surname></persName></author>
		<author><persName><forename type="first">Lene</forename><surname>Antonsen</surname></persName></author>
		<author><persName><forename type="first">Maria</forename><surname>Jesus Aranzabe</surname></persName></author>
		<author><persName><forename type="first">Masayuki</forename><surname>Asahara</surname></persName></author>
		<author><persName><forename type="first">Luma</forename><surname>Ateyah</surname></persName></author>
		<author><persName><forename type="first">Mohammed</forename><surname>Attia</surname></persName></author>
		<author><persName><forename type="first">Aitziber</forename><surname>Atutxa</surname></persName></author>
		<author><persName><forename type="first">Elena</forename><surname>Badmaeva</surname></persName></author>
		<author><persName><forename type="first">Miguel</forename><surname>Ballesteros</surname></persName></author>
		<author><persName><forename type="first">Esha</forename><surname>Banerjee</surname></persName></author>
		<author><persName><forename type="first">Sebastian</forename><surname>Bank</surname></persName></author>
		<author><persName><forename type="first">John</forename><surname>Bauer</surname></persName></author>
		<author><persName><forename type="first">Kepa</forename><surname>Bengoetxea</surname></persName></author>
		<author><persName><forename type="first">Riyaz</forename><forename type="middle">Ahmad</forename><surname>Bhat</surname></persName></author>
		<author><persName><forename type="first">Eckhard</forename><surname>Bick</surname></persName></author>
		<author><persName><forename type="first">Cristina</forename><surname>Bosco</surname></persName></author>
		<author><persName><forename type="first">Gosse</forename><surname>Bouma</surname></persName></author>
		<author><persName><forename type="first">Sam</forename><surname>Bowman</surname></persName></author>
		<author><persName><forename type="first">Aljoscha</forename><surname>Burchardt</surname></persName></author>
		<author><persName><forename type="first">Marie</forename><surname>Candito</surname></persName></author>
		<author><persName><forename type="first">Gauthier</forename><surname>Caron</surname></persName></author>
		<author><persName><forename type="first">Gülşen</forename><surname>Cebiroğlu Eryiğit</surname></persName></author>
		<author><persName><forename type="first">Giuseppe</forename><forename type="middle">G A</forename><surname>Celano</surname></persName></author>
		<author><persName><forename type="first">Savas</forename><surname>Cetin</surname></persName></author>
		<author><persName><forename type="first">Fabricio</forename><surname>Chalub</surname></persName></author>
		<author><persName><forename type="first">Jinho</forename><surname>Choi</surname></persName></author>
		<author><persName><forename type="first">Yongseok</forename><surname>Cho</surname></persName></author>
		<author><persName><forename type="first">Silvie</forename><surname>Cinková</surname></persName></author>
		<author><persName><forename type="first">Çağrı</forename><surname>Çöltekin</surname></persName></author>
		<author><persName><forename type="first">Miriam</forename><surname>Connor</surname></persName></author>
		<author><persName><forename type="first">Marie-Catherine</forename><surname>de Marneffe</surname></persName></author>
		<author><persName><forename type="first">Valeria</forename><surname>de Paiva</surname></persName></author>
		<author><persName><forename type="first">Arantza</forename><surname>Diaz de Ilarraza</surname></persName></author>
		<author><persName><forename type="first">Kaja</forename><surname>Dobrovoljc</surname></persName></author>
		<author><persName><forename type="first">Timothy</forename><surname>Dozat</surname></persName></author>
		<author><persName><forename type="first">Kira</forename><surname>Droganova</surname></persName></author>
		<author><persName><forename type="first">Marhaba</forename><surname>Eli</surname></persName></author>
		<author><persName><forename type="first">Ali</forename><surname>Elkahky</surname></persName></author>
		<author><persName><forename type="first">Tomaž</forename><surname>Erjavec</surname></persName></author>
		<author><persName><forename type="first">Richárd</forename><surname>Farkas</surname></persName></author>
		<author><persName><forename type="first">Hector</forename><surname>Fernandez Alcalde</surname></persName></author>
		<author><persName><forename type="first">Jennifer</forename><surname>Foster</surname></persName></author>
		<author><persName><forename type="first">Cláudia</forename><surname>Freitas</surname></persName></author>
		<author><persName><forename type="first">Katarína</forename><surname>Gajdošová</surname></persName></author>
		<author><persName><forename type="first">Daniel</forename><surname>Galbraith</surname></persName></author>
		<author><persName><forename type="first">Marcos</forename><surname>Garcia</surname></persName></author>
		<author><persName><forename type="first">Filip</forename><surname>Ginter</surname></persName></author>
		<author><persName><forename type="first">Iakes</forename><surname>Goenaga</surname></persName></author>
		<author><persName><forename type="first">Koldo</forename><surname>Gojenola</surname></persName></author>
		<author><persName><forename type="first">Memduh</forename><surname>Gökırmak</surname></persName></author>
		<author><persName><forename type="first">Yoav</forename><surname>Goldberg</surname></persName></author>
		<author><persName><forename type="first">Xavier</forename><surname>Gómez Guinovart</surname></persName></author>
		<author><persName><forename type="first">Berta</forename><forename type="middle">Gonzáles</forename><surname>Saavedra</surname></persName></author>
		<author><persName><forename type="first">Matias</forename><surname>Grioni</surname></persName></author>
		<author><persName><forename type="first">Normunds</forename><surname>Grūzītis</surname></persName></author>
		<author><persName><forename type="first">Bruno</forename><surname>Guillaume</surname></persName></author>
		<author><persName><forename type="first">Nizar</forename><surname>Habash</surname></persName></author>
		<author><persName><forename type="first">Jan</forename><surname>Hajič</surname></persName></author>
		<author><persName><forename type="first">Jan</forename><surname>Hajič jr.</surname></persName></author>
		<author><persName><forename type="first">Linh</forename><surname>Hà Mỹ</surname></persName></author>
		<author><persName><forename type="first">Kim</forename><surname>Harris</surname></persName></author>
		<author><persName><forename type="first">Dag</forename><surname>Haug</surname></persName></author>
		<author><persName><forename type="first">Barbora</forename><surname>Hladká</surname></persName></author>
		<author><persName><forename type="first">Jaroslava</forename><surname>Hlaváčová</surname></persName></author>
		<author><persName><forename type="first">Petter</forename><surname>Hohle</surname></persName></author>
		<author><persName><forename type="first">Radu</forename><surname>Ion</surname></persName></author>
		<author><persName><forename type="first">Elena</forename><surname>Irimia</surname></persName></author>
		<author><persName><forename type="first">Anders</forename><surname>Johannsen</surname></persName></author>
		<author><persName><forename type="first">Fredrik</forename><surname>Jørgensen</surname></persName></author>
		<author><persName><forename type="first">Hüner</forename><surname>Kaşıkara</surname></persName></author>
		<author><persName><forename type="first">Hiroshi</forename><surname>Kanayama</surname></persName></author>
		<author><persName><forename type="first">Jenna</forename><surname>Kanerva</surname></persName></author>
		<author><persName><forename type="first">Tolga</forename><surname>Kayadelen</surname></persName></author>
		<author><persName><forename type="first">Václava</forename><surname>Kettnerová</surname></persName></author>
		<author><persName><forename type="first">Jesse</forename><surname>Kirchner</surname></persName></author>
		<author><persName><forename type="first">Natalia</forename><surname>Kotsyba</surname></persName></author>
		<author><persName><forename type="first">Simon</forename><surname>Krek</surname></persName></author>
		<author><persName><forename type="first">Sookyoung</forename><surname>Kwak</surname></persName></author>
		<author><persName><forename type="first">Veronika</forename><surname>Laippala</surname></persName></author>
		<author><persName><forename type="first">Lorenzo</forename><surname>Lambertino</surname></persName></author>
		<author><persName><forename type="first">Tatiana</forename><surname>Lando</surname></persName></author>
		<author><persName><forename type="first">Phương</forename><surname>Lê Hồng</surname></persName></author>
		<author><persName><forename type="first">Alessandro</forename><surname>Lenci</surname></persName></author>
		<author><persName><forename type="first">Saran</forename><surname>Lertpradit</surname></persName></author>
		<author><persName><forename type="first">Herman</forename><surname>Leung</surname></persName></author>
		<author><persName><forename type="first">Cheuk</forename><forename type="middle">Ying</forename><surname>Li</surname></persName></author>
		<author><persName><forename type="first">Josie</forename><surname>Li</surname></persName></author>
		<author><persName><forename type="first">Nikola</forename><surname>Ljubešić</surname></persName></author>
		<author><persName><forename type="first">Olga</forename><surname>Loginova</surname></persName></author>
		<author><persName><forename type="first">Olga</forename><surname>Lyashevskaya</surname></persName></author>
		<author><persName><forename type="first">Teresa</forename><surname>Lynn</surname></persName></author>
		<author><persName><forename type="first">Vivien</forename><surname>Macketanz</surname></persName></author>
		<author><persName><forename type="first">Aibek</forename><surname>Makazhanov</surname></persName></author>
		<author><persName><forename type="first">Michael</forename><surname>Mandl</surname></persName></author>
		<author><persName><forename type="first">Christopher</forename><surname>Manning</surname></persName></author>
		<author><persName><forename type="first">Ruli</forename><surname>Manurung</surname></persName></author>
		<author><persName><forename type="first">Cătălina</forename><surname>Mărănduc</surname></persName></author>
		<author><persName><forename type="first">David</forename><surname>Mareček</surname></persName></author>
		<author><persName><forename type="first">Katrin</forename><surname>Marheinecke</surname></persName></author>
		<author><persName><forename type="first">Héctor</forename><surname>Martínez Alonso</surname></persName></author>
		<author><persName><forename type="first">André</forename><surname>Martins</surname></persName></author>
		<author><persName><forename type="first">Jan</forename><surname>Mašek</surname></persName></author>
		<author><persName><forename type="first">Yuji</forename><surname>Matsumoto</surname></persName></author>
		<author><persName><forename type="first">Ryan</forename><surname>McDonald</surname></persName></author>
		<author><persName><forename type="first">Gustavo</forename><surname>Mendonça</surname></persName></author>
		<author><persName><forename type="first">Anna</forename><surname>Missilä</surname></persName></author>
		<author><persName><forename type="first">Verginica</forename><surname>Mititelu</surname></persName></author>
		<author><persName><forename type="first">Yusuke</forename><surname>Miyao</surname></persName></author>
		<author><persName><forename type="first">Simonetta</forename><surname>Montemagni</surname></persName></author>
		<author><persName><forename type="first">Amir</forename><surname>More</surname></persName></author>
		<author><persName><forename type="first">Laura</forename><surname>Moreno Romero</surname></persName></author>
		<author><persName><forename type="first">Shunsuke</forename><surname>Mori</surname></persName></author>
		<author><persName><forename type="first">Bohdan</forename><surname>Moskalevskyi</surname></persName></author>
		<author><persName><forename type="first">Kadri</forename><surname>Muischnek</surname></persName></author>
		<author><persName><forename type="first">Nina</forename><surname>Mustafina</surname></persName></author>
		<author><persName><forename type="first">Kaili</forename><surname>Müürisep</surname></persName></author>
		<author><persName><forename type="first">Pinkey</forename><surname>Nainwani</surname></persName></author>
		<author><persName><forename type="first">Anna</forename><surname>Nedoluzhko</surname></persName></author>
		<author><persName><forename type="first">Lương</forename><surname>Nguyễn Thị</surname></persName></author>
		<author><persName><forename type="first">Huyền</forename><surname>Nguyễn Thị Minh</surname></persName></author>
		<author><persName><forename type="first">Vitaly</forename><surname>Nikolaev</surname></persName></author>
		<author><persName><forename type="first">Rattima</forename><surname>Nitisaroj</surname></persName></author>
		<author><persName><forename type="first">Hanna</forename><surname>Nurmi</surname></persName></author>
		<author><persName><forename type="first">Stina</forename><surname>Ojala</surname></persName></author>
		<author><persName><forename type="first">Petya</forename><surname>Osenova</surname></persName></author>
		<author><persName><forename type="first">Lilja</forename><surname>Øvrelid</surname></persName></author>
		<author><persName><forename type="first">Elena</forename><surname>Pascual</surname></persName></author>
		<author><persName><forename type="first">Marco</forename><surname>Passarotti</surname></persName></author>
		<author><persName><forename type="first">Cenel-Augusto</forename><surname>Perez</surname></persName></author>
		<author><persName><forename type="first">Guy</forename><surname>Perrier</surname></persName></author>
		<author><persName><forename type="first">Slav</forename><surname>Petrov</surname></persName></author>
		<author><persName><forename type="first">Jussi</forename><surname>Piitulainen</surname></persName></author>
		<author><persName><forename type="first">Emily</forename><surname>Pitler</surname></persName></author>
		<author><persName><forename type="first">Barbara</forename><surname>Plank</surname></persName></author>
		<author><persName><forename type="first">Martin</forename><surname>Popel</surname></persName></author>
		<author><persName><forename type="first">Lauma</forename><surname>Pretkalniņa</surname></persName></author>
		<author><persName><forename type="first">Prokopis</forename><surname>Prokopidis</surname></persName></author>
		<author><persName><forename type="first">Tiina</forename><surname>Puolakainen</surname></persName></author>
		<author><persName><forename type="first">Sampo</forename><surname>Pyysalo</surname></persName></author>
		<author><persName><forename type="first">Alexandre</forename><surname>Rademaker</surname></persName></author>
		<author><persName><forename type="first">Livy</forename><surname>Real</surname></persName></author>
		<author><persName><forename type="first">Siva</forename><surname>Reddy</surname></persName></author>
		<author><persName><forename type="first">Georg</forename><surname>Rehm</surname></persName></author>
		<author><persName><forename type="first">Larissa</forename><surname>Rinaldi</surname></persName></author>
		<author><persName><forename type="first">Laura</forename><surname>Rituma</surname></persName></author>
		<author><persName><forename type="first">Rudolf</forename><surname>Rosa</surname></persName></author>
		<author><persName><forename type="first">Davide</forename><surname>Rovati</surname></persName></author>
		<author><persName><forename type="first">Shadi</forename><surname>Saleh</surname></persName></author>
		<author><persName><forename type="first">Manuela</forename><surname>Sanguinetti</surname></persName></author>
		<author><persName><forename type="first">Baiba</forename><surname>Saulīte</surname></persName></author>
		<author><persName><forename type="first">Yanin</forename><surname>Sawanakunanon</surname></persName></author>
		<author><persName><forename type="first">Sebastian</forename><surname>Schuster</surname></persName></author>
		<author><persName><forename type="first">Djamé</forename><surname>Seddah</surname></persName></author>
		<author><persName><forename type="first">Wolfgang</forename><surname>Seeker</surname></persName></author>
		<author><persName><forename type="first">Mojgan</forename><surname>Seraji</surname></persName></author>
		<author><persName><forename type="first">Lena</forename><surname>Shakurova</surname></persName></author>
		<author><persName><forename type="first">Mo</forename><surname>Shen</surname></persName></author>
		<author><persName><forename type="first">Atsuko</forename><surname>Shimada</surname></persName></author>
		<author><persName><forename type="first">Muh</forename><surname>Shohibussirri</surname></persName></author>
		<author><persName><forename type="first">Natalia</forename><surname>Silveira</surname></persName></author>
		<author><persName><forename type="first">Maria</forename><surname>Simi</surname></persName></author>
		<author><persName><forename type="first">Radu</forename><surname>Simionescu</surname></persName></author>
		<author><persName><forename type="first">Katalin</forename><surname>Simkó</surname></persName></author>
		<author><persName><forename type="first">Mária</forename><surname>Šimková</surname></persName></author>
		<author><persName><forename type="first">Kiril</forename><surname>Simov</surname></persName></author>
		<author><persName><forename type="first">Aaron</forename><surname>Smith</surname></persName></author>
		<author><persName><forename type="first">Antonio</forename><surname>Stella</surname></persName></author>
		<author><persName><forename type="first">Jana</forename><surname>Strnadová</surname></persName></author>
		<author><persName><forename type="first">Alane</forename><surname>Suhr</surname></persName></author>
		<author><persName><forename type="first">Umut</forename><surname>Sulubacak</surname></persName></author>
		<author><persName><forename type="first">Zsolt</forename><surname>Szántó</surname></persName></author>
		<author><persName><forename type="first">Dima</forename><surname>Taji</surname></persName></author>
		<author><persName><forename type="first">Takaaki</forename><surname>Tanaka</surname></persName></author>
		<author><persName><forename type="first">Trond</forename><surname>Trosterud</surname></persName></author>
		<author><persName><forename type="first">Anna</forename><surname>Trukhina</surname></persName></author>
		<author><persName><forename type="first">Reut</forename><surname>Tsarfaty</surname></persName></author>
		<author><persName><forename type="first">Francis</forename><surname>Tyers</surname></persName></author>
		<author><persName><forename type="first">Sumire</forename><surname>Uematsu</surname></persName></author>
		<author><persName><forename type="first">Zdeňka</forename><surname>Urešová</surname></persName></author>
		<author><persName><forename type="first">Larraitz</forename><surname>Uria</surname></persName></author>
		<author><persName><forename type="first">Hans</forename><surname>Uszkoreit</surname></persName></author>
		<author><persName><forename type="first">Gertjan</forename><surname>van Noord</surname></persName></author>
		<author><persName><forename type="first">Viktor</forename><surname>Varga</surname></persName></author>
		<author><persName><forename type="first">Veronika</forename><surname>Vincze</surname></persName></author>
		<author><persName><forename type="first">Jonathan</forename><forename type="middle">North</forename><surname>Washington</surname></persName></author>
		<author><persName><forename type="first">Zhuoran</forename><surname>Yu</surname></persName></author>
		<author><persName><forename type="first">Zdeněk</forename><surname>Žabokrtský</surname></persName></author>
		<author><persName><forename type="first">Daniel</forename><surname>Zeman</surname></persName></author>
		<author><persName><forename type="first">Hanzhi</forename><surname>Zhu</surname></persName></author>
	</analytic>
	<monogr>
		<title level="m">LIN-DAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
		<respStmt>
			<orgName>Faculty of Mathematics and Physics, Charles University</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Glove: Global vectors for word representation</title>
		<author>
			<persName><forename type="first">Jeffrey</forename><surname>Pennington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Empirical Methods in Natural Language Processing (EMNLP)</title>
				<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1532" to="1543" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Deep contextualized word representations</title>
		<author>
			<persName><forename type="first">Matthew</forename><forename type="middle">E</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Neumann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohit</forename><surname>Iyyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matt</forename><surname>Gardner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenton</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luke</forename><surname>Zettlemoyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2018-06">June 2018</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
	<note>Long Papers</note>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Language models are unsupervised multitask learners</title>
		<author>
			<persName><forename type="first">Alec</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeff</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rewon</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Luan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><surname>Amodei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ilya</forename><surname>Sutskever</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">An analysis of encoder representations in transformer-based machine translation</title>
		<author>
			<persName><forename type="first">Alessandro</forename><surname>Raganato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jörg</forename><surname>Tiedemann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Brussels, Belgium</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-11">November 2018</date>
			<biblScope unit="page" from="287" to="297" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">Ashish</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noam</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Niki</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jakob</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Llion</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aidan</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lukasz</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Illia</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</title>
				<meeting><address><addrLine>Long Beach, CA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-12">December 2017. 2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">A multiscale visualization of attention in the transformer model</title>
		<author>
			<persName><forename type="first">Jesse</forename><surname>Vig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019</title>
				<meeting>the 57th Conference of the Association for Computational Linguistics, ACL 2019<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019-08-02">July 28 -August 2, 2019. 2019</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="37" to="42" />
		</imprint>
	</monogr>
	<note>System Demonstrations</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Analyzing the Structure of Attention in a Transformer Language Model</title>
		<author>
			<persName><forename type="first">Jesse</forename><surname>Vig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonatan</forename><surname>Belinkov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2019-08">August 2019</date>
			<biblScope unit="page" from="63" to="76" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Analyzing multi-head selfattention: Specialized heads do the heavy lifting, the rest can be pruned</title>
		<author>
			<persName><forename type="first">Elena</forename><surname>Voita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Talbot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fedor</forename><surname>Moiseev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rico</forename><surname>Sennrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ivan</forename><surname>Titov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-07">July 2019</date>
			<biblScope unit="page" from="5797" to="5808" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Cross-lingual bert transformation for zero-shot dependency parsing</title>
		<author>
			<persName><forename type="first">Yuxuan</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wanxiang</forename><surname>Che</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jiang</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yijia</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ting</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">A broad-coverage challenge corpus for sentence understanding through inference</title>
		<author>
			<persName><forename type="first">Adina</forename><surname>Williams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nikita</forename><surname>Nangia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><surname>Bowman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>New Orleans, Louisiana</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-06">June 2018</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1112" to="1122" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<title level="m" type="main">Xlnet: Generalized autoregressive pretraining for language understanding</title>
		<author>
			<persName><forename type="first">Zhilin</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zihang</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yiming</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jaime</forename><forename type="middle">G</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ruslan</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2019">2019</date>
			<publisher>NeurIPS</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Language modeling teaches you more syntax than translation does: Lessons learned through auxiliary task analysis</title>
		<author>
			<persName><forename type="first">Kelly</forename><forename type="middle">W</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Samuel</forename><forename type="middle">R</forename><surname>Bowman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</title>
				<meeting>the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP</meeting>
		<imprint>
			<date type="published" when="2018-11">November 2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
