<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Projecting Heterogeneous Annotations for Named Entity Recognition</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Rodrigo</forename><surname>Agerri</surname></persName>
							<affiliation key="aff0">
<orgName type="department">HiTZ Center - Ixa</orgName>
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">German</forename><surname>Rigau</surname></persName>
							<affiliation key="aff0">
<orgName type="department">HiTZ Center - Ixa</orgName>
								<orgName type="institution">University of the Basque Country UPV/EHU</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Projecting Heterogeneous Annotations for Named Entity Recognition</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D771628F98A471B96377B831E0B9C145</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T04:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Named Entity Recognition</term>
					<term>Information Extraction</term>
					<term>Natural Language Processing</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In this paper we describe our participation in the CAPITEL shared task on Named Entity Recognition (NER) at IberLEF 2020. Our objectives in participating in the shared task were twofold: (i) to benchmark current rich multilingual representations of text against monolingual models trained specifically for Spanish; (ii) to study various methods of projecting annotations from several sources into a final target prediction. Our results show that monolingual models, even for a high-resource language such as Spanish, perform better on this particular NER benchmark. Furthermore, our projection method indicates that substantial gains in performance can be obtained by projecting annotations from various heterogeneous sources to obtain the final prediction. Our submission obtained the best score, substantially outperforming the other participants of the CAPITEL 2020 NER task.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Named Entity Recognition (NER) is a widely studied Natural Language Processing (NLP) task. Briefly, the task involves annotating any mentions of entities (usually proper names) occurring in running text. The most common annotated corpora for NER focus on four types of named entities: Locations, Organizations, Persons and Other (Miscellaneous) entities. Spanish NER has been well studied, as Spanish was one of the languages proposed in the CoNLL NER shared tasks <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref>.</p><p>As for many other NLP tasks, the current best performing models for NER are those based on large pre-trained language models, which allow building rich representations of text based on contextual word embeddings. These approaches are based on character-based models like Flair <ref type="bibr" target="#b2">[3]</ref> or masked language models like BERT <ref type="bibr" target="#b3">[4]</ref>. Furthermore, multilingual versions of these models have been trained: the multilingual version of BERT <ref type="bibr" target="#b3">[4]</ref> was trained for 104 languages. More recently, XLM-RoBERTa <ref type="bibr" target="#b4">[5]</ref> was trained for 100 languages.</p><p>These publicly available multilingual deep learning models excel in tasks involving high-resource languages such as English, but their performance drops when applied to low-resource languages <ref type="bibr" target="#b5">[6]</ref>. This may occur for a number of reasons. First, each language has to share the quota of substrings and parameters with the rest of the languages represented in the pre-trained multilingual model. As the quota of substrings partially depends on corpus size, larger languages such as English or Spanish are better represented than lower-resource languages such as Basque <ref type="bibr" target="#b5">[6]</ref>. 
Moreover, multilingual models also seem to perform better for structurally similar languages <ref type="bibr" target="#b6">[7]</ref>.</p><p>In our submission for the CAPITEL 2020 NER task <ref type="bibr" target="#b7">[8]</ref> we leverage both these multilingual models and monolingual models trained specifically for Spanish. Furthermore, we project the annotations provided by each system into a final target prediction. The projection of several source annotations into a target is loosely inspired by a method originally designed for projecting annotations across languages <ref type="bibr" target="#b8">[9]</ref>. Our projection method indicates that substantial gains in performance (around 1.3 points in F1 score) can be obtained by projecting annotations from various heterogeneous sources into a final target prediction. Our submission obtained the best score, substantially outperforming the other participants of the CAPITEL 2020 NER task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Deep learning methods in NLP rely on the ability to represent words as continuous vectors in a low-dimensional space, called word embeddings. The first approaches generated static word embeddings <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>, namely, they provided a unique vector-based representation for a given word, independently of the context in which the word occurs. This means that polysemy cannot be represented. Thus, if we consider the word 'bank', static word embedding approaches will generate only one vector representation even though the word may have different senses, namely, 'financial institution', 'bench', etc.</p><p>In order to address this problem, contextual word embeddings were proposed. The idea is to generate different word representations according to the context in which the word appears. Currently there are many approaches to generate such contextual word representations, but we will focus on those that have had a direct impact, in terms of performance, on the Named Entity Recognition task. First, Flair <ref type="bibr" target="#b2">[3]</ref> representations are built following an LSTM-based architecture and trained as language models. Second, models based on the Transformer architecture <ref type="bibr" target="#b11">[12]</ref>, of which BERT is perhaps the most popular example <ref type="bibr" target="#b3">[4]</ref>.</p><p>The multilingual counterpart of BERT, called mBERT, is a single language model pre-trained from corpora in more than 100 languages. Another standout model is XLM-RoBERTa <ref type="bibr" target="#b4">[5]</ref>, also based on the Transformer architecture, which provides a pre-trained language model for 100 languages trained on 2.5 TB of Common Crawl text. 
Both mBERT and XLM-RoBERTa enable knowledge transfer across languages <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b6">7]</ref>, although in this paper we use them in a monolingual setting for Spanish NER.</p></div>
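The limitation of static embeddings described above can be illustrated with a toy sketch (the vectors and vocabulary are made up for illustration and do not come from any of the cited systems): a static lookup table assigns 'bank' the identical vector regardless of context, so its senses are indistinguishable.

```python
# Toy static embedding table: one vector per surface form (made-up values).
static = {"bank": [0.1, 0.9], "river": [0.8, 0.2], "fee": [0.0, 1.0]}

def embed_static(sentence):
    # A static lookup ignores context entirely: same word, same vector.
    return [static.get(word, [0.0, 0.0]) for word in sentence.split()]

# 'bank' as a financial institution vs. the bank of a river:
v1 = embed_static("the bank charged a fee")[1]
v2 = embed_static("the river bank was muddy")[2]
# v1 and v2 are identical, so the two senses collapse into one representation.
```

A contextual model, by contrast, would produce different vectors for the two occurrences because its representation is a function of the whole sentence.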
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Flair</head><p>Flair refers both to a system based on a BiLSTM architecture <ref type="bibr" target="#b14">[15]</ref> and to a specific type of character-based contextual word embeddings. Flair (embeddings and system) has been successfully applied to sequence labeling tasks, obtaining state-of-the-art results on a number of Named Entity Recognition (NER) and Part-of-Speech tagging benchmarks <ref type="bibr" target="#b2">[3]</ref>.</p><p>Flair embeddings are built from sequences of characters. More specifically, sentences are processed as sequences of characters and fed into a character-level Long Short-Term Memory (LSTM) model. For each sentence, a forward LSTM language model processes its sequence of characters from the beginning of the sentence to the last character of the word we are modeling. Furthermore, a backward LSTM performs the same operation going from the end of the sentence up to the first character of the word. The extracted hidden states contain information propagated from the end and the beginning of the sentence up to the first and the last character of the target word. Finally, the resulting two hidden states are concatenated to generate the final embedding.</p><p>Pooled embeddings are a type of Flair embeddings which consider global information in order to generate the final word embedding <ref type="bibr" target="#b15">[16]</ref>. In this approach embeddings are kept in a memory which is later used in a pooling operation to obtain a global word representation. This representation is then concatenated with the local Flair contextualized embedding obtained for a given word. It should be noted that the pooling operation is involved in fine-tuning the Flair pre-trained models, not in training the language models themselves. 
We use the default pooling operation, min, which computes the vector of element-wise minimum values over all stored embeddings <ref type="bibr" target="#b15">[16]</ref>.</p></div>
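As a minimal sketch of the min pooling just described (variable and function names are ours, not from the Flair codebase), the global representation is the element-wise minimum over the memory of contextual embeddings collected for a word, which is then concatenated with the local embedding:

```python
import numpy as np

def min_pool(memory):
    # memory: list of contextual embeddings seen so far for the same word.
    # The pooled vector takes the element-wise minimum across all of them.
    return np.min(np.stack(memory), axis=0)

# Two contextual embeddings of the same word from different sentences
# (made-up 3-dimensional vectors for illustration):
memory = [np.array([0.5, -1.0, 2.0]), np.array([0.2, 0.3, -0.5])]
pooled = min_pool(memory)                 # element-wise minimum
local = memory[-1]                        # the current local embedding
final = np.concatenate([local, pooled])   # pooled-Flair-style representation
```

Here `pooled` is `[0.2, -1.0, -0.5]`, and `final` has twice the dimensionality of a single embedding because of the concatenation.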
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Transformers</head><p>LSTM-based language models such as the one presented in the previous section cannot capture long-range sequence information. Furthermore, they are quite hard to train at a large scale (see <ref type="bibr" target="#b16">[17]</ref>, especially Figure <ref type="figure">7</ref>). In order to address these issues, the Transformer architecture was proposed <ref type="bibr" target="#b11">[12]</ref>, based on multi-headed self-attention and positional encoding. The most popular Transformer is BERT <ref type="bibr" target="#b3">[4]</ref>, which pre-trains a Transformer encoder on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. BERT is composed of stacked layers of Transformer encoders <ref type="bibr" target="#b11">[12]</ref>. More specifically, in this paper we use the BERT BASE configuration, which contains 12 Transformer encoder layers, a hidden size of 768 and 12 self-attention heads, for a total of 110M parameters.</p><p>The MLM task is designed as follows: for an input sequence of 𝑛 tokens 𝑥1, 𝑥2, ..., 𝑥𝑛, 15% are selected as masking candidates. Of those candidates, 80% are masked (replaced with the [MASK] token), 10% are replaced by a random word and the remaining 10% are left unchanged. For the NSP task, two segments are selected from the training corpus, 𝐴 and 𝐵. In 50% of the cases 𝐵 is the true next segment for 𝐴; for the rest, 𝐵 is just a random segment. The model is trained to optimize the sum of the means of the MLM and NSP likelihoods.</p><p>It should be noted that the benefits of the NSP task during pre-training have been questioned <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20]</ref>. 
Thus, other Transformer proposals such as RoBERTa train without the NSP task, showing strong performance on the same downstream tasks.</p><p>XLM-RoBERTa relies exclusively on the MLM objective. The biggest update that XLM-RoBERTa offers is a significantly increased amount of training data: 2.5 TB of clean Common Crawl text <ref type="bibr" target="#b4">[5]</ref>. As with BERT, in this paper we use the base version of XLM-RoBERTa, since the base versions fit into a standard GPU card with 12 GB of RAM for fine-tuning.</p></div>
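The 15% / 80-10-10 masking procedure described above can be sketched as follows. This is a simplified illustration, not BERT's actual implementation; the function name, vocabulary argument and rates are ours.

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """Return (masked sequence, labels); labels[i] holds the original token
    only at positions selected for prediction, and None elsewhere."""
    rng = rng or random.Random(0)
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() > mask_rate:       # ~85%: not a masking candidate
            continue
        labels[i] = tok                    # the model must predict this token
        r = rng.random()
        if r >= 0.9:                       # 10% of candidates: left unchanged
            continue
        if r >= 0.8:                       # 10%: replaced by a random token
            masked[i] = rng.choice(vocab)
        else:                              # 80%: replaced by [MASK]
            masked[i] = "[MASK]"
    return masked, labels
```

Keeping 10% of candidates unchanged forces the model to produce useful representations for every input token, not only for the literal [MASK] symbol.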
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experimental Setup</head><p>Named entities were originally annotated using the BIO encoding, which identifies the Beginning, the Inside and the Outside of named entities. Later on, the BILOU model<ref type="foot" target="#foot_0">1</ref> was proposed to mark tokens as the Beginning, the Inside and the Last tokens of multi-token entities, as well as Unit-length entities <ref type="bibr" target="#b20">[21]</ref>. Although the CAPITEL corpus is originally released using the BILOU model, we experiment with both types of encoding.</p><p>The CAPITEL corpus (Corpus del Plan de Impulso a las Tecnologías del Lenguaje) has been developed by the PlanTL, the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement (SEAD) of the Ministry of Economy. These organizations signed an agreement to develop a linguistically annotated corpus of Spanish news articles, with the objective of extending the language resource infrastructure for the Spanish language. CAPITEL is composed of contemporary news articles and contains annotations for Universal Dependencies and Named Entities. The NER portion of the corpus contains around one million words.</p><p>For the experiments performed for this paper, we use a number of publicly available models: 1. Multilingual BERT (mBERT); 2. XLM-RoBERTa (base); 3. BETO, a monolingual Spanish BERT trained with Wikipedia and Spanish data from the OPUS corpus [22]; 4. the official Flair models for Spanish. Additionally, we trained the following monolingual language models for Spanish:</p><p>1. Flair-GW: Flair character-based language model trained on the Spanish Wikipedia and the Gigaword 3rd edition corpus, containing around 11GB of text. 2. Flair-Oscar: Flair language model trained on the OSCAR Spanish corpus <ref type="bibr" target="#b22">[23]</ref>, which contains 157GB of Common Crawl text, cleaned and deduplicated.</p><p>The Flair embeddings for Flair-GW and Flair-Oscar were trained with the following parameters: hidden size 2048, sequence length of 250, and a mini-batch size of 100. The rest of the parameters were left at their default settings. 
For Flair-GW, training was done for 5 epochs over the full training corpus and took around 5 days on an Nvidia Titan V GPU. With respect to Flair-Oscar, only one epoch was performed, requiring around a month to complete.</p></div>
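Converting between the two encodings mentioned above is mechanical. As a small sketch (our own helper, not part of the CAPITEL or Flair tooling), BILOU tags map onto BIO by rewriting Unit tags as Begin tags and Last tags as Inside tags:

```python
def bilou_to_bio(tags):
    """Map a sequence of BILOU (a.k.a. BIOES) tags to BIO tags."""
    out = []
    for tag in tags:
        if tag.startswith("U-"):
            out.append("B-" + tag[2:])   # unit-length entity -> Begin
        elif tag.startswith("L-"):
            out.append("I-" + tag[2:])   # last token of entity -> Inside
        else:
            out.append(tag)              # B-, I- and O are unchanged
    return out

bilou_to_bio(["U-PER", "O", "B-ORG", "I-ORG", "L-ORG"])
# -> ["B-PER", "O", "B-ORG", "I-ORG", "I-ORG"]
```

The reverse direction (BIO to BILOU) additionally needs one token of lookahead to decide whether a token is the last of its entity.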
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head><p>Table <ref type="table" target="#tab_0">1</ref> reports only the best results obtained during the experimentation. Each of the S1-S8 results is the average of five randomly initialized runs. Flair models were trained using the default parameters, although we experimented with adding FastText embeddings to the Flair and Pooled embeddings. We used 10 percent of the training data for development of the Flair models. In the case of the Transformer models described in the previous section, we used the full training set for hyperparameter fine-tuning. For XLM-RoBERTa we used a maximum sequence length of 128, a mini-batch size of 16, a 5e-5 learning rate, and 4 epochs. For mBERT and BETO the best results were obtained using the same hyperparameters as for XLM-RoBERTa but increasing the sequence length to 256.</p><p>Out of the many experiments performed with the three Flair language models (Official, GW and Oscar), the best performing language model in every possible configuration was the Flair-Oscar model combined with the FastText embeddings trained on Wikipedia. In fact, Flair-Oscar was the best single system by a substantial margin. Apart from this, S2 and S3 show the small gains obtained by adding the 10 percent held out for development back into the training data for the final evaluation. Furthermore, S3 was trained when the language model had completed half an epoch of training, whereas S4 was trained using the final Oscar language model based on one full epoch. Finally, S5 is the same model as S1 but using the BIO encoding instead of the original BILOU encoding of the CAPITEL corpus. The best overall individual system was S4, significantly outperforming the multilingual and monolingual Transformer models.</p><p>With respect to the Transformer models, it can be seen that in general their results are lower than those obtained by the Flair-Oscar models. 
During the development phase they all performed very closely, although in the final official results XLM-RoBERTa was slightly superior to the rest. Furthermore, the results also show that mBERT performed worst and that XLM-RoBERTa obtained very similar results to the monolingual models.</p><p>The last three rows of Table <ref type="table" target="#tab_0">1</ref> report the three best projections. Once we had the best 8 systems, we projected their predictions by means of every possible combination of the 8 systems. The best three projections were picked based on two criteria: the F1 score obtained on the development data and the number of no-agreements recorded by each projection.</p><p>The projections were performed using 5 predictions as source. We tested various strategies, and the one we finally used to report the final results was, interestingly enough, the simplest of them all. It relies on the number of agreements between the predicted labels of the 5 source annotations: if the agreement is &gt;= 3, the label is projected; otherwise, "O" is projected.</p><p>As we could not compute F1 scores on the official test set released by the shared task, we simply picked the projection which recorded the fewest no-agreements. This corresponds to the best overall system (P3), which uses S3, S4, S6, S7 and S8 as sources to obtain the final prediction.</p></div>
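The agreement-based projection rule can be sketched as follows (function and variable names are illustrative; the short label sequences stand in for the five source systems' predictions over the same tokens):

```python
from collections import Counter

def project(predictions, min_agreement=3):
    """predictions: list of per-system label sequences, all the same length.
    For each token, project the majority label if enough systems agree,
    otherwise fall back to the outside label "O"."""
    projected = []
    for labels in zip(*predictions):          # labels for one token position
        label, count = Counter(labels).most_common(1)[0]
        projected.append(label if count >= min_agreement else "O")
    return projected

# Five source systems predicting over a three-token sentence:
sources = [
    ["B-PER", "I-PER", "O"],
    ["B-PER", "I-PER", "O"],
    ["B-PER", "O",     "O"],
    ["B-ORG", "I-PER", "B-LOC"],
    ["B-PER", "I-PER", "O"],
]
project(sources)  # -> ["B-PER", "I-PER", "O"]
```

Counting the positions where no label reaches the agreement threshold gives the no-agreement statistic used above to choose among candidate projections.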
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Concluding Remarks</head><p>In this paper we have described the experiments performed for our participation in the CAPITEL 2020 shared task on Named Entity Recognition. Even though the best results are obtained by the Flair-Oscar monolingual models, our results indicate that multilingual pre-trained models such as XLM-RoBERTa are performing increasingly close to monolingual models, even for a high-resource language such as Spanish. Furthermore, we also show the benefits of projecting named entity annotations from various heterogeneous sources in order to substantially improve performance (around 1.3 points in F1 score over the best individual system).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) email: rodrigo.agerri@ehu.eus (R. Agerri) orcid: 0000-0002-7303-7598 (R. Agerri); 0000-0003-1119-0930 (G. Rigau)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Overview results on both development and test data.</figDesc><table><row><cell></cell><cell></cell><cell>Development</cell><cell></cell><cell></cell><cell>Test</cell><cell></cell></row><row><cell>System</cell><cell cols="6">Precision Recall F1 score Precision Recall F1 score</cell></row><row><cell>S1 Flair-Oscar + FT</cell><cell>89.65</cell><cell>89.36</cell><cell>89.51</cell><cell>88.86</cell><cell>88.63</cell><cell>88.74</cell></row><row><cell>S2 Flair-Oscar + FT (dev)</cell><cell>89.67</cell><cell>89.53</cell><cell>89.60</cell><cell>88.97</cell><cell>88.75</cell><cell>88.86</cell></row><row><cell>S3 Pool-Oscar + FT (dev)</cell><cell>89.85</cell><cell>89.63</cell><cell>89.79</cell><cell>89.07</cell><cell>88.85</cell><cell>88.96</cell></row><row><cell>S4 Pool-Oscar + FT e1</cell><cell>89.78</cell><cell>89.72</cell><cell>89.75</cell><cell>89.29</cell><cell>88.82</cell><cell>89.07</cell></row><row><cell>S5 Flair-Oscar + FT BIO</cell><cell>89.71</cell><cell>89.58</cell><cell>89.64</cell><cell>89.19</cell><cell>88.78</cell><cell>88.99</cell></row><row><cell>S6 BETO</cell><cell>89.64</cell><cell>89.34</cell><cell>88.99</cell><cell>87.19</cell><cell>88.36</cell><cell>87.77</cell></row><row><cell>S7 mBERT</cell><cell>87.90</cell><cell>88.90</cell><cell>88.40</cell><cell>87.03</cell><cell>87.75</cell><cell>87.39</cell></row><row><cell>S8 XLM-RoBERTa</cell><cell>88.29</cell><cell>89.54</cell><cell>88.91</cell><cell>87.37</cell><cell>88.48</cell><cell>87.92</cell></row><row><cell>P1 S2-S3-S6-S7-S8</cell><cell>91.32</cell><cell>90.77</cell><cell>91.04</cell><cell>90.70</cell><cell>88.11</cell><cell>89.38</cell></row><row><cell>P2 S2-S4-S6-S7-S8</cell><cell>91.10</cell><cell>90.59</cell><cell>90.84</cell><cell>90.81</cell><cell>88.06</cell><cell>89.42</cell></row><row><cell>P3 
S3-S4-S6-S7-S8</cell><cell>91.19</cell><cell>90.72</cell><cell>90.96</cell><cell>90.50</cell><cell>90.17</cell><cell>90.34</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">Nowadays also known as the BIOES encoding: Beginning, Inside, Outside, End of entity and Single entity.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Innovation and Universities (DeepReading RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE) and by Ayudas Fundación BBVA a Equipos de Investigación Científica 2018 (BigKnowledge). Rodrigo Agerri is funded by the RYC-2017-23647 fellowship and acknowledges the donation of a Titan V GPU by the NVIDIA Corporation.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">F</forename><surname>Tjong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kim</forename><surname>Sang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of CoNLL-2002</title>
				<meeting>CoNLL-2002<address><addrLine>Taipei, Taiwan</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="155" to="158" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Introduction to the CoNLL-2003 shared task: Languageindependent named entity recognition</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">F</forename><surname>Tjong Kim Sang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">De</forename><surname>Meulder</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003</title>
				<meeting>the seventh conference on Natural language learning at HLT-NAACL 2003</meeting>
		<imprint>
			<date type="published" when="2003">2003</date>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="142" to="147" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Contextual string embeddings for sequence labeling</title>
		<author>
			<persName><forename type="first">A</forename><surname>Akbik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Blythe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vollgraf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">27th International Conference on Computational Linguistics</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1638" to="1649" />
		</imprint>
	</monogr>
	<note>COL-ING 2018</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">BERT: pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019</title>
				<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019<address><addrLine>Minneapolis, MN, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">June 2-7, 2019. 2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Long and Short Papers</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Chaudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wenzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Guzmán</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1911.02116</idno>
		<title level="m">Unsupervised cross-lingual representation learning at scale</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Give your text representation models some love: the case for basque</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agerri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>San</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Vicente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Campos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Barrena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Saralegi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Soroa</surname></persName>
		</author>
		<author>
			<persName><surname>Agirre</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of The 12th Language Resources and Evaluation Conference (LREC 2020)</title>
				<meeting>The 12th Language Resources and Evaluation Conference (LREC 2020)</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="4781" to="4788" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Cross-lingual ability of multilingual bert: An empirical study</title>
		<author>
			<persName><forename type="first">K</forename><surname>Karthikeyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mayhew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations (ICLR)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Overview of CAPITEL Shared Tasks at IberLEF 2020: NERC and Universal Dependencies Parsing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Porta-Zamorano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Espinosa-Anke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Iberian Languages Evaluation Forum</title>
				<meeting>the Iberian Languages Evaluation Forum<address><addrLine>IberLEF</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020. 2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Building named entity recognition taggers via parallel corpora</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agerri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Aldabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Aranberri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Labaka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Rigau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</title>
				<meeting>the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="3111" to="3119" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Enriching word vectors with subword information</title>
		<author>
			<persName><forename type="first">P</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="135" to="146" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="5998" to="6008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Sequence tagging with contextual and non-contextual subword representations: A multilingual evaluation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Heinzerling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Strube</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="273" to="291" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">How multilingual is multilingual BERT?</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pires</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Schlinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Garrette</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 57th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4996" to="5001" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Bidirectional LSTM-CRF Models for Sequence Tagging</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1508.01991</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Pooled contextualized embeddings for named entity recognition</title>
		<author>
			<persName><forename type="first">A</forename><surname>Akbik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bergmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vollgraf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Annual Conference of the North American Chapter of the Association for Computational Linguistics</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="724" to="728" />
		</imprint>
	</monogr>
	<note>NAACL 2019</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2001.08361</idno>
		<title level="m">Scaling laws for neural language models</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carbonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.08237</idno>
		<title level="m">XLNet: Generalized Autoregressive Pretraining for Language Understanding</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<title level="m">RoBERTa: A robustly optimized BERT pretraining approach</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Conneau</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.07291</idno>
		<title level="m">Cross-lingual language model pretraining</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Design challenges and misconceptions in named entity recognition</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ratinov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Roth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirteenth Conference on Computational Natural Language Learning</title>
				<meeting>the Thirteenth Conference on Computational Natural Language Learning</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="147" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Parallel data, tools and interfaces in OPUS</title>
		<author>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">LREC</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="2214" to="2218" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Ortiz Suárez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Sagot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Romary</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019</title>
				<meeting>the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019<address><addrLine>Cardiff</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019-07">July 2019</date>
			<biblScope unit="page" from="9" to="16" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
