<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Projecting Heterogeneous Annotations for Named Entity Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rodrigo Agerri</string-name>
          <email>rodrigo.agerri@ehu.eus</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>German Rigau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HiTZ Center - Ixa, University of the Basque Country UPV/EHU</institution>
        </aff>
      </contrib-group>
      <fpage>45</fpage>
      <lpage>51</lpage>
      <abstract>
<p>In this paper we describe our participation in the CAPITEL at IberLEF 2020 shared task on Named Entity Recognition (NER). Our objectives in participating in the shared task were twofold: (i) to benchmark current rich multilingual representations of text against monolingual models trained specifically for Spanish; and (ii) to study various methods of projecting annotations from several sources into a final target prediction. Our results show that monolingual models, even for a large language such as Spanish, perform better on this particular NER benchmark. Furthermore, our projection method indicates that substantial gains in performance can be obtained by projecting annotations from various heterogeneous sources to obtain the final prediction. Our submission obtained the best score, substantially outperforming the other participants of the CAPITEL 2020 NER task.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In this work we benchmark current multilingual pre-trained language models, such as mBERT and XLM-RoBERTa, against monolingual models trained specifically for Spanish. Furthermore, we project the annotations provided by each system into a final target prediction. The projection of several source annotations into a target is loosely inspired by a method originally designed for the projection of annotations across languages [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Our projection method indicates that substantial gains in performance (around 1.3 points in F1 score) can be obtained by projecting annotations from various heterogeneous sources into a final target prediction. Our submission obtained the best score, substantially outperforming the other participants of the CAPITEL 2020 NER task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Deep learning methods in NLP rely on the ability to represent words as continuous vectors in a low-dimensional space, called word embeddings. The first approaches generated static word embeddings [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], namely, they provided a unique vector-based representation for a given word independently of the context in which the word occurs. This means that polysemy cannot be represented. Thus, if we consider the word ‘bank’, static word embedding approaches will generate only one vector representation even though the word may have different senses, namely, ‘financial institution’, ‘bench’, etc.
      </p>
      <p>
        In order to address this problem, contextual word embeddings were proposed. The idea is to generate different word representations according to the context in which the word appears. Currently there are many approaches to generate such contextual word representations, but we will focus on those that have had a direct impact, in terms of performance, on the Named Entity Recognition task. First, Flair [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] representations are built following an LSTM-based architecture and trained as language models. Second, there are the models based on the Transformer architecture [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], of which BERT is perhaps the most popular example [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The multilingual counterpart of BERT, called mBERT, is a single language model pre-trained on corpora in more than 100 languages. Another standout model is XLM-RoBERTa [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], also based on the Transformer architecture, which provides a pre-trained language model for 100 languages trained on 2.5TB of Common Crawl text. Both mBERT and XLM-RoBERTa make it possible to transfer knowledge across languages [
        <xref ref-type="bibr" rid="ref13 ref14 ref7">13, 14, 7</xref>
        ], although in this paper we will use them in a monolingual setting for Spanish NER.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Flair</title>
        <p>
          Flair refers both to a system based on a BiLSTM architecture [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and to a specific type of character-based contextual word embeddings. Flair (embeddings and system) has been successfully applied to sequence labeling tasks, obtaining state-of-the-art results on a number of Named Entity Recognition (NER) and Part-of-Speech tagging benchmarks [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>Flair embeddings are built from sequences of characters. More specifically, sentences are processed into sequences of characters and fed into a character-level Long Short-Term Memory (LSTM) model. For each sentence, a forward LSTM language model processes its sequence of characters from the beginning of the sentence to the last character of the word being modeled. Furthermore, a backward LSTM performs the same operation, going from the end of the sentence up to the first character of the word. The extracted hidden states contain information propagated from the beginning and the end of the sentence up to the last and the first character of the target word, respectively. Finally, the two resulting hidden states are concatenated to generate the final embedding.</p>
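        <p>The following is a minimal PyTorch sketch of this idea. It is illustrative only, not the actual Flair implementation: dimensions, variable names and the exact character offsets used for extraction are assumptions.</p>
        <preformat>
import torch
import torch.nn as nn

CHAR_VOCAB, CHAR_DIM, HIDDEN_DIM = 256, 50, 512   # illustrative sizes

char_emb = nn.Embedding(CHAR_VOCAB, CHAR_DIM)
forward_lm = nn.LSTM(CHAR_DIM, HIDDEN_DIM, batch_first=True)
backward_lm = nn.LSTM(CHAR_DIM, HIDDEN_DIM, batch_first=True)

def flair_like_embedding(char_ids, word_start, word_end):
    # char_ids: (1, sentence_length) tensor of character ids;
    # word_start / word_end: character offsets of the target word.
    x = char_emb(char_ids)                              # (1, L, CHAR_DIM)
    fwd_out, _ = forward_lm(x)                          # left-to-right pass over the sentence
    bwd_out, _ = backward_lm(torch.flip(x, dims=[1]))   # right-to-left pass
    bwd_out = torch.flip(bwd_out, dims=[1])             # restore the original character order
    h_fwd = fwd_out[0, word_end]     # forward state after the last character of the word
    h_bwd = bwd_out[0, word_start]   # backward state at the first character of the word
    return torch.cat([h_fwd, h_bwd], dim=-1)            # final contextual word embedding
        </preformat>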
        <p>
          Pooled embeddings are a type of Flair embeddings which take global information into account in order to generate the final word embedding [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. In this approach, the contextualized embeddings computed for a word are kept in a memory which is later used in a pooling operation to obtain a global representation of that word. This global representation is obtained by pooling over all the local Flair contextualized embeddings seen so far for the word and is combined with the local embedding of its current occurrence. It should be noted that the pooling operation is applied when fine-tuning the pre-trained Flair models, not when training the language models themselves. We use the default pooling operation, min, which computes a vector of the element-wise minimum values [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
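        <p>A small sketch of the min pooling step follows, assuming a memory that stores every contextual embedding previously computed for a word; this is illustrative and not the Flair implementation.</p>
        <preformat>
import torch
from collections import defaultdict

memory = defaultdict(list)   # word -> list of contextual embeddings seen so far

def pooled_embedding(word, local_embedding):
    # Concatenate the current contextual embedding with the element-wise
    # minimum over all embeddings stored for this word so far (min pooling).
    memory[word].append(local_embedding)
    pooled = torch.stack(memory[word]).min(dim=0).values   # element-wise minimum
    return torch.cat([local_embedding, pooled], dim=-1)
        </preformat>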
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Transformers</title>
        <p>
          LSTM-based language models such as the one presented in the previous section have difficulty capturing long-range sequence information. Furthermore, they are quite hard to train at a large scale (see [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], especially Figure 7). In order to address these issues, the Transformer architecture was proposed [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], based on multi-headed self-attention and positional encoding. The most popular Transformer is BERT [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which pre-trains a Transformer encoder on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. BERT is composed of stacked layers of Transformer encoders [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. More specifically, in this paper we use the BERT BASE configuration, which contains 12 Transformer encoder layers, a hidden size of 768 and 12 self-attention heads, for a total of 110M parameters.
        </p>
        <p>The MLM task is designed as follows: for an input sequence of n tokens t1, t2, ..., tn, 15% of the tokens are selected as masking candidates. Of those candidates, 80% are masked (they are replaced with the [MASK] token), 10% are replaced by a random word and the remaining 10% are left unchanged. For the NSP task, two segments A and B are selected from the training corpus. In 50% of the cases B is the true next segment following A; for the rest, B is just a random segment. The model is trained to optimize the sum of the means of the MLM and NSP likelihoods.</p>
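        <p>As an illustration of the 15% / 80-10-10 masking scheme described above, the following sketch applies it to a list of token ids. It is not BERT's actual implementation; the [MASK] id and vocabulary size are assumptions.</p>
        <preformat>
import random

MASK_ID = 103          # assumption: id of the [MASK] token
VOCAB_SIZE = 30000     # assumption: vocabulary size

def mask_tokens(token_ids):
    # BERT-style MLM masking: 15% of tokens are selected; of those,
    # 80% become [MASK], 10% become a random token, 10% are left unchanged.
    inputs, labels = list(token_ids), [-100] * len(token_ids)   # -100: not predicted
    for i, tok in enumerate(token_ids):
        if random.random() &lt; 0.15:                 # select as a masking candidate
            labels[i] = tok                         # the model must recover the original token
            r = random.random()
            if r &lt; 0.8:
                inputs[i] = MASK_ID                 # 80%: replace with [MASK]
            elif r &lt; 0.9:
                inputs[i] = random.randrange(VOCAB_SIZE)   # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
        </preformat>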
      <p>
        It should be noted that the benefits of the NSP task during the pre-training process have been questioned [
        <xref ref-type="bibr" rid="ref18 ref19 ref20">18, 19, 20</xref>
        ]. Thus, other Transformer proposals such as RoBERTa train without the NSP task, showing strong performance on the same downstream tasks.
      </p>
      <p>
        XLM-RoBERTa relies exclusively on the MLM objective. The biggest update that XLM-RoBERTa offers is a significantly increased amount of training data, namely 2.5TB of clean Common Crawl data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As with BERT, in this paper we use the base version of XLM-RoBERTa, the reason being that the base versions fit into a standard GPU card with 12GB of RAM for fine-tuning.
      </p>
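        <p>For orientation, a hedged sketch of loading such a base checkpoint for token-classification fine-tuning with the Hugging Face transformers library is shown below; the library choice and the number of labels are assumptions, not necessarily the setup used in our experiments.</p>
        <preformat>
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",   # the base checkpoint fits a 12GB GPU for fine-tuning
    num_labels=9,         # assumption: size of the BIO/BILOU tag set used
)
        </preformat>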
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Setup</title>
      <p>Named entities were originally annotated using the BIO encoding, which identifies the Beginning, the Inside and the Outside of named entities. Later on, the BILOU model (nowadays also known as the BIOES encoding: Beginning, Inside, Outside, End of entity and Single entity) was proposed to mark tokens as the Beginning, the Inside and the Last tokens of multi-token entities, as well as Unit-length entities [21]. Although the CAPITEL corpus is originally released using the BILOU model, we experiment with both types of encoding.</p>
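      <p>For illustration, a small helper that maps BILOU (BIOES) tags to BIO, which is one way the BIO-encoded experiments can be derived from the original CAPITEL annotations; tag prefixes are assumed to follow the standard scheme.</p>
      <preformat>
def bilou_to_bio(tags):
    # Map BILOU/BIOES tags to BIO: U-/S- (single-token entity) becomes B-,
    # L-/E- (last token of an entity) becomes I-; B-, I- and O are unchanged.
    mapping = {'U': 'B', 'S': 'B', 'L': 'I', 'E': 'I'}
    out = []
    for tag in tags:
        if tag == 'O':
            out.append(tag)
        else:
            prefix, label = tag.split('-', 1)
            out.append(mapping.get(prefix, prefix) + '-' + label)
    return out

# Example: a two-token organisation followed by a single-token location.
# ['B-ORG', 'L-ORG', 'O', 'U-LOC']  becomes  ['B-ORG', 'I-ORG', 'O', 'B-LOC']
print(bilou_to_bio(['B-ORG', 'L-ORG', 'O', 'U-LOC']))
      </preformat>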
      <p>The CAPITEL corpus (Corpus del Plan de Impulso a las Tecnologías del Lenguaje) has been developed by the PlanTL, the Royal Spanish Academy (RAE) and the Secretariat of State for Digital Advancement (SEAD) of the Ministry of Economy. These organizations signed an agreement to develop a linguistically annotated corpus of Spanish news articles, with the objective of extending the language resource infrastructure for the Spanish language. CAPITEL is composed of contemporary news articles and contains annotations for Universal Dependencies and Named Entities. The NER portion of the corpus contains around one million words.</p>
      <p>For the experiments performed for this paper, we use a number of publicly available models:
1. Multilingual BERT (mBERT).
2. XLM-RoBERTa (base).
3. BETO, a monolingual Spanish BERT trained with Wikipedia and Spanish data from the OPUS corpus [22].
4. Official Flair models for Spanish.</p>
      <p>Additionally, we trained the following monolingual language models for Spanish:
1. Flair-GW: a Flair character-based language model trained on the Spanish Wikipedia and the Gigaword 3rd edition corpus, containing around 11GB of text.
2. Flair-Oscar: a Flair language model trained on the OSCAR Spanish corpus [23], which contains 157GB of Common Crawl text, cleaned and deduplicated.</p>
      <p>The Flair embeddings for Flair-GW and Flair-Oscar were trained with the following parameters: a hidden size of 2048, a sequence length of 250, and a mini-batch size of 100. The rest of the parameters were left at their default settings. For Flair-GW, training was performed for 5 epochs over the full training corpus and took around 5 days on an Nvidia Titan V GPU. For Flair-Oscar, only one epoch was performed, requiring around a month to complete.</p>
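      <p>A minimal sketch of how such a character-level Flair language model can be trained with the Flair library, using the hyperparameters listed above, is shown below. It assumes a Flair version contemporary with this work; the corpus and output paths are placeholders, and this is not necessarily the exact training script we used.</p>
      <preformat>
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

is_forward_lm = True                       # a backward model is trained analogously
dictionary = Dictionary.load('chars')      # default character dictionary

# corpus directory with a train/ split plus valid.txt and test.txt (placeholder path)
corpus = TextCorpus('corpora/es_gigaword', dictionary, is_forward_lm,
                    character_level=True)

# hidden size 2048 as reported above
language_model = LanguageModel(dictionary, is_forward_lm, hidden_size=2048, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/flair_es_forward',
              sequence_length=250,         # sequence length of 250
              mini_batch_size=100,         # mini-batch size of 100
              max_epochs=5)                # 5 epochs for Flair-GW, 1 for Flair-Oscar
      </preformat>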
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>combined with the FastText embeddings trained on Wikipedia. In fact, Flair-Oscar was the best single system by a substantial margin. Apart from this, S2 and S3 show the small gains obtained by adding the 10 percent of the data used for development to the training set for the final evaluation. Furthermore, S3 was trained when the Oscar language model had only been trained for half an epoch, whereas S4 was trained using the final Oscar language model based on one full epoch. Finally, S5 is the same model as S1 but using the BIO encoding instead of the original BILOU encoding of the CAPITEL corpus. The best overall individual system was S4, significantly outperforming the multilingual and monolingual Transformer models.</p>
      <p>With respect to the Transformer models, it can be seen that in general their results are lower than those obtained by the Flair-Oscar models. During the development phase they all performed very closely, although in the final, official results XLM-RoBERTa was slightly superior to the rest. Furthermore, the results also show that mBERT performed worst and that XLM-RoBERTa obtains very similar results to the monolingual models.</p>
      <p>The last three rows of Table 1 report the three best projections. Once we had the best 8 systems, we proceeded to project their predictions by means of every possible combination of those 8 systems. The three best projections were picked based on two criteria: the F1 score obtained on the development data and the number of No-agreements recorded by each projection.</p>
      <p>The projections were performed using 5 predictions as source. We tested various strategies, and the one we finally used to report the final results was, interestingly enough, the simplest of them all: a voting scheme based on the number of agreements between the predicted labels of the 5 source annotations. If the agreement is &gt;= 3, the agreed label is projected; otherwise, “O” is projected.</p>
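      <p>A sketch of this projection rule for aligned token-level predictions follows; the label names and the helper functions are illustrative, not our exact implementation.</p>
      <preformat>
from collections import Counter

def project_token(source_labels, min_agreement=3):
    # Project the most frequent label among the source predictions if at
    # least min_agreement sources agree on it; otherwise project 'O'.
    label, count = Counter(source_labels).most_common(1)[0]
    return label if count &gt;= min_agreement else 'O'

def project_sentence(source_predictions):
    # source_predictions: list of 5 label sequences for the same sentence.
    return [project_token(labels) for labels in zip(*source_predictions)]

# Example with 5 source systems and 3 tokens:
preds = [['B-PER', 'I-PER', 'O'],
         ['B-PER', 'I-PER', 'O'],
         ['B-PER', 'O',     'O'],
         ['O',     'I-PER', 'B-LOC'],
         ['B-PER', 'I-PER', 'O']]
print(project_sentence(preds))   # ['B-PER', 'I-PER', 'O']
      </preformat>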
      <p>As we could not compute F1 scores on the official test set released by the shared task, we simply picked the projection which recorded the fewest No-agreements. This corresponds to the best overall system (P3), which uses S3, S4, S6, S7 and S8 as sources to obtain the final prediction.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Concluding Remarks</title>
      <p>In this paper we have described the experiments performed for our participation in the CAPITEL 2020 shared task on Named Entity Recognition. Even though the best results are obtained by the Flair-Oscar monolingual models, our results indicate that multilingual pre-trained models such as XLM-RoBERTa are performing increasingly close to monolingual models for a large-resourced language such as Spanish. Furthermore, we also show the benefits of projecting named entity annotations from various heterogeneous sources in order to substantially improve performance (around 1.3 points in F1 score over the best individual system).</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the Spanish Ministry of Science, Innovation and Universities (DeepReading RTI2018-096846-B-C21, MCIU/AEI/FEDER, UE) and by Ayudas Fundación BBVA a Equipos de Investigación Científica 2018 (BigKnowledge). Rodrigo Agerri is funded by the RYC-2017-23647 fellowship and acknowledges the donation of a Titan V GPU by the NVIDIA Corporation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          ,
          <article-title>Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition</article-title>
          ,
          <source>in: Proceedings of CoNLL-2002</source>
          , Taipei, Taiwan,
          <year>2002</year>
          , pp.
          <fpage>155</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          , F. De Meulder,
          <article-title>Introduction to the CoNLL-2003 shared task: Languageindependent named entity recognition</article-title>
          ,
          <source>in: Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          ,
          <source>in: COLING</source>
          <year>2018</year>
          , 27th International Conference on Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , arXiv preprint arXiv:1911.02116 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agerri</surname>
          </string-name>
          , I. San Vicente,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Saralegi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroa</surname>
          </string-name>
          , E. Agirre,
          <article-title>Give your text representation models some love: the case for basque</article-title>
          ,
          <source>in: Proceedings of The 12th Language Resources and Evaluation Conference (LREC</source>
          <year>2020</year>
          ),
          <year>2020</year>
          , pp.
          <fpage>4781</fpage>
          -
          <lpage>4788</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Karthikeyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mayhew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Cross-lingual ability of multilingual bert: An empirical study</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Porta-Zamorano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <article-title>Overview of CAPITEL Shared Tasks at IberLEF 2020: NERC and Universal Dependencies Parsing</article-title>
          ,
          <source>in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2020</year>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agerri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chung</surname>
          </string-name>
          , I. Aldabe,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aranberri</surname>
          </string-name>
          , G. Labaka, G. Rigau,
          <article-title>Building named entity recognition taggers via parallel corpora</article-title>
          ,
          <source>in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          ),
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (</article-title>
          <year>2017</year>
          )
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Heinzerling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          ,
          <article-title>Sequence tagging with contextual and non-contextual subword representations: A multilingual evaluation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>291</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schlinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Garrette</surname>
          </string-name>
          ,
          <article-title>How multilingual is multilingual bert?</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4996</fpage>
          -
          <lpage>5001</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Bidirectional LSTM-CRF Models for Sequence Tagging</article-title>
          ,
          <year>2015</year>
          . arXiv:1508.01991.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>Pooled contextualized embeddings for named entity recognition</article-title>
          ,
          <source>in: NAACL</source>
          <year>2019</year>
          ,
          <article-title>2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics</article-title>
          ,
          <year>2019</year>
          , p.
          <fpage>724</fpage>
          -
          <lpage>728</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          , T. B.
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Scaling laws for neural language models</article-title>
          , arXiv preprint arXiv:2001.08361 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. Carbonell, R. Salakhutdinov,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>XLNet: Generalized Autoregressive Pretraining for Language Understanding</article-title>
          , arXiv preprint arXiv:1906.08237 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1901.07291 (2019).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] L. Ratinov, D. Roth, Design challenges and misconceptions in named entity recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language Learning, 2009, pp. 147-155.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: LREC, volume 2012, 2012, pp. 2214-2218.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P. J. Ortiz Suárez, B. Sagot, L. Romary, Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures, in: Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7), Cardiff, 22 July 2019, 2019, pp. 9-16.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>