<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IPR: The Semantic Textual Similarity and Recognizing Textual Entailment systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rui Rodrigues</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paula Couto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Irene Rodrigues</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Matemática e Aplicações (CMA), FCT, UNL, Departamento de Matemática</institution>
          ,
          <addr-line>FCT, UNL</addr-line>
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratório de Informática, Sistemas e Paralelismo (LISP), Departamento de Informática, Universidade de Évora</institution>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>47</lpage>
      <abstract>
        <p>We describe IPR's systems developed for ASSIN2 (Evaluating Semantic Similarity and Textual Entailment). Our best submission ranked first in the Semantic Textual Similarity task and second in the Recognizing Textual Entailment task. These systems were developed using BERT: for each task, we added one layer to a pre-trained BERT model and fine-tuned the whole task network.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>For each task, we added one layer to a pre-trained BERT model and used the ASSIN2 train data and the ASSIN train and test data to fine-tune the resulting network, giving rise to our systems.</p>
      <p>IPR’s best submission ranked first in the STS task (Pearson correlation) and second in the RTE task (F1 score).</p>
      <p>In these semantic tasks, the strategy of fine-tuning a BERT language model with the task data has obtained very good results. Section 2 explains the method we followed. Section 3 describes our approach to the specific tasks and presents the results we achieved, along with instances where the systems did not perform well. Section 4 discusses our performance in ASSIN2 and outlines future plans for improving the systems.</p>
    </sec>
    <sec id="sec-2">
      <title>Using BERT for NLP semantic tasks</title>
      <p>
        BERT models build on word embeddings: representations of words as real-number vectors in a predefined vector space. The use of real-number vectors to represent words dates back to 1986 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Later, in 2003, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] obtained a vector representation of words using a neural network, with the vectors being elements of a probabilistic language model.
      </p>
      <p>
        In 2008, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] trained the network that produced such a vector representation to build a language model jointly with several NLP tasks: part-of-speech tagging, chunking, named entity tagging, semantic role labelling, and semantically similar words.
      </p>
      <p>
        In 2013 a significant advance was achieved with Word2vec [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>Word2vec is a shallow, log-bilinear model without non-linearities, which enables higher-dimensional vector representations and training on larger corpora. It achieved important results in several NLP tasks involving word-level semantic and syntactic similarity.</p>
      <p>
        GloVe [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is also a log-bilinear model, but its training differs from that of word2vec since it uses a global word co-occurrence matrix. GloVe achieved important results on NLP tasks such as word analogy and Named Entity Recognition.
      </p>
      <p>In these models, word representations do not distinguish the different meanings of some words: the vector representation of a word is always the same, independently of context (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for more details on word embeddings).</p>
      <p>
        More recently, language models ELMo [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and ULMFit [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] used bidirectional recurrent neural networks (LSTMs) to generate vector representations of words based on context. These contextualized word representations allowed improvements that brought these models to the state of the art in many NLP tasks.
      </p>
      <p>
        BERT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a Transformer neural network [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] that integrates bidirectional context better. The use of feedforward networks instead of recurrent ones allows for a much bigger model. On most NLP tasks, BERT achieves better results than any previous model. The version we used, BERT-Base, has 110 million parameters.
      </p>
      <p>
        Each BERT input is one sentence or a pair of sentences. Each sentence is first converted to a sequence of tokens using the ‘WordPiece’ tokenizer [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. More concretely, each word is converted into one or more tokens: tokens are more meaningful when they correspond to frequent words, suffixes, or prefixes. In our case, the output of BERT is a vector of 768 floats.
      </p>
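      <p>As a concrete illustration of this input/output interface, the sketch below feeds a sentence pair to a pre-trained multilingual BERT-Base model and inspects the 768-dimensional pooled output. It is a minimal sketch only: the HuggingFace transformers library and the bert-base-multilingual-cased checkpoint named here are illustrative assumptions, not necessarily the exact tools used in our submissions.</p>
      <preformat>
# Minimal sketch (assumed tooling: HuggingFace "transformers"): tokenize a
# sentence pair with WordPiece and obtain the 768-dimensional pooled output.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")

# Each input is one sentence or a pair of sentences, converted to WordPiece tokens.
premise = "Um homem está tocando teclado"
hypothesis = "Um homem está tocando um violão elétrico"
inputs = tokenizer(premise, hypothesis, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# For BERT-Base, the pooled representation is a vector of 768 floats.
print(outputs.pooler_output.shape)  # torch.Size([1, 768])
      </preformat>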
      <p>BERT is pre-trained simultaneously on two tasks:
– 15% of the tokens in a sentence are masked and BERT tries to predict them.
– Two sentences, A and B, are given and BERT must decide whether B is the sentence that follows A.</p>
      <p>Two network layers (in parallel) are added on top of BERT in order to train it on these tasks.</p>
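      <p>A minimal sketch of these two parallel layers is shown below, assuming BERT-Base dimensions (hidden size 768) and PyTorch; the variable names are illustrative assumptions, not the original pre-training code.</p>
      <preformat>
# Sketch of the two parallel pre-training heads added on top of BERT:
# a masked-token prediction layer and a next-sentence prediction layer.
import torch.nn as nn

hidden_size, vocab_size = 768, 119547  # BERT-Base, multilingual vocabulary

# Predict each masked token over the whole vocabulary.
mlm_head = nn.Linear(hidden_size, vocab_size)

# Decide, from the pooled sentence-pair vector, whether sentence B follows A.
nsp_head = nn.Linear(hidden_size, 2)

# sequence_output: (batch, seq_len, 768); pooled_output: (batch, 768)
# mlm_logits = mlm_head(sequence_output)  # (batch, seq_len, vocab_size)
# nsp_logits = nsp_head(pooled_output)    # (batch, 2)
      </preformat>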
    </sec>
    <sec id="sec-3">
      <title>Our systems</title>
      <p>In this section we describe our systems' approach using BERT, a Portuguese corpus, and the ASSIN1 and ASSIN2 datasets.</p>
      <sec id="sec-3-1">
        <title>BERT versions</title>
        <p>The authors of BERT made available a pre-trained multilingual version. It was trained on 104 languages, including Portuguese, Arabic, Russian, Chinese, and Japanese. The resulting vocabulary consists of 119,547 tokens; by comparison, the vocabulary of the English BERT version contains 30,000 tokens. Therefore, the set of tokens in the multilingual version cannot be well adapted to each individual language.</p>
        <p>The example below presents the tokenization of an ASSIN2 sentence (“The boy and the girl are playing at the outdoor gym”), where tokens are separated by a space:</p>
        <p>O meni ##no e a meni ##na estão br ##in ##cando na academia ao ar livre</p>
        <p>We can see that words such as “menina” and the verb “brincar” do not have a natural decomposition into tokens.</p>
        <p>
          The multilingual tokenization of Portuguese leads us to suspect that this BERT version may not be optimal for Portuguese, and possibly for other languages. So, to adjust the network resources (weights) to Portuguese, we decided to build a new version by fine-tuning this Multilingual version on a Portuguese corpus (note that this fine-tuning uses the original Multilingual tokenization). We used the Portuguese Wikipedia and the newspaper extracts from Público and Folha de São Paulo included in the CHAVE corpus [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The set of tokens in this new BERT version, Multilingual fine-tuned in Portuguese, is the same as in the Multilingual version.
        </p>
        <p>To try to improve the tokenization, we trained BERT from scratch on the same Portuguese corpus used to fine-tune the Multilingual version. We used a set of 32,000 tokens constructed only from our Portuguese corpus. The tokenization we obtain for the previous sentence is now:</p>
        <p>O menino e a menina estão brinca ##ndo na academia ao ar livre</p>
        <p>In this example, only the word “brincando” (“playing”) is represented by two or more tokens: it is divided into a verb lemma and a common termination. This is one of the advantages of creating a token vocabulary based on the Portuguese language.</p>
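        <p>A minimal sketch of how such a 32,000-token WordPiece vocabulary could be built is shown below; it assumes the HuggingFace tokenizers library, and the corpus file names are placeholders.</p>
        <preformat>
# Sketch (assumed tooling: HuggingFace "tokenizers"): train a 32,000-token
# WordPiece vocabulary from a Portuguese corpus; file paths are placeholders.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=["wikipedia_pt.txt", "chave_publico.txt", "chave_folha.txt"],
    vocab_size=32000,
)

# Most frequent Portuguese words now map to single, whole-word tokens.
print(tokenizer.encode("O menino e a menina estão brincando").tokens)
        </preformat>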
        <p>This Portuguese BERT version was one of those used in our ASSIN2 task submissions.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Training datasets</title>
        <p>Since the previous ASSIN challenge data (ASSIN1) is available, we used its train and test datasets to fine-tune the network in each task.</p>
        <p>
          For the RTE task, the ASSIN dataset is annotated with three labels: entailment, paraphrase, and neutral [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]; for the STS task it is annotated with a value between 1 and 5, as in ASSIN2. The ASSIN1 data has a subset for European Portuguese and another for Brazilian Portuguese. It is based on news and contains some linguistically complex phenomena, such as temporal expressions.
        </p>
        <p>The ASSIN2 dataset has about 10,000 sentence pairs without such linguistic challenges: 6,500 used for training, 500 for validation, and 2,448 for testing. It is available at https://sites.google.com/view/assin2/.</p>
        <p>In Figure 1 we present three ASSIN2 dataset examples. The tag entailment can have the values “Entailment”/“None” and similarity takes a value between 1 and 5.</p>
        <p>– entailment=“None” id=“12” similarity=“2.4”
Um homem está tocando teclado (“A man is playing keyboard”)
Um homem está tocando um violão elétrico (“A man is playing an electric guitar”)
– entailment=“Entailment” id=“451” similarity=“1.5”
Um cara está brincando animadamente com uma bola de meia (“A guy is playing excitedly with a sock ball”)
O homem não está tocando piano (“The man is not playing the piano”)
– entailment=“Entailment” id=“459” similarity=“4.7”
Um homem está andando de cavalo na praia (“A man is riding a horse on the beach”)
Um cara está montando um cavalo (“A guy is riding a horse”)</p>
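        <p>These examples correspond to pairs in the ASSIN2 XML files. The sketch below reads such pairs; it assumes the usual ASSIN layout, in which each pair element carries the entailment and similarity attributes shown above and has "t" (premise) and "h" (hypothesis) children, and the file name is a placeholder.</p>
        <preformat>
# Sketch: read ASSIN2 sentence pairs, assuming pair elements with the
# entailment/similarity attributes and "t"/"h" children; path is a placeholder.
import xml.etree.ElementTree as ET

def read_pairs(path):
    pairs = []
    for pair in ET.parse(path).getroot().iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "entailment": pair.get("entailment"),
            "similarity": float(pair.get("similarity")),
            "premise": pair.find("t").text,
            "hypothesis": pair.find("h").text,
        })
    return pairs

# pairs = read_pairs("assin2-train.xml")  # placeholder file name
        </preformat>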
        <p>To train our systems in each task, several epochs of mini-batch gradient descent were run until the results on the dev set started to decline. In Figure 2 we present the MSE values for a typical training run of the Similarity task. In this example, we used 33,458 steps to train the model; each epoch corresponds to 257 steps. In Figure 3 we present the Pearson correlation values for the same training run.</p>
        <p>In the Recognizing Textual Entailment task, the loss used for training was binary cross-entropy, while in the Semantic Textual Similarity task the loss used for training was the mean squared error, although the main metric for the task was the Pearson correlation.</p>
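        <p>A condensed sketch of this fine-tuning setup is shown below: a single linear layer is added on top of the pooled BERT output and the whole network is trained with mini-batch gradient descent, using MSE for the Similarity task (binary cross-entropy would take its place for the Entailment task) and stopping when the dev results start to decline. The library calls, hyper-parameters, and names are illustrative assumptions, not our exact submission settings.</p>
        <preformat>
# Sketch of fine-tuning: one added linear layer on the pooled BERT output,
# MSE loss (STS) or binary cross-entropy (RTE), early stopping on the dev set.
# Checkpoint name, learning rate, and loop details are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class SimilarityModel(nn.Module):
    def __init__(self, name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.head = nn.Linear(768, 1)  # the single task-specific layer

    def forward(self, input_ids, attention_mask, token_type_ids):
        pooled = self.bert(input_ids, attention_mask=attention_mask,
                           token_type_ids=token_type_ids).pooler_output
        return self.head(pooled).squeeze(-1)

model = SimilarityModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.MSELoss()  # nn.BCEWithLogitsLoss() for the RTE task

best_dev = float("inf")
# for epoch in range(max_epochs):
#     ... run mini-batch gradient descent over the training pairs ...
#     dev_mse = evaluate(model, dev_pairs)   # hypothetical helper
#     if dev_mse > best_dev:                 # dev results started to decline
#         break
#     best_dev = dev_mse
        </preformat>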
      </sec>
      <sec id="sec-3-3">
        <title>Recognizing Textual Entailment</title>
        <p>In this task, our starting point was always the Multilingual BERT fine-tuned in Portuguese.</p>
        <p>Table 1 presents our results for three systems that were built with three sets of training data:
1. ASSIN2 training data (ASSIN2)
2. ASSIN1 Brazilian Portuguese training and test data plus ASSIN2 training data (ASSIN2+ASSIN1:ptbr)
3. ASSIN1 Brazilian and European Portuguese training and test data plus ASSIN2 training data (ASSIN2+ASSIN1:ptbr+pteu)</p>
        <p>The use of the ASSIN2+ASSIN1:ptbr training data gave slightly better results than the others, as can be seen in bold in Table 1. We used 25 epochs to train the system.</p>
        <p>Our best results are:
– When the system is evaluated on dev, the ASSIN2 dataset used for
testing/improving our systems, F1 - 0.956, Accuracy - 95.60.
– When the system is evaluated on test, the ASSIN2 final competition dataset,
F1 - 0.876, Accuracy - 87.58.</p>
        <p>Surprisingly, when we use the ASSIN2+ASSIN1:ptbr+pteu training data, the results get worse on both sets, dev and test. This may be due to the fact that the ASSIN2 dataset was built with Brazilian Portuguese.</p>
        <p>When only ASSIN2 is used as training data, the results get even worse on both sets; this confirms that the use of more data in training can improve our systems.</p>
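        <p>For reference, the sketch below shows how the F1 and accuracy figures above can be computed with scikit-learn; the binary label encoding (1 for “Entailment”, 0 for “None”) and the macro averaging are assumptions for illustration.</p>
        <preformat>
# Sketch of the RTE evaluation metrics; label encoding and averaging mode
# are illustrative assumptions.
from sklearn.metrics import accuracy_score, f1_score

gold = [1, 0, 1, 1, 0]  # toy labels: 1 = Entailment, 0 = None
pred = [1, 0, 1, 0, 0]

print("Accuracy:", accuracy_score(gold, pred))
print("F1:", f1_score(gold, pred, average="macro"))
        </preformat>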
      </sec>
      <sec id="sec-3-4">
        <title>Semantic Textual Similarity</title>
        <p>In this task we always used the ASSIN1:ptbr+pteu data and the ASSIN2 training data to fine-tune each of the BERT versions.</p>
        <p>We tried the three BERT versions described above.</p>
        <p>As Table 2 reports, the best results were achieved with the Multilingual version without fine-tuning to Portuguese; we used 235 epochs of training for the best submission.</p>
        <p>The best results with the Multilingual version were:
– When the system is evaluated on dev, the ASSIN2 dataset used for testing/improving our systems: Pearson correlation - 0.968, MSE - 0.078.
– When the system is evaluated on test, the ASSIN2 final competition dataset: Pearson correlation - 0.826, MSE - 0.523.</p>
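        <p>The corresponding STS evaluation, with the Pearson correlation as the main metric and MSE as a secondary one, can be sketched as below; the toy score values are illustrative only.</p>
        <preformat>
# Sketch of the STS evaluation: Pearson correlation and mean squared error
# between predicted and gold similarity scores (toy values).
import numpy as np
from scipy.stats import pearsonr

gold = np.array([2.4, 1.5, 4.7, 3.0])
pred = np.array([2.1, 1.9, 4.5, 3.3])

print("Pearson:", pearsonr(gold, pred)[0])
print("MSE:", np.mean((gold - pred) ** 2))
        </preformat>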
        <p>The Multilingual BERT version fine-tuned in Portuguese that was submitted to ASSIN2 contained an error, so in Table 2 we present the results for the non-official version. As can be seen in the table, this version has a lower performance than the Multilingual version. The Portuguese version has the worst results, but they encourage us to improve it by using more Portuguese data in the training of BERT.</p>
        <p>[Table 2: Semantic Textual Similarity results for the three BERT versions: Multilingual, Multilingual fine-tuned in Portuguese (non-official), and Portuguese.]</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>Our results in the ASSIN2 challenge (see Table 3), first place in the Similarity task and second place in the Entailment task, show that fine-tuning BERT is at the moment one of the best approaches to Portuguese semantic NLP tasks. We expect to improve the results by properly training BERT from scratch on a large and well-suited Portuguese corpus, which still has to be assembled. Different versions of BERT also need to be considered. We used BERT-Base, but a larger version, BERT-Large (340 million parameters), has achieved better results on English NLP tasks. Given that the performance of a model also depends on the available training data, and that the data available for Portuguese is not as large as for English, we plan to experiment with ALBERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a lighter version of BERT.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) through the project UID/MAT/00297/2019 (Centro de Matemática e Aplicações) and the grant UID/CEC/4668/2016 (Laboratório de Informática, Sistemas e Paralelismo).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ducharme</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vincent</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janvin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>3</volume>
          ,
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          (
          <year>Mar 2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Collobert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A unified architecture for natural language processing: Deep neural networks with multitask learning</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on Machine Learning</source>
          . pp.
          <fpage>160</fpage>
          -
          <lpage>167</lpage>
          . ICML '08,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Association for Computational Linguistics, Minneapolis,
          <source>Minnesota (Jun</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fonseca</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Borges dos Santos,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Criscuolo</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          , Aluísio, S.:
          <article-title>Visão geral da avaliação de similaridade semântica e inferência textual</article-title>
          .
          <source>Linguamática 8(2)</source>
          ,
          <fpage>3</fpage>
          -
          <lpage>13</lpage>
          (12
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McClelland</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rumelhart</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          :
          <article-title>Distributed representations</article-title>
          .
          <source>In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition</source>
          , Vol.
          <volume>1</volume>
          : Foundations, p.
          <fpage>77</fpage>
          -
          <lpage>109</lpage>
          . MIT Press, Cambridge, MA, USA (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Universal language model fine-tuning for text classification</article-title>
          .
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . pp.
          <fpage>328</fpage>
          -
          <lpage>339</lpage>
          . Association for Computational Linguistics, Melbourne,
          <source>Australia (Jul</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>Z.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goodman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gimpel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soricut</surname>
          </string-name>
          , R.: Albert:
          <article-title>A lite bert for self-supervised learning of language representations</article-title>
          . ArXiv abs/1909.11942 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.: Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In: Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers). pp.
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          . Association for Computational Linguistics, New Orleans,
          <source>Louisiana (Jun</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Real</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fonseca</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , Gonçalo Oliveira, H.:
          <article-title>The ASSIN 2 shared task: Evaluating Semantic Textual Similarity and Textual Entailment in Portuguese</article-title>
          .
          <source>In: Proceedings of the ASSIN 2 Shared Task: Evaluating Semantic Textual Similarity and Textual Entailment in Portuguese</source>
          . p. [In this volume].
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rocha</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>The key to the first CLEF with Portuguese: topics, questions and answers in CHAVE</article-title>
          . In: Peters,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.J.F.</given-names>
            ,
            <surname>Kluck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <surname>B</surname>
          </string-name>
          . (eds.)
          <article-title>Multilingual Information Access for Text, Speech and Images</article-title>
          . pp.
          <fpage>821</fpage>
          -
          <lpage>832</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakajima</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Japanese and Korean voice search</article-title>
          .
          <source>2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          pp.
          <fpage>5149</fpage>
          -
          <lpage>5152</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          .
          <source>In: Proceedings of the 31st International Conference on Neural Information Processing Systems</source>
          . pp.
          <fpage>6000</fpage>
          -
          <lpage>6010</lpage>
          . NIPS'
          <volume>17</volume>
          , Curran Associates Inc.,
          <source>USA</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>