<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ELiRF-UPV at IroSvA: Transformer Encoders for Spanish Irony Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José-Ángel González</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lluís-Felip Hurtado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ferran Pla</string-name>
          <email>fpla@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>VRAIN: Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>278</fpage>
      <lpage>284</lpage>
      <abstract>
        <p>This paper describes the participation of the ELiRF-UPV team in the three subtasks proposed at the IroSvA 2019 shared task. We have developed a model based on Transformer Encoders and Spanish Twitter embeddings learned from a large number of tweets downloaded at our laboratory. Transformer Encoders are able to model long-range complex relationships among terms in a text without convolutional or recurrent layers. We addressed the three subtasks, related to three Spanish variants, using the same model. The results obtained on the validation corpus seem to confirm the adequacy of the proposed model for the irony detection task. In the final ranking, our proposal is the only system that consistently outperforms the baselines of the organizers, being the first ranked system by a considerable margin of Macro F1 averaged over the three subtasks.</p>
      </abstract>
      <kwd-group>
        <kwd>IroSvA19</kwd>
        <kwd>Irony</kwd>
        <kwd>Spanish Variants</kwd>
        <kwd>Transformer Encoders</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Irony is a rhetorical device in which words are used in such a way that their
intended meaning is different from the literal meaning of the words. The
automatic detection of irony is an emerging topic in many natural language processing
tasks. It has important implications for the final performance of
applications that process text automatically, especially when semantic analysis is
required. For example, in sentiment analysis, polarity tends to change
when irony is used. In the SemEval workshop framework, tasks such as Task 11:
Sentiment Analysis of Figurative Language in Twitter at SemEval 2015 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or
Task 3: Irony Detection in English tweets at SemEval 2018 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have been
proposed to quantify the impact of figurative language on the Sentiment Analysis
task for the English language.
      </p>
      <p>
        In this paper, we describe the main characteristics of the system designed
by the ELiRF-UPV team to address the tasks proposed at the IroSvA 2019
shared task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. IroSvA focuses on the Spanish language from Spain, Mexico and
Cuba. The task is structured into three subtasks, each one predicting whether
messages are ironic or not in one of the three Spanish variants.
      </p>
    </sec>
    <sec id="sec-2">
      <title>System</title>
      <p>In this section, we discuss the system architecture proposed to address all three
IroSvA19 subtasks, and we describe the resources used and the
preprocessing applied to the tweets.</p>
      <sec id="sec-2-1">
        <title>Resources and preprocessing</title>
        <p>
          In order to learn a word embedding model for Twitter in Spanish, we downloaded
87 million tweets of several Spanish variants. To provide the embedding layer of
our system with a rich semantic representation on the Twitter domain, we use
300-dimensional word embeddings extracted from a skip-gram model [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] trained
on the 87 million tweets using the Word2Vec framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>We applied the same preprocessing to all the data, both the tweets
used to learn the Word2Vec embedding model and those provided by the
organization to learn the irony detection model. First, all the tweets are
case-folded. Second, the tweets are tokenized with
ToktokTokenizer from NLTK. Third, user mentions, hashtags and URLs are replaced
by three generic class tokens (user, hashtag and url, respectively). Finally,
elongated tokens are shortened so that the same vowel appears at most twice
consecutively in a token (e.g. jaaaa becomes jaa).</p>
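        <p>The preprocessing steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the regular expressions and the function name preprocess are hypothetical, a plain whitespace split stands in for the NLTK tokenizer, and the replacement of mentions, hashtags and URLs is done before splitting rather than after tokenization.</p>
        <preformat><![CDATA[
```python
import re
from typing import List

# Hypothetical patterns; the generic tokens (user, hashtag, url) follow the text.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# The same vowel repeated three or more times in a row.
ELONGATED_VOWEL_RE = re.compile(r"([aeiouáéíóú])\1{2,}")


def preprocess(tweet: str) -> List[str]:
    """Case-fold, replace special items by generic tokens, shorten elongations."""
    text = tweet.lower()                              # case folding
    text = URL_RE.sub(" url ", text)                  # URLs -> url
    text = MENTION_RE.sub(" user ", text)             # @mentions -> user
    text = HASHTAG_RE.sub(" hashtag ", text)          # #hashtags -> hashtag
    text = ELONGATED_VOWEL_RE.sub(r"\1\1", text)      # jaaaa -> jaa
    return text.split()  # assumption: whitespace split instead of the NLTK tokenizer


print(preprocess("@Pepe jaaaa que BUENOOOO #ironia https://t.co/abc"))
# ['user', 'jaa', 'que', 'buenoo', 'hashtag', 'url']
```
]]></preformat>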
      </sec>
      <sec id="sec-2-2">
        <title>Transformer Encoders</title>
        <p>
          Our irony detection system is based on the Transformer [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] model. Initially
proposed for machine translation, the Transformer model dispenses with
convolutions and recurrence to learn long-range relationships. Instead of these
mechanisms, it relies on multi-head self-attention, where multiple attention functions
among the terms of a sequence are computed in parallel to take into account
different relationships among them.
        </p>
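        <p>As an illustration of the mechanism described above, the following sketch computes scaled dot-product self-attention in plain Python and runs several heads in parallel over slices of the input features. It is a simplification, not the system's implementation: the learned projection matrices of the original Transformer are omitted, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


def multi_head_self_attention(X: Matrix, h: int) -> Matrix:
    """Attend separately over h equal feature slices and concatenate the heads.

    Simplification: the learned projections are left out, so each head
    only sees its own slice of the input features.
    """
    d = len(X[0])
    if d % h != 0:
        raise ValueError("feature size must be divisible by the number of heads")
    d_head = d // h
    heads = []
    for i in range(h):
        part = [row[i * d_head:(i + 1) * d_head] for row in X]
        heads.append(scaled_dot_product_attention(part, part, part))
    # Concatenate the h head outputs for every position t.
    return [[value for head in heads for value in head[t]] for t in range(len(X))]


X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_self_attention(X, h=2))
```
]]></preformat>
        <p>Each output row is a weighted average of the value rows, so every position can draw on every other position regardless of distance.</p>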
        <p>Concretely, we use only the encoder part in order to extract vector
representations that are useful to determine the presence of irony. We denote this
encoding part of the Transformer model as Transformer Encoder. Figure 1 shows
a representation of the proposed architecture for irony detection.</p>
        <p>
          The input of the model is a tweet <inline-formula><tex-math>X = \{x_1, x_2, \ldots, x_T : x_i \in \{0, \ldots, V\}\}</tex-math></inline-formula>, where
T is the maximum length of the tweet and V is the vocabulary size. This tweet is
fed to a d-dimensional fixed embedding layer, E, initialized with the weights of
our embedding model. Moreover, to take positional information into account, we
also experimented with the sine and cosine functions proposed in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. After the
combination of the word embeddings with the positional information, dropout
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] was used to drop input words with a certain probability p. On top of these
representations, <inline-formula><tex-math>N_x</tex-math></inline-formula> Transformer encoders are applied, which rely on multi-head
scaled dot-product attention. To do this, we used an architecture similar to the
one described in [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. It includes layer normalization [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and residual
connections.
        </p>
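        <p>The sine and cosine positional functions mentioned above can be sketched as follows. The formula is the one from the original Transformer paper; combining the encodings with the word embeddings by element-wise addition is an assumption here, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from typing import List

Matrix = List[List[float]]


def positional_encoding(T: int, d: int) -> Matrix:
    """PE[pos][2i] = sin(pos / 10000^(2i/d)), PE[pos][2i+1] = cos(pos / 10000^(2i/d))."""
    pe = [[0.0] * d for _ in range(T)]
    for pos in range(T):
        for j in range(d):
            angle = pos / (10000 ** ((2 * (j // 2)) / d))
            pe[pos][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return pe


def add_positions(embedded: Matrix, pe: Matrix) -> Matrix:
    """Assumption: embeddings and encodings are combined by element-wise addition."""
    return [[e + p for e, p in zip(e_row, p_row)]
            for e_row, p_row in zip(embedded, pe)]


# T = 50 and d = 300 match the tweet length and embedding size given in the text.
pe = positional_encoding(T=50, d=300)
print(len(pe), len(pe[0]))  # 50 300
```
]]></preformat>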
        <p>Since a single vector representation is required to train classifiers on top of these
encoders, a global average pooling mechanism is applied to the output of the
last encoder, and the result is used as input to a feed-forward neural network with only
one hidden layer, whose output layer computes a probability distribution over
the two classes of the task, <inline-formula><tex-math>C = \{\mathit{Ironic}, \mathit{NoIronic}\}</tex-math></inline-formula>.</p>
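        <p>A minimal sketch of this classification head is given below. It assumes a ReLU activation in the hidden layer and unmasked averaging over all positions, neither of which is stated in the text; the weights and function names are placeholders for illustration only.</p>
        <preformat><![CDATA[
```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def global_average_pooling(H: Matrix) -> Vector:
    """Average the encoder outputs over the T positions, one value per feature."""
    return [sum(column) / len(H) for column in zip(*H)]


def dense(x: Vector, W: Matrix, b: Vector) -> Vector:
    """Affine layer x W + b, with W stored as (inputs x outputs)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]


def softmax(z: Vector) -> Vector:
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(H: Matrix, W1: Matrix, b1: Vector, W2: Matrix, b2: Vector) -> Vector:
    """Pooling, one hidden layer (ReLU is an assumption), softmax over two classes."""
    pooled = global_average_pooling(H)
    hidden = [max(0.0, v) for v in dense(pooled, W1, b1)]
    return softmax(dense(hidden, W2, b2))


# Toy encoder output with T = 2 positions and d = 2 features; identity weights.
H = [[1.0, 3.0], [3.0, 5.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(classify(H, identity, [0.0, 0.0], identity, [0.0, 0.0]))
```
]]></preformat>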
        <p>
          We use Adam as the update rule, with <inline-formula><tex-math>\beta_1 = 0.9</tex-math></inline-formula> and <inline-formula><tex-math>\beta_2 = 0.999</tex-math></inline-formula>, and Noam as the
learning rate schedule [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] with 15 warmup steps. Weighted cross entropy is
used as the loss function because the distribution of the classes is biased towards
the NoIronic class in a proportion of 2:1 in all the given corpora. The same
proportion is used for the weight terms of the cross-entropy loss function.
        </p>
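        <p>The learning rate schedule and the weighted loss can be sketched as follows. The Noam formula is the one from the original Transformer paper. Two points are assumptions for illustration: the model size of 300 is taken from the embedding dimension, and the weight of 2 is assigned to the minority Ironic class, since the text only says that the 2:1 proportion is used as weights.</p>
        <preformat><![CDATA[
```python
import math
from typing import Dict


def noam_lr(step: int, d_model: int = 300, warmup: int = 15) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Assumption: the minority class (Ironic) receives the larger weight.
CLASS_WEIGHTS = {"NoIronic": 1.0, "Ironic": 2.0}


def weighted_cross_entropy(probs: Dict[str, float], gold: str) -> float:
    """Negative log-likelihood of the gold class, scaled by its class weight."""
    return -CLASS_WEIGHTS[gold] * math.log(probs[gold])


# The rate grows linearly during warmup and decays afterwards.
for step in (1, 15, 100):
    print(step, noam_lr(step))

# The same error costs twice as much on an Ironic sample.
print(weighted_cross_entropy({"Ironic": 0.5, "NoIronic": 0.5}, "Ironic"))
```
]]></preformat>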
        <fig id="fig-1">
          <label>Figure 1</label>
          <caption>
            <p>Proposed architecture for irony detection. Block labels recovered from the figure, from output to input: Softmax; Feed-Forward; Global Pooling; Nx encoder blocks (Add &amp; Norm, Feed-Forward, Add &amp; Norm, Multi-Head Attention); Embedding and Positional Encoding; Input.</p>
          </caption>
        </fig>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>The three subtasks proposed at IroSvA19 have the same goal: to determine whether a
text sample is ironic or not according to a given context. They differ in
the Spanish variant in which the text is written and in the kind of text
to be classified. Subtask A aims to detect irony in Spanish tweets from Spain,
Subtask B aims to detect irony in Mexican Spanish tweets, and Subtask C aims
to detect irony in Spanish news comments from Cuba. In this work we used only
the text sample, dispensing with the context.</p>
      <p>To address the three subtasks, the IroSvA19 organization provided three
sample sets (one per subtask). Each set is composed of 2,400 labeled documents,
1,600 of which are labeled as NoIronic and the remaining 800 as
Ironic. We divided each set provided by the organization into two subsets: a
training set of 2,100 samples and a development set of 300 samples. To do this,
we selected 200 NoIronic and 100 Ironic samples for development, maintaining
the 2:1 imbalance towards the NoIronic class in both the training and the
development sets.</p>
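      <p>The split described above can be reproduced with a short stratified sampling routine. The sketch below uses synthetic placeholder documents with the class counts reported in the text; the function name, the random seed and the use of random sampling are assumptions, since the text does not say how the development samples were chosen.</p>
      <preformat><![CDATA[
```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, str]  # (text, label)


def stratified_split(samples: List[Sample], dev_per_class: Dict[str, int],
                     seed: int = 13) -> Tuple[List[Sample], List[Sample]]:
    """Hold out a fixed number of samples per class for development."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Sample]] = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    train, dev = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_dev = dev_per_class[label]
        dev.extend(items[:n_dev])
        train.extend(items[n_dev:])
    return train, dev


# Synthetic placeholder data: 1,600 NoIronic and 800 Ironic documents.
data = [(f"doc{i}", "NoIronic") for i in range(1600)]
data += [(f"doc{i}", "Ironic") for i in range(1600, 2400)]

train, dev = stratified_split(data, {"NoIronic": 200, "Ironic": 100})
print(len(train), len(dev))  # 2100 300
```
]]></preformat>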
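      <p>During the training phase, we fixed some hyper-parameters, concretely: <inline-formula><tex-math>d_k = 64</tex-math></inline-formula>,
<inline-formula><tex-math>d_{ff} = d</tex-math></inline-formula>, <inline-formula><tex-math>T = 50</tex-math></inline-formula>, <inline-formula><tex-math>h = 8</tex-math></inline-formula> and batch size = 32. Other hyper-parameters,
such as p and the warmup steps, were set following some preliminary experiments to
p = 0.7 and warmup steps = 15 epochs.</p>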
      <p>
        Moreover, we compare our proposal, which is based on Transformer Encoders
(TE), with other deep learning systems such as Deep Averaging Networks
(DAN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Attention Long Short Term Memory Networks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (Att-LSTM)
that are commonly used in related text classification tasks, obtaining very
competitive results [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
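      <p>It is also interesting to observe how some system mechanisms, such as the
positional encodings, or hyper-parameters such as <inline-formula><tex-math>N_x</tex-math></inline-formula> affect the results obtained in
terms of macro-F1 (<inline-formula><tex-math>M_{F_1}</tex-math></inline-formula>), macro-recall (<inline-formula><tex-math>M_R</tex-math></inline-formula>), macro-precision (<inline-formula><tex-math>M_P</tex-math></inline-formula>) and class
level metrics (<inline-formula><tex-math>(F_{1_i}, P_i, R_i)</tex-math></inline-formula>, with i = 0 for NoIronic and i = 1 for Ironic). Concretely, we tried
removing the positional information and using <inline-formula><tex-math>1 \le N_x \le 2</tex-math></inline-formula> encoders. All these variants
are applied only to the Spanish (ES) subtask, and the best two configurations are also used
in the remaining subtasks. All these results are shown in Table 1.</p>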
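      <p>For reference, the class-level and macro-averaged metrics used in the evaluation can be computed as in the sketch below. It assumes that macro-F1 is the unweighted mean of the per-class F1 scores; the toy labels are made up for illustration and are not results from the paper.</p>
      <preformat><![CDATA[
```python
from typing import List, Sequence, Tuple


def class_prf(gold: Sequence[int], pred: Sequence[int], label: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_metrics(gold: Sequence[int], pred: Sequence[int],
                  labels: Sequence[int] = (0, 1)) -> Tuple[float, float, float]:
    """Unweighted means of per-class precision, recall and F1 (MP, MR, MF1)."""
    per_class: List[Tuple[float, float, float]] = [class_prf(gold, pred, l) for l in labels]
    n = len(labels)
    return (sum(c[0] for c in per_class) / n,
            sum(c[1] for c in per_class) / n,
            sum(c[2] for c in per_class) / n)


# Toy example: 0 = NoIronic, 1 = Ironic.
gold = [0, 0, 0, 0, 1, 1]
pred = [0, 0, 0, 1, 1, 0]
print(macro_metrics(gold, pred))  # (0.625, 0.625, 0.625)
```
]]></preformat>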
      <p>As shown in Table 1, for the 1-TE-Pos and 2-TE-Pos systems, the positional
encoding information harms the performance of the system. Moreover, the results
obtained with 1-TE-Pos are very similar to those obtained with the 1-layer
Att-LSTM, which seems to indicate that positional information, whether provided by positional
encodings or by the internal memory of the LSTM, is not useful for the Spanish
subtask.</p>
      <p>It is interesting to see that, when the positional information is not used,
a single encoder behaves well, whereas using <inline-formula><tex-math>N_x = 2</tex-math></inline-formula> in this case hurts the
performance of the system in comparison to <inline-formula><tex-math>N_x = 1</tex-math></inline-formula>. This effect does not occur
when the positional information is considered, which seems to indicate that a
larger number of parameters is required to take the positional
information into account.</p>
      <p>The 1-TE-NoPos system outperforms the other systems in almost all metrics,
except for the precision on class 1 and the recall on class 0 with respect
to DAN. Moreover, the F1 on class 0 is very similar for both systems.
However, the improvements provided by the 1-TE-NoPos system (about 4.5 points of
F1 on class 1, precision and recall on class 0, as well as the improvement
of 12 points in the recall of class 1) make this system more competitive than
DAN in terms of the macro metrics.</p>
      <p>Since these two systems are the most competitive on the development
set of the ES subtask, we experimented with both architectures in the other
subtasks to observe their behaviour.</p>
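      <p>On the MX subtask, the results of both systems are similar, with the
1-TE-NoPos system again obtaining the best results in all the metrics. However, on
the CU subtask, the differences between the results of both systems are bigger,
with improvements of 9 points of <inline-formula><tex-math>M_{F_1}</tex-math></inline-formula>, <inline-formula><tex-math>M_P</tex-math></inline-formula> and <inline-formula><tex-math>M_R</tex-math></inline-formula>.</p>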
      <p>
        Finally, our best system 1-TE-NoPos (ELiRF-UPV) is used to label the test
set. The results obtained are shown in Table 2. Our system outperforms the
proposed baselines [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in all the metrics, by a margin of 2 to 4 points in terms
of <inline-formula><tex-math>M_{F_1}</tex-math></inline-formula> with respect to the best baseline. Moreover, Figure 3 shows the best
five participants in the final ranking of the competition, where our proposal is
the best ranked system by a margin of 2.5 points of <inline-formula><tex-math>M_{F_1}</tex-math></inline-formula> averaged over the three
subtasks with respect to the second ranked system.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>We have proposed a system based on the encoder part of the Transformer
architecture in order to extract useful word representations that are discriminative for
deciding the presence of irony in short texts. The results obtained by our system
are very promising, especially considering that they have been obtained without
extensive experimentation on the hyperparameters of the model. This opens the
door to future improvements by exploring modifications of the architecture and
its hyperparameters.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been partially supported by the Spanish MINECO and FEDER
funds under project AMIC (TIN2017-85854-C4-2-R) and the GiSPRO project
(PROMETEU/2018/176). The work of José-Ángel González is financed by
Universitat Politècnica de València under grant PAID-01-17.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Layer normalization</article-title>
          .
          <source>CoRR abs/1607</source>
          .06450 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Francisco Rangel, Paolo Rosso,
          <string-name>
            <surname>M.F.S.:</surname>
          </string-name>
          <article-title>A low dimensionality representation for language variety identification</article-title>
          .
          <source>In: 17th International Conference on Intelligent Text Processing and Computational Linguistics</source>
          ,
          <source>CICLing'16</source>
          . Springer-Verlag,
          <source>LNCS(9624)</source>
          , pp.
          <fpage>156</fpage>
          -
          <lpage>169</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veale</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shutova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnden</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reyes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          : SemEval-2015 task 11:
          <article-title>Sentiment analysis of figurative language in Twitter</article-title>
          .
          <source>In: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval</source>
          <year>2015</year>
          ). pp.
          <volume>470</volume>
          –
          <fpage>478</fpage>
          . Association for Computational Linguistics, Denver, Colorado (Jun
          <year>2015</year>
          ). https://doi.org/10.18653/v1/
          <fpage>S15</fpage>
          -2080
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>González</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hurtado</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pla</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>ELiRF-UPV en TASS 2017: Análisis de Sentimientos en Twitter basado en Aprendizaje Profundo (ELiRF-UPV at TASS 2017: Sentiment Analysis in Twitter based on Deep Learning)</article-title>
          .
          <source>In: Proceedings of TASS 2017: Workshop on Semantic Analysis at SEPLN, TASS@SEPLN</source>
          <year>2017</year>
          ,
          <article-title>co-located with 33rd SEPLN Conference (SEPLN</article-title>
          <year>2017</year>
          ), Murcia, Spain,
          <year>September 18th</year>
          ,
          <year>2017</year>
          . pp.
          <volume>29</volume>
          –
          <issue>34</issue>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>González</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hurtado</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pla</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>ELiRF-UPV en TASS 2018: Análisis de Sentimientos en Twitter basado en Aprendizaje Profundo (ELiRF-UPV at TASS 2018: Sentiment Analysis in Twitter based on Deep Learning)</article-title>
          .
          <source>In: Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN, TASS@SEPLN</source>
          <year>2018</year>
          ,
          <article-title>co-located with 34th SEPLN Conference (SEPLN</article-title>
          <year>2018</year>
          ), Sevilla, Spain,
          <year>September 18th</year>
          ,
          <year>2018</year>
          . pp.
          <volume>37</volume>
          –
          <issue>44</issue>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Comput</source>
          .
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <volume>1735</volume>
          –1780 (Nov
          <year>1997</year>
          ). https://doi.org/10.1162/neco.
          <year>1997</year>
          .
          <volume>9</volume>
          .8.
          <fpage>1735</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manjunatha</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyd-Graber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daumé</surname>
            <given-names>III</given-names>
          </string-name>
          , H.:
          <article-title>Deep unordered composition rivals syntactic methods for text classification</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          . pp.
          <volume>1681</volume>
          –
          <fpage>1691</fpage>
          . Association for Computational Linguistics, Beijing, China (Jul
          <year>2015</year>
          ). https://doi.org/10.3115/v1/
          <fpage>P15</fpage>
          -1162
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2</source>
          . pp.
          <volume>3111</volume>
          –
          <fpage>3119</fpage>
          . NIPS'
          <volume>13</volume>
          , Curran Associates Inc.,
          <source>USA</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ortega-Bueno</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , Hernández Farías,
          <string-name>
            <given-names>D.I.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Montes-</surname>
          </string-name>
          y-Gómez,
          <string-name>
            <surname>M.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Medina</given-names>
            <surname>Pagola</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.E.</surname>
          </string-name>
          :
          <article-title>Overview of the Task on Irony Detection in Spanish Variants</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ),
          <article-title>co-located with 34th Conference of the Spanish Society for Natural Language Processing (SEPLN</article-title>
          <year>2019</year>
          ).
          <article-title>CEUR-WS.org (</article-title>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
          </string-name>
          , R.:
          <article-title>Dropout: A simple way to prevent neural networks from overfitting</article-title>
          .
          <source>Journal of Machine Learning Research 15</source>
          ,
          <year>1929</year>
          –
          <year>1958</year>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Van Hee</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lefever</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>SemEval-2018 task 3: Irony detection in English tweets</article-title>
          .
          <source>In: Proceedings of The 12th International Workshop on Semantic Evaluation</source>
          . pp.
          <volume>39</volume>
          –
          <fpage>50</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
          </string-name>
          , L.u.,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          . In: Guyon,
          <string-name>
            <given-names>I.</given-names>
            ,
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.V.</given-names>
            ,
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Garnett</surname>
          </string-name>
          ,
          <string-name>
            <surname>R</surname>
          </string-name>
          . (eds.)
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          , pp.
          <volume>5998</volume>
          –
          <fpage>6008</fpage>
          . Curran Associates, Inc. (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>