<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fully Convolutional Networks for Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacob Anderson</string-name>
          <aff>Sentim LLC, Columbus</aff>
          <email>papers@sentimllc.com</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In this work I propose a new way of using fully convolutional networks for classification while allowing for input of any size. I additionally propose two modifications of the idea of attention and discuss the benefits and detriments of using the modifications. Finally, I show suboptimal results on the ITAmoji 2018 tweet to emoji task, provide a discussion of why that might be the case, and propose a fix to further improve results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Fully convolutional networks produce a
proportionally sized output from any sized input. In text
classification tasks, this often means that the input is
fixed in size in order for the output to also have a
fixed size.</p>
      <p>
        Other recent work in language understanding
and translation uses a concept called attention.
Attention is particularly useful for language
understanding tasks as it creates a mechanism for
relating different position of a single sequence to each
other
        <xref ref-type="bibr" rid="ref11">(Vaswani et al., 2017)</xref>
        .
      </p>
      <p>In this work I propose a new way of using fully
convolutional networks for classification to allow
for any sized input length without adding or
removing data. I also propose two modifications on
attention and then discuss the benefits and
detriments of using the modified versions as compared
to the unmodified version.
</p>
    </sec>
    <sec id="sec-2">
      <title>Model Description</title>
      <p>The overall architecture of my fully convolutional
network design is shown in Figure 1. My model
begins with a character embedding where each
character in the input maps to a vector of size 16.
I then apply a causal convolution with 128
filters of size 3, followed by a stack of 9
layers of residual dilated convolutions with skip
connections, each of which uses 128 filters of size
7. The size of 7 here was chosen by inspection, as
it converged faster than size 3 or 5 while not
consuming too much memory. Additionally, the
dilation rate of each layer of the stack doubles for
every layer, so the first layer has rate 1, then the
second layer has rate 2, then rate 4, and so on.</p>
      <p>All of the skip connections are combined with
a summation immediately followed by a ReLU to
increase nonlinearity. Finally, the output of the
network is computed using a convolution with
25 filters each of size 1, followed by a global max
pool operation. The global max pool operation
reduces the 3D tensor of size (batch size, input
length, 25) to (batch size, 25) in order to match the
expected output.</p>
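      <p>As a minimal sketch (my own illustration, not the code used in the paper), the global max pool that reduces the final (batch size, input length, 25) tensor to (batch size, 25) can be written in NumPy as:</p>
      <preformat>
```python
import numpy as np

def global_max_pool(x):
    # x: (batch, time, channels); take the maximum over the time axis
    return x.max(axis=1)

batch = np.zeros((2, 7, 25))
batch[0, 3, 4] = 9.0   # one strong activation somewhere in time
pooled = global_max_pool(batch)
print(pooled.shape)    # (2, 25)
```
      </preformat>
      <p>Because the maximum is taken over the whole time axis, the output shape no longer depends on the input length.</p>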
      <p>
        I implemented all code using a combination of
Tensorflow
        <xref ref-type="bibr" rid="ref1">(Abadi et al., 2016)</xref>
        and Keras
        <xref ref-type="bibr" rid="ref3">(Chollet, 2015)</xref>
        . During training I used softmax
cross-entropy loss with an l2 regularization penalty with
a scale of .0001. I further reduced overfitting by
adding spatial dropout
        <xref ref-type="bibr" rid="ref9">(Tompson et al., 2015)</xref>
        with a drop probability of 10% in the residual
dilated convolution layers.
      </p>
      <p>At the time of creating the models in this paper, I
was limited to only a Google Colab GPU, which
comes with a runtime restriction of 12 hours per
day and a half a GB of GPU memory (a limit that
has since been raised to 13 GB). While it is
possible to continue training after the
restriction is reset, in order to maximize GPU usage,
I tried to design each iteration of the model so that
it would finish training within a 12 hour
period.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Residual Block</title>
      <p>A residual connection is any connection which
maps the input of one layer to the output of a layer
further down in the network. Residual
connections decrease training error, increase accuracy,
and increase training speed
        <xref ref-type="bibr" rid="ref4">(He et al., 2016)</xref>
        .</p>
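      <p>The definition above can be sketched in one line of NumPy (my own illustration, not code from the paper):</p>
      <preformat>
```python
import numpy as np

def residual(layer, x):
    # a residual connection maps the input of a layer directly onto the
    # output of a layer further down: output = layer(x) + x
    return layer(x) + x

y = residual(lambda t: 0.5 * t, np.ones(3))
```
      </preformat>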
    </sec>
    <sec id="sec-4">
      <title>Dilated Convolution</title>
      <p>
        A dilated convolution is a convolution where the
filter is applied over a larger area by skipping
input values according to a dilation rate. This rate
usually exponentially scales with the numbers of
layers of the network, so you would look at every
input for the first layer and then every other input
for the second, and then every fourth and so on
        <xref ref-type="bibr" rid="ref10">(van den Oord et al., 2016)</xref>
        .
      </p>
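      <p>A minimal NumPy sketch of a causal dilated convolution (my own illustration; the paper's models are built from Tensorflow/Keras layers). With the dilation rate doubling per layer, the receptive field grows exponentially with depth:</p>
      <preformat>
```python
import numpy as np

def causal_dilated_conv1d(x, w, rate):
    # output[t] sums w[j] * x[t - j*rate]; out-of-range inputs count as
    # zero, so no output position ever looks at future inputs
    y = np.zeros(x.shape[0])
    for t in range(x.shape[0]):
        for j in range(w.shape[0]):
            idx = t - j * rate
            if idx >= 0:
                y[t] += w[j] * x[idx]
    return y

x = np.arange(8, dtype=float)
print(causal_dilated_conv1d(x, np.array([1.0, 1.0]), 2))
# each position adds the input two steps back: [0, 1, 2, 4, 6, 8, 10, 12]
```
      </preformat>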
      <p>
        In this paper, I use dilated convolutions similar
to Wavenet
        <xref ref-type="bibr" rid="ref10">(van den Oord et al., 2016)</xref>
        , where each
convolution has both residual and skip
connections. However, instead of the gated activation
function from the Wavenet paper, I used local
response normalization followed by a ReLU
function. This activation function was proposed by
        <xref ref-type="bibr" rid="ref5">Krizhevsky, Sutskever, and Hinton (2012</xref>
        ), and I
used it because I found this method to achieve
equal results but faster convergence.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Residual Dilated Convolution</title>
      <p>
        A residual dilated convolution is a dilated
convolution with a residual connection. First, I take a
dilated convolution on the input and a linear
projection on the input. The dilated convolution and
the linear projection are added together and then
outputted. The dilated convolution also outputs as
a skip connection, which is eventually summed
together with every other skip connection later in
the network.
      </p>
      <p>
        In this paper, I also use the idea of skip
connections from
        <xref ref-type="bibr" rid="ref6">Long, Shelhamer, and Darrell (2015</xref>
        ).
Skip connections simply connect previous layers
with the layer right before the output in order to
fuse local and global information from across the
network. In this work, the connections are all
fused together with a summation followed by a
ReLU activation to increase nonlinearity.
      </p>
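      <p>Under my reading of the description above, the block can be sketched as follows (a hypothetical NumPy illustration, not the paper's code; dilated_conv stands in for the actual dilated convolution and proj_w for the learned linear projection):</p>
      <preformat>
```python
import numpy as np

def residual_dilated_block(x, dilated_conv, proj_w):
    # returns (residual_out, skip_out): the dilated convolution plus a
    # linear projection of the input, with the convolution output also
    # emitted as the skip connection to be summed with the other skips
    conv_out = dilated_conv(x)
    return conv_out + x @ proj_w, conv_out

# toy check with a stand-in "convolution" that just doubles its input
x = np.ones((5, 4))
res, skip = residual_dilated_block(x, lambda t: 2.0 * t, np.eye(4))
```
      </preformat>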
    </sec>
    <sec id="sec-6">
      <title>Attention and Self-Attention</title>
      <p>
        Attention can be described as mapping a query
and a set of key value pairs to an output
        <xref ref-type="bibr" rid="ref11">(Vaswani
et al., 2017)</xref>
        . Specifically, when I say attention or
‘normal’ attention, I am referring to Scaled
Dot-Product Attention. Scaled Dot-Product Attention
is computed as:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V   (1)
Where Q, K, and V are matrices representing the
queries, keys, and values respectively
        <xref ref-type="bibr" rid="ref11">(Vaswani
et al., 2017)</xref>
        .
      </p>
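      <p>Equation (1) translates directly into NumPy (a sketch of the standard formulation, not the paper's implementation):</p>
      <preformat>
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # weights = softmax(Q K^T / sqrt(d_k)); output = weights V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V
```
      </preformat>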
      <p>Self-attention then is where Q, K, and V all
come from the same source vector after a linear
projection. This allows each position in the vector
to attend to every other position in the same
vector.</p>
    </sec>
    <sec id="sec-7">
      <title>Simplified and Local Attention</title>
      <p>Simplified and local attention can both be thought
of as trying to reinforce the mapping of a key to
value pair by extracting extra information from
the key. I compute a linear transformation
followed by a softmax to get the weights on the
values. These weights and the initial values are
multiplied together element-wise in order to highlight
which of the values are the most important for
solving the problem. Simplified attention can also
be thought of as reinforcing a one-to-one
correspondence between the key and the value.</p>
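      <p>A sketch of simplified attention as described above (my own NumPy illustration, not the paper's code; W is a hypothetical learned weight matrix):</p>
      <preformat>
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def simplified_attention(keys, values, W):
    # a linear transform of the keys followed by a softmax gives the
    # weights, which then gate the values element-wise (one-to-one
    # correspondence between each key and its value)
    weights = softmax(keys @ W)
    return weights * values
```
      </preformat>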
      <p>
        Local attention is like simplified attention,
except that instead of performing a linear projection on
the keys, local attention performs a convolutional
projection on the keys. This allows the
network to use local information in the keys to attend
to the values.
      </p>
      <p>
        In multi-head attention, attention is performed
multiple times on different projections of the input
        <xref ref-type="bibr" rid="ref11">(Vaswani et al., 2017)</xref>
        . In this paper, I use either
one or eight heads in every experiment with
attention, in order to get the best results and to compare
the different methods accurately.
      </p>
      <p>In this paper, I tested seven different models, six
of which extend the base model using some type
of attention. In the models with attention,
self-attention is used right after the final convolution and
right before the global pooling operation.</p>
      <p>
        While CNNs support input of any size, they lack
the ability to generate a fixed size output and
instead produce a tensor that is proportional in size to
the input. In order for the output of the
network to have a fixed size of 25, I use max pooling
        <xref ref-type="bibr" rid="ref8">(Scherer et al., 2010)</xref>
        along the time dimension of
the last convolutional layer. I perform the max
pooling globally, meaning that I take
the maximum value over the whole time dimension
instead of over a sliding window.
      </p>
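      <p>The local attention variant described in this section can be sketched similarly (my own NumPy illustration, not the paper's code; conv_w is a hypothetical kernel-size-3 convolution weight of shape (3, channels, channels)):</p>
      <preformat>
```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def local_attention(keys, values, conv_w):
    # like simplified attention, but the weights come from a small
    # convolution over neighbouring keys (same-padded, kernel size 3)
    # instead of a position-wise linear projection
    padded = np.pad(keys, ((1, 1), (0, 0)))
    scores = np.stack([
        (padded[i:i + 3, :, None] * conv_w).sum(axis=(0, 1))
        for i in range(keys.shape[0])
    ])
    return softmax(scores) * values
```
      </preformat>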
    </sec>
    <sec id="sec-8">
      <title>Experiment and Results</title>
      <p>In this section, I go over the ITAmoji task
description and limitations, as well as my results on the
task.
</p>
    </sec>
    <sec id="sec-9">
      <title>ITAmoji Task</title>
      <p>
        This model was initially designed for the ITAmoji
task in EVALITA 2018
        <xref ref-type="bibr" rid="ref7">(Ronzano et al., 2018)</xref>
        .
The goal of this task is to predict which of 25
emojis (shown in Table 1) is most likely to be in a
given Italian tweet. The provided dataset is
250,000 Italian tweets with one emoji label per
tweet, and no additional data is allowed for
training the models. However, it is allowed to use
additional data to train unsupervised systems like
word embeddings. All results in the coming
subsections were tested on the dataset of 25,000
Italian tweets provided by the organizers.
      </p>
      <p>red heart 20.28
face with tears of joy 19.86
smiling face with heart eyes 9.45
winking face 5.35
smiling face with smiling eyes 5.13
beaming face with smiling eyes 4.11
grinning face 3.54
face blowing a kiss 3.34
smiling face with sunglasses 2.80
thumbs up 2.57
rolling on the floor laughing 2.18
thinking face 2.16
blue heart 2.02
winking face with tongue 1.93
face screaming in fear 1.78
flexed biceps 1.67
face savoring food 1.55
grinning face with sweat 1.52
loudly crying face 1.49
top arrow 1.39
two hearts 1.36
sun 1.28
kiss mark 1.12
rose 1.06
sparkles 1.06
Table 1: The 25 emoji labels and the percentage of
samples for each.</p>
      <p>
Table 2 shows my official results from the
ITAmoji competition, as well as the first and
second group scores. Table 3 shows the best result
(evaluated after the competition was complete)
according to the macro f1 score of the seven
different models I trained during the competition. It
also shows the micro f1 score at the same run of
the best macro f1 score for comparison. Table 4
shows the upper and lower bounds of the f1 scores
after the scores have stopped increasing and have
plateaued.</p>
    </sec>
    <sec id="sec-10">
      <title>Model Macro F1 Micro F1</title>
      <p>1st Place Group 0.365 0.477
2nd Place Group 0.232 0.401
Run 3: Simplified Attention 0.106 0.294
Run 2: 1 Head Attention 0.102 0.313
Run 1: No Attention 0.019 0.064
Table 2: Official results from the ITAmoji
competition, as compared to the first and second place
groups.</p>
    </sec>
    <sec id="sec-11">
      <title>Model Macro F1 Micro F1</title>
      <p>8 Head Attention 0.113 0.316
1 Head Attention 0.105 0.339
Local Attention 0.106 0.341
8 Head Local 0.106 0.337
Simplified Attention 0.106 0.341
8 Head Simplified 0.109 0.308
No Attention 0.11 0.319
Table 3: The best results from the different models
on the dataset, run after the competition was over.
(Note on Table 2: due to an off-by-one error in the
conversion from network output to emoji, the official
results for the no attention network are much worse
than its actual performance.)</p>
    </sec>
    <sec id="sec-12">
      <title>Model Macro F1 Micro F1</title>
      <p>8 Head Attention [.10, .11] [.30, .36]
1 Head Attention [.09, .11] [.30, .36]
Local Attention [.10, .11] [.30, .35]
8 Head Local [.10, .11] [.34, .36]
Simplified Attention [.10, .11] [.32, .36]
8 Head Simplified [.10, .11] [.31, .36]
No Attention [.10, .11] [.30, .36]
Table 4: The upper and lower bounds of the f1
scores of the different model types after the scores
have plateaued in training and start oscillating.</p>
      <p>While 8 head attention did outperform the 8
head local and simplified models, it is interesting
to note that this is not the case for the 1 head
versions. Additionally, the bounds for the scores
overlap significantly, so there are no statistically
significant gains for one method over the other. This
result, along with my comparatively worse scores,
is probably because the max pooling at the end of
my model was throwing away too much
information in order to make the size consistent.</p>
    </sec>
    <sec id="sec-13">
      <title>Discussion</title>
      <p>In the upcoming sections, I discuss a possible
problem with the design of my models and
propose a few solutions for that problem. I further
discuss the two new modifications on attention
that I proposed and their possible uses.
</p>
    </sec>
    <sec id="sec-14">
      <title>Loss of Information While Pooling</title>
      <p>For the problem of throwing away too much
information during the pooling or downsampling
phase, there are three main approaches that could
be explored, each with their positives and
negatives.</p>
      <p>The first approach is to just fix the size of the
input and use fully connected layers or similar
approaches to find the correct output. This is the
current approach by most researchers, and has shown
good results. The main negative here is that the
input size must be fixed, and fixing the input size
could mean throwing away or adding information
that isn’t naturally there.</p>
      <p>The second approach is to use a recurrent
neural network neuron like an LSTM or a GRU with
size equal to the output size to parse the result and
output singular values for the final sequence. This
would probably lead to better results but is going
to be slower than the other approaches.</p>
      <p>The last approach is to use convolutional
layers with a large kernel size and stride (e.g. stride
equal to the size of the kernel). This would allow
the network to shrink the output size naturally,
and would be faster than using an LSTM. The
issue here is that in order to maintain the property
that the network can have any input size, pooling
or some other method of downsampling has to be
used, potentially throwing away useful data.
</p>
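      <p>The last approach can be sketched as follows (an illustrative NumPy fragment of my own, assuming a stride equal to the kernel size):</p>
      <preformat>
```python
import numpy as np

def strided_conv1d(x, w):
    # stride equal to the kernel size means non-overlapping windows, so
    # a length-n input shrinks to n // len(w) outputs
    k = w.shape[0]
    n_out = x.shape[0] // k
    return np.array([np.dot(x[i * k:(i + 1) * k], w) for i in range(n_out)])

print(strided_conv1d(np.arange(8.0), np.ones(4)))   # [ 6. 22.]
```
      </preformat>
      <p>Stacking such layers shrinks the time dimension by a constant factor per layer, but a final pooling step is still needed to reach an exactly fixed output size.</p>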
    </sec>
    <sec id="sec-15">
      <title>Potential Uses of Simplified and Local Attention</title>
      <p>While the original idea behind simplifying
attention in such a manner as presented in this paper
was to reduce computational cost and encourage
easier learning by enforcing a softmax
distribution of data, there didn’t seem to be any benefit in
doing so. In most cases the computational cost of
a couple of matrix multiplications versus an
element-wise product is negligible, so it would
usually be better to just apply normal attention in
those cases as it already covers the case of
simplified attention in its implementation.</p>
      <p>Similar to simplified attention, it doesn’t
necessarily make sense to use local attention instead of
normal attention for small input sizes. Instead, it
might make sense to switch out the linear
projection on the queries and keys in normal attention
with a convolutional projection but otherwise
perform the scaled-dot product attention normally.
This could be potentially useful if the problem
being approached needs to map patterns to values
instead of mapping values to values. One could of
course extend this even further by also performing
a convolutional projection on the values in order
to map local patterns to other local patterns, and
so on.</p>
      <p>On the other hand, the local attention suggested
in this paper could be useful in neural nets used
for images and other large data, where it might not
make sense to attend over the whole input. This is
especially true in the initial layers of such neural
networks where the neurons are only looking at a
small section of the input in the first place.
Beyond the smaller memory demands compared to
normal attention, local attention could be useful in
these layers because it provides a method to
naturally figure out which patterns are important at
these early layers.</p>
      <p>
        Of course an alternative to local attention is to
just take small patches of the image and apply the
original formulation of scaled-dot product
attention to get similar results. This idea was originally
suggested as future work in
        <xref ref-type="bibr" rid="ref11">Vaswani et al. (2017)</xref>
        .
      </p>
    </sec>
    <sec id="sec-17">
      <title>Conclusion</title>
      <p>In this work I present simplified and local
attention and test the methods in comparison to similar
models with normal attention and without any
kind of attention at all. I also introduced a new
strategy for classifying data with fully
convolutional networks with any sized input.</p>
      <p>The new model design was not without its own
flaws, as it showed poor results for all
modifications of the method. The poor results were
probably due to the final pooling layer throwing away
too much information. A better method would be
to use LSTMs or specially designed convolutions
in order to shrink the output to the correct size.</p>
      <p>Future work will include further explorations of
simplified and local attention to really get a grasp
of what tasks they are good at and where, if
anywhere, they show better efficiency or results than
normal attention. In the future I will also further
explore the new strategy for classification on any
sized input with fully convolutional models and see
what I can change and update in order to improve
the results of the model.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghemawat</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irving</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kudlur</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>2016</year>
          , November.
          <article-title>Tensorflow: a system for large-scale machine learning</article-title>
          .
          <source>In OSDI</source>
          (Vol.
          <volume>16</volume>
          , pp.
          <fpage>265</fpage>
          -
          <lpage>283</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrault</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lecun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Very deep convolutional networks for text classification</article-title>
          .
          <source>arXiv preprint arXiv:1606.01781</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <year>2015</year>
          . Keras.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          (pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          (pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Long</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shelhamer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>Fully convolutional networks for semantic segmentation</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          (pp.
          <fpage>3431</fpage>
          -
          <lpage>3440</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Ronzano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barbieri</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pamungkas</surname>
            ,
            <given-names>E.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Chiusaroli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 Italian Emoji Prediction (ITAMoji)</article-title>
          .
          <source>Proceedings of Fifth Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2018</year>
          ) &amp;
          <article-title>Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Behnke</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>2010</year>
          .
          <article-title>Evaluation of pooling operations in convolutional architectures for object recognition</article-title>
          .
          <source>In Artificial Neural Networks-ICANN</source>
          <year>2010</year>
          (pp.
          <fpage>92</fpage>
          -
          <lpage>101</lpage>
          ). Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Tompson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goroshin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bregler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>Efficient object localization using convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          (pp.
          <fpage>648</fpage>
          -
          <lpage>656</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Van Den Oord</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dieleman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senior</surname>
            ,
            <given-names>A.W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <year>2016</year>
          , September.
          <article-title>WaveNet: A generative model for raw audio</article-title>
          .
          <source>In SSW</source>
          (p.
          <fpage>125</fpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>Ł.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          (pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>Character-level convolutional networks for text classification</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          (pp.
          <fpage>649</fpage>
          -
          <lpage>657</lpage>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>