<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bidirectional Attentional LSTM for Aspect Based Sentiment Analysis on Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giancarlo Nicola</string-name>
          <email>giancarlo.nicola01@universitadipavia.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pavia</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>English. This paper describes the SentITA system that participated in the ABSITA task proposed at Evalita 2018. The system is based on a Bidirectional Long Short Term Memory network with attention that exploits word embeddings and sentiment-specific polarity embeddings. The model also leverages grammatical information from POS tagging and NER tagging. The system participated in both the Aspect Category Detection (ACD) and Aspect Category Polarity (ACP) tasks, achieving 5th place in the ACD task and 2nd place in the ACP task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>Italian. This paper describes the
SentITA system evaluated in the ABSITA task
proposed within Evalita 2018. The
system is based on a recurrent neural
network with Long Short Term Memory
cells and an attention mechanism. The model
exploits both general word embeddings and
polarity embeddings specific to
sentiment analysis, and it also makes use of the
information derived from POS tagging and
NER tagging. The system
participated in both the Aspect Category
Detection (ACD) and the Aspect
Category Polarity (ACP) subtasks, ranking
fifth in the former and second
in the latter.</p>
    </sec>
    <sec id="sec-2">
      <title>1 Introduction</title>
      <p>
        This paper describes the SentITA system that
participated in the ABSITA task
        <xref ref-type="bibr" rid="ref2">(Basile et al. 2018)</xref>
        proposed in Evalita 2018. The ABSITA task
consists of performing Aspect Based Sentiment
Analysis (ABSA) on self-contained sentences scraped
from the booking.com website. The aspects
relate to accommodation reviews and cover
topics such as cleanliness, comfort, location,
etc. The task is divided into two subtasks:
Aspect Category Detection (ACD) and Aspect
Category Polarity (ACP). The first, ACD, consists of
identifying the aspects mentioned in the sentence,
while the second requires associating a
sentiment polarity label with the aspects evoked in the
sentence. Both subtasks are addressed with the
same architecture and the same data
preprocessing. The system is based on a deep learning model,
a Bidirectional Long Short Term Memory
network with attention. The model exploits word
embeddings and sentiment-specific polarity embeddings,
and it also leverages grammatical information
from POS tagging and NER tagging.
      </p>
      <p>
        Recently, deep learning has emerged as a
powerful machine learning technique achieving
state-of-the-art results in many application domains,
including sentiment analysis. Among the deep
learning frameworks applied to sentiment
analysis, many employ a combination of semantic
vector representations
        <xref ref-type="bibr" rid="ref18">(Mikolov et al. 2013)</xref>
        ,
        <xref ref-type="bibr" rid="ref22">(Pennington et al. 2014)</xref>
        and different deep learning
architectures. Long Short-Term Memory (LSTM)
networks
        <xref ref-type="bibr" rid="ref13">(Hochreiter and Schmidhuber 1997)</xref>
        ,
        <xref ref-type="bibr" rid="ref27">(Socher et al. 2013)</xref>
        ,
        <xref ref-type="bibr" rid="ref6">(Cho et al. 2014)</xref>
        have
been applied to model complex and long term
non-local relationships in both word level and
character level text sequences. Recursive
Neural Tensor Networks (RNTN) have shown great
results for semantic compositionality
        <xref ref-type="bibr" rid="ref26">(Socher et
al. 2011)</xref>
        ,
        <xref ref-type="bibr" rid="ref27">(Socher et al. 2013)</xref>
        and also
convolutional networks (CNN) for both sentiment
analysis
        <xref ref-type="bibr" rid="ref8">(Collobert et al 2011)</xref>
        and sentence modelling
        <xref ref-type="bibr" rid="ref14">(Kalchbrenner et al. 2014)</xref>
        have performed better
than previous state-of-the-art methodologies. In
most applications, all these methods receive
as input a vector representation of words called
word embeddings.
        <xref ref-type="bibr" rid="ref17">(Mikolov 2012)</xref>
        ,
        <xref ref-type="bibr" rid="ref18">(Mikolov et
al. 2013)</xref>
and
        <xref ref-type="bibr" rid="ref22">(Pennington et al. 2014)</xref>
        , further
expanding the work on word embeddings
        <xref ref-type="bibr" rid="ref3">(Bengio et al. 2003)</xref>
        , which is grounded in the idea of
distributed representations for symbols
        <xref ref-type="bibr" rid="ref12">(Hinton et
al 1986)</xref>
        , introduced unsupervised learning
methods to create dense multidimensional spaces
where words are represented by vectors. The
position of such vectors is related to their semantic
meaning and grammatical properties and they are
widely used in all modern NLP. In fact, they allow
for a dimensionality reduction compared to
traditional sparse vectors space models and they are
often used as pre-trained initialization for the first
embedding layers of the neural networks in NLP
tasks.
        <xref ref-type="bibr" rid="ref16">(Le and Mikolov 2014)</xref>
        , expanding the
previous work on word embeddings, developed
a model capable of representing sentences as well in
a dense multidimensional space. In this case too,
sentences are represented by vectors whose
position is related to the semantic content of the
sentence with similar sentences represented by
vectors that are close to each other.
      </p>
      <p>
        When working with isolated and short
sentences, often with a specific writing style, like
tweets or phrases extracted from internet reviews,
many long-term text dependencies are lost and
not exploitable. In this situation it is important
that the model learns both to pay attention to
specific words that play key roles in determining the
sentence polarity, like negations, magnifiers and
adjectives, and to model the discourse, but with less
focus on long-term dependencies (due to the text
brevity). For this reason, deep learning word
embedding based models augmented with task
specific gazettes (dictionaries) and features
represent a solid baseline when working with these
kinds of datasets
        <xref ref-type="bibr" rid="ref20">(Nakov et al. 2016)</xref>
        <xref ref-type="bibr" rid="ref1">(Attardi et
al. 2016)</xref>
        <xref ref-type="bibr" rid="ref5">(Castellucci et al. 2016)</xref>
        <xref ref-type="bibr" rid="ref7">(Cimino et al.
2016)</xref>
        <xref ref-type="bibr" rid="ref9">(Deriu et al. 2016)</xref>
        . In this system, a polarity
dictionary for Italian has been included as a feature
of the model in addition to the word embeddings.
Moreover, every sentence during preprocessing is
augmented with its NER tags and POS tags, which
are then used as features in the model. Thanks
to the inclusion of these features, relevant to the
considered task, in combination with word
embeddings and an attentional bidirectional LSTM
recurrent neural network, the model achieves useful
results with a few thousand labelled examples.
      </p>
      <p>The remainder of the paper presents the model
and the experiments on the ABSITA task. In
Section 2 the model and its features are explained; in
Section 3 the model training and its performance
are discussed; in Section 4 conclusions and the
next improvements of the model are given.</p>
    </sec>
    <sec id="sec-3">
      <title>2 Description of the system</title>
      <p>The model implemented is an Attentional
Bidirectional Recurrent Neural Network with LSTM
cells. It operates at word level and therefore each
sentence is represented as a sequence of word
representations that are sequentially fed to the
model one after another until the sequence has
been entirely used up. One sentence sequence
coupled with its polarity scores represents a single
datapoint for the model.</p>
      <p>
        The inputs to the model are sentences,
represented as sequences of word representations. The
maximum sequence length has been set to 35,
with shorter sentences left-padded to this length
and longer sentences cut to this length. Each
word of the sequence is represented by five
vectors corresponding to five different features:
high dimensional word embedding, word
polarity, word NER tag, word POS tag and custom low
dimensional word embedding. The high
dimensional word embeddings are the pretrained
fastText embeddings for Italian
        <xref ref-type="bibr" rid="ref11">(Grave et al. 2018)</xref>
        .
They are 300-dimensional vectors obtained using
the skip-gram model described in
        <xref ref-type="bibr" rid="ref4">(Bojanowski et
al. 2016)</xref>
        with default parameters. The word
polarity is obtained from the OpeNER
Sentiment Lexicon Italian
        <xref ref-type="bibr" rid="ref24">(Russo et al. 2016)</xref>
        . This
freely available Italian sentiment lexicon
contains a total of 24,293 lexical entries annotated
for positive/negative/neutral polarity. It was
semi-automatically developed using a propagation
algorithm starting from a list of seed key-words and
manually reviewing the most frequent entries.
      </p>
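      <p>As an illustration of the input encoding described above, the following sketch (not the authors' code; the helper names, the index dictionaries and the lexicon format are assumptions) maps a tagged sentence to the five parallel feature streams, truncated and left-padded to 35 timesteps.</p>
      <preformat>
# Hypothetical sketch of the per-word feature extraction.
# Assumed formats: polarity_lexicon maps a word to a (polarity, confidence)
# pair; the *_index dicts map tokens or tags to integer ids, with 0
# reserved for padding/unknown and "OOV" for out-of-vocabulary words.
import numpy as np

MAX_LEN = 35

def encode_sentence(tokens, ft_index, polarity_lexicon,
                    pos_index, ner_index, custom_index):
    tokens = tokens[:MAX_LEN]              # cut longer sentences
    pad = MAX_LEN - len(tokens)            # amount of left-padding
    word_ids, custom_ids, pos_ids, ner_ids, polarity = [], [], [], [], []
    for tok in tokens:
        word = tok.text.lower()
        word_ids.append(ft_index.get(word, ft_index["OOV"]))
        custom_ids.append(custom_index.get(word, 0))
        pos_ids.append(pos_index.get(tok.pos_, 0))
        ner_ids.append(ner_index.get(tok.ent_type_, 0))
        # (polarity, confidence) from the OpeNER lexicon, zeros if absent
        polarity.append(polarity_lexicon.get(word, (0.0, 0.0)))
    zeros = [0] * pad
    return (np.array(zeros + word_ids),
            np.array(zeros + custom_ids),
            np.array(zeros + pos_ids),
            np.array(zeros + ner_ids),
            np.array([(0.0, 0.0)] * pad + polarity, dtype="float32"))
      </preformat>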
      <p>Both the NER tags and the POS tags are obtained
from the spaCy library tagger model for Italian
(spaCy 2.0.11, https://spacy.io/). The custom low
dimensional word embeddings are generated by
random initialization and are inserted to provide
an embedding representation of the words that are
missing from the fastText embeddings, which
would otherwise all be represented by the same out
of vocabulary (OOV) token. In general,
it would be possible to train and fine-tune these
custom embeddings on specific datasets to let the
model learn the usage of words in specific cases.</p>
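      <p>A minimal sketch of how the spaCy tagger could supply these two features, assuming the Italian model that ships with spaCy 2.x:</p>
      <preformat>
# POS and NER tags from the spaCy Italian model (model name assumed
# to be the it_core_news_sm package of spaCy 2.x).
import spacy

nlp = spacy.load("it_core_news_sm")

doc = nlp("La camera era pulita ma la posizione non era comoda.")
for token in doc:
    # token.pos_ is the coarse POS tag; token.ent_type_ is the NER label
    # (an empty string when the token is not part of a named entity)
    print(token.text, token.pos_, token.ent_type_ or "-")
      </preformat>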
      <p>Figure 1: Model architecture</p>
      <p>The information extracted from the OpeNER
Sentiment Lexicon Italian is the word polarity with
its confidence; these are concatenated into a
vector of length 2 that is one of the inputs to the first
layer of the network. The NER tags and POS tags,
instead, are mapped to randomly initialized
embeddings of dimensionality 2 and 4 respectively,
which are not trained during model training for
the task submission. With more data available it
would probably be beneficial to train all the NER,
POS and custom embeddings, but for this specific
dataset the results were comparable and slightly
better when not training the embeddings.</p>
      <p>
        The model, whose architecture is schematized
in fig. 1, performs in its initial layer a
dimensionality reduction on the fastText embeddings and then
concatenates them with the rest of the embeddings
(polarity, NER tag, POS tag and custom word
embeddings) for each timestep (word) of the
sequence. The concatenation of the embeddings is
fed into a sequence of two bidirectional recurrent
layers with LSTM cells. The result of these
recurrent layers is passed to the attention mechanism
presented in
        <xref ref-type="bibr" rid="ref23">(Raffel et al. 2016)</xref>
        and finally to
the dense layers that output the aspect detection
and aspect polarity signals. The attention
mechanism in this formulation produces a fixed-length
embedding of the input sequence by computing
an adaptive weighted average of the sequence of
states (normally denoted as "h") of the RNN. This
form of integration is similar to the "global
temporal pooling" described in
        <xref ref-type="bibr" rid="ref25">(Dieleman 2014)</xref>
        , which
is based on the "global average pooling"
technique of
        <xref ref-type="bibr" rid="ref19">(Lin et al. 2014)</xref>
        . The non-linear
activations used in the model are Rectified Linear
Units (ReLU) for the internal dense layers,
hyperbolic tangent (tanh) in the recurrent layers and
sigmoid in the output dense layer. In order to
counter overfitting, dropout has been used: with
rate 0.5 after the fastText embedding dimensionality
reduction, with rate 0.5 between timesteps in both
recurrent layers, and with rate 0.3 on the
output of the recurrent layers.
      </p>
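      <p>The following is a minimal sketch of this architecture in standalone Keras 2.x; the framework, the layer sizes marked ASSUMED, the eight aspect categories and the positive/negative output encoding are assumptions, not details stated in the paper.</p>
      <preformat>
# Hypothetical Keras 2.x sketch of the architecture in Figure 1.
import numpy as np
import keras.backend as K
from keras.models import Model
from keras.layers import (Input, Embedding, Dense, Bidirectional, LSTM,
                          Dropout, Concatenate, Flatten, Activation,
                          Permute, RepeatVector, Multiply, Lambda)

MAX_LEN, N_ASPECTS, RNN_UNITS = 35, 8, 64          # sizes ASSUMED
ft_matrix = np.zeros((150001, 300))                # placeholder fastText
custom_vocab, n_pos, n_ner = 20000, 20, 10         # sizes ASSUMED

words  = Input((MAX_LEN,))                         # fastText vocabulary ids
custom = Input((MAX_LEN,))                         # custom-embedding ids
pos    = Input((MAX_LEN,))                         # POS tag ids
ner    = Input((MAX_LEN,))                         # NER tag ids
polar  = Input((MAX_LEN, 2))                       # (polarity, confidence)

# frozen pretrained fastText embeddings, then dimensionality reduction
w = Embedding(150001, 300, weights=[ft_matrix], trainable=False)(words)
w = Dropout(0.5)(Dense(64, activation="relu")(w))  # reduced size ASSUMED
c = Embedding(custom_vocab, 16)(custom)            # custom size ASSUMED
p = Embedding(n_pos, 4, trainable=False)(pos)      # POS embeddings, dim 4
n = Embedding(n_ner, 2, trainable=False)(ner)      # NER embeddings, dim 2

x = Concatenate()([w, c, p, n, polar])
x = Bidirectional(LSTM(RNN_UNITS, return_sequences=True,
                       dropout=0.5, recurrent_dropout=0.5))(x)
x = Bidirectional(LSTM(RNN_UNITS, return_sequences=True,
                       dropout=0.5, recurrent_dropout=0.5))(x)
x = Dropout(0.3)(x)

# attention of (Raffel et al. 2016): adaptive weighted average of states h
e = Dense(1, activation="tanh")(x)                 # e_t = tanh(W h_t + b)
a = Activation("softmax")(Flatten()(e))            # alpha = softmax(e)
a = Permute((2, 1))(RepeatVector(2 * RNN_UNITS)(a))
s = Lambda(lambda t: K.sum(t, axis=1))(Multiply()([x, a]))

h = Dense(32, activation="relu")(s)                # internal dense, ASSUMED
acd = Dense(N_ASPECTS, activation="sigmoid", name="acd")(h)
acp = Dense(2 * N_ASPECTS, activation="sigmoid", name="acp")(h)  # pos/neg
model = Model([words, custom, pos, ner, polar], [acd, acp])
      </preformat>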
      <p>The model has 61,368 trainable parameters and
a total of 45,233,366 parameters, the majority of
which represent the fastText embedding matrix
(45,000,300). Compared to many NLP models
used today, the number of trainable parameters is
quite small, to reduce the possibility of
overfitting the training dataset (6,337 examples is small
compared to many English sentiment datasets) and
also because it is compensated by the addition of
engineered features, like the polarity dictionary, NER
tags and POS tags, that help in classifying the examples.</p>
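      <p>The embedding figure can be checked directly: the 150,000-word vocabulary plus one OOV token, each mapped to a 300-dimensional vector, gives exactly the reported count.</p>
      <preformat>
# Quick arithmetic check of the reported parameter counts.
vocab_with_oov = 150000 + 1
print(vocab_with_oov * 300)          # 45000300, the fastText matrix
print(45233366 - 61368 - 45000300)   # 171698 other frozen parameters
                                     # (NER, POS and custom embeddings)
      </preformat>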
    </sec>
    <sec id="sec-4">
      <title>3 Training and results</title>
      <p>The only preprocessing applied to the text is the
conversion of each character to its lower case
form. Then, the vocabulary of the model is
limited to the first 150,000 words of the fastText
embeddings through a cap on the maximum number of
embeddings, due to memory constraints of the GPU
used for training the model. The fastText
embeddings are sorted by descending frequency of
appearance in their training corpus, thus the
vocabulary comprises the 150,000 most frequent words
in Italian. The words that fall outside of
this cut are represented in the model's high
dimensional embeddings (the fastText embeddings) by an out
of vocabulary token. However, all the training set
words are nevertheless included in the custom low
dimensional word embeddings; this is done since
both our training text and user text in general
(especially when working with reviews, tweets and
social network platforms) can be quite different
from the text on which the fastText embeddings are
trained. In addition, the NER-tagging and
POS-tagging models for Italian included in the spaCy
library are applied to the text to compute the
additional NER-tag and POS-tag features for each
word.</p>
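      <p>A sketch of the vocabulary cap, under the assumption that the embeddings are read from a standard fastText .vec text file, whose entries are already sorted by descending corpus frequency:</p>
      <preformat>
# Load the 150,000 most frequent fastText vectors (assumed .vec format:
# a header line, then one "word v1 v2 ... v300" line per word).
import numpy as np

def load_capped_fasttext(path, cap=150000, dim=300):
    index = {"OOV": 0}                   # id 0 reserved for OOV/padding
    vectors = [np.zeros(dim)]
    with open(path, encoding="utf-8") as f:
        next(f)                          # skip the "count dim" header
        for i, line in enumerate(f):
            if i == cap:
                break
            parts = line.rstrip().split(" ")
            index[parts[0]] = len(vectors)
            vectors.append(np.asarray(parts[1:], dtype="float32"))
    return index, np.stack(vectors)
      </preformat>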
      <p>To train the model and generate the challenge
submission, a k-fold cross validation strategy has
been applied. The dataset has been divided into
5 folds and 5 different instantiations of the same
model (with the same architecture) have been
trained, picking each time a different fold as
validation set (20%) and the remaining 4 folds as
training set (80%). The number of training epochs
is defined with the early stopping technique, with
the patience parameter equal to 7. Once the
training epochs are completed, the model snapshot that
achieved the best validation loss is loaded. In the
end, the predictions from the 5 models have been
averaged together and thresholded at 0.5. The
training of five different instantiations of the same
model and the averaging of their predictions
mitigates the fact that in each fold the model
selection based on the best validation loss is biased
towards the validation fold itself.</p>
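      <p>A sketch of this training scheme follows; build_model, X, y and X_test are hypothetical stand-ins (the real model takes five input arrays, collapsed to one here for brevity).</p>
      <preformat>
# 5-fold training with early stopping (patience 7), best-snapshot
# reloading, and averaging of the five models' test predictions.
import numpy as np
from sklearn.model_selection import KFold
from keras.callbacks import EarlyStopping, ModelCheckpoint

test_preds = []
for k, (tr, va) in enumerate(KFold(n_splits=5, shuffle=True).split(X)):
    model = build_model()    # fresh instance of the architecture above
    callbacks = [
        EarlyStopping(monitor="val_loss", patience=7),
        ModelCheckpoint("fold%d.h5" % k, save_best_only=True),
    ]
    model.fit(X[tr], y[tr], validation_data=(X[va], y[va]),
              epochs=100, callbacks=callbacks)
    model.load_weights("fold%d.h5" % k)  # best validation-loss snapshot
    test_preds.append(model.predict(X_test))

# average the five models and binarize at 0.5
final = (np.mean(test_preds, axis=0) > 0.5).astype(int)
      </preformat>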
      <p>
        Each of the five models is trained by minimizing
the cross-entropy loss on the different classes with
the Nesterov Adam (Nadam) optimizer
        <xref ref-type="bibr" rid="ref10">(Dozat
2015)</xref>
        with default parameters (η = 0.002, β1 =
0.9, β2 = 0.999, schedule decay = 0.004). The
Nesterov Adam optimizer is similar to the Adam
optimizer
        <xref ref-type="bibr" rid="ref15">(Kingma and Ba 2014)</xref>
        but where
momentum is replaced with Nesterov momentum
        <xref ref-type="bibr" rid="ref21">(Nesterov 1983)</xref>
        . Adam, in fact, combines two
algorithms known to work well for different reasons:
momentum, which points the model in a better
direction, and RMSProp, which adapts how far the
model goes in that direction on a per-parameter
basis. However, Nesterov momentum, which can be
viewed as a simple modification of the former,
increases stability and can sometimes provide a
distinct improvement in performance, superior to
momentum
        <xref ref-type="bibr" rid="ref28">(Sutskever et al. 2013)</xref>
        . For this reason
the two approaches are combined in the Nadam
optimizer.
      </p>
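      <p>With the standalone Keras 2.x signature (an assumed framework), the quoted defaults correspond to the following compilation step; binary crossentropy matches the sigmoid multi-label outputs described above.</p>
      <preformat>
# Nadam with the default parameters quoted in the text; `model` is the
# architecture from the earlier sketch.
from keras.optimizers import Nadam

optimizer = Nadam(lr=0.002, beta_1=0.9, beta_2=0.999,
                  schedule_decay=0.004)
model.compile(optimizer=optimizer, loss="binary_crossentropy")
      </preformat>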
      <p>This system obtained the 5th place in the ACD
task and the 2nd place in the ACP task, as reported
respectively in Table 1 and Table 2. In these tables
the performances of the systems participating in
the challenge have been ranked by F1-score by
the task organizers. The second place in the ACP task
is particularly interesting, since the model is
oriented more towards polarity classification, for
which it has specific dictionaries, than towards
aspect detection. This is also confirmed by the
high precision score obtained by the model in
the ACP task, the 2nd highest among the
participating systems.</p>
    </sec>
    <sec id="sec-5">
      <title>4 Discussion</title>
      <p>[Table fragment: a column of micro-precision scores
of the participating systems; the rest of Tables 1 and 2
did not survive extraction.]</p>
      <p>The results obtained by the SentITA system at
ABSITA 2018 are promising, as the system placed
2nd in the ACP task and 5th in the ACD task, not
very far from the 1st in terms of F1-score. The
model in general shows a high precision but a
lower recall compared to the other
systems. The proposed architecture makes use of
different features that are easy to obtain through
other models, like POS and NER tags, polarity and
word embeddings; for this reason, the human
effort in the data preprocessing is very limited. One
important direction to further improve the model
would be to rely more on unsupervised learning,
which at the moment is used only for the word
embeddings. It could be possible, for example, to
integrate into the model features based on language
models or encoder-decoder networks. More
unsupervised learning would better ensure that the
model generalizes to cover most of the topical and
lexical content of the Italian language, thanks to the
large quantity of text available, thus also
improving the model recall.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Attardi</surname>
          </string-name>
          , Daniele Sartiano, Chiara Alzetta,
          <string-name>
            <given-names>Federica</given-names>
            <surname>Semplici</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Convolutional Neural Networks for Sentiment Analysis on Italian Tweets</article-title>
          . CLiC-it/EVALITA (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Pierpaolo</given-names>
            <surname>Basile</surname>
          </string-name>
          and
          <string-name>
            <given-names>Valerio</given-names>
            <surname>Basile</surname>
          </string-name>
          and
          <string-name>
            <given-names>Danilo</given-names>
            <surname>Croce</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Polignano</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Overview of the EVALITA 2018 Aspect-based Sentiment Analysis task (ABSITA)</article-title>
          .
          <source>Proceedings of the 6th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA'18)</source>
          , CEUR.org, Turin.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ducharme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Janvin</surname>
          </string-name>
          (
          <year>2003</year>
          )
          <article-title>A neural probabilistic language model</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>1137</fpage>
          -
          <lpage>1155</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          . arXiv:1607.04606v2.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Castellucci</surname>
          </string-name>
          , Danilo Croce,
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Basili</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Context-aware Convolutional Neural Networks for Twitter Sentiment Analysis in Italian</article-title>
          . CLiC-it/EVALITA (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. van Merrienboer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          , and
          <string-name>
            <surname>Y. Bengio.</surname>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>In EMNLP</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Cimino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Felice</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Tandem LSTM-SVM Approach for Sentiment Analysis</article-title>
          . CLiC-it/EVALITA (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karlen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <article-title>Natural Language Processing (Almost) from Scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Jan</given-names>
            <surname>Deriu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Sentiment Detection using Convolutional Neural Networks with Multi-Task Training and Distant Supervision</article-title>
          . CLiC-it/EVALITA (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Dozat</surname>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>Incorporating Nesterov Momentum into Adam</article-title>
          . http://cs229.stanford.edu/proj2015/054_report.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>Learning Word Vectors for 157 Languages</article-title>
          .
          <source>Proceedings of the International Conference on Language Resources and Evaluation (LREC</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>McClelland</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Rumelhart</surname>
          </string-name>
          (
          <year>1986</year>
          )
          <article-title>Distributed representations</article-title>
          . In Rumelhart, D. E. and
          <string-name>
            <surname>McClelland</surname>
            ,
            <given-names>J. L</given-names>
          </string-name>
          ., editors,
          <source>Parallel Distributed Processing: Explorations in the Microstructure of Cognition</source>
          .
          <year>1986</year>
          . Volume
          <volume>1</volume>
          : Foundations, MIT Press, Cambridge, MA. pp
          <fpage>77</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Long Short-Term Memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          ,
          <year>1997</year>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          , E. Grefenstette,
          <string-name>
            <surname>P. Blunsom.</surname>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>A Convolutional Neural Network for Modelling Sentences</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Diederik</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          . (
          <year>2014</year>
          ).
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          . International Conference on Learning Representations. https://arxiv.org/pdf/1412.6980.pdf
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          .
          <source>Proceedings of the 31st International Conference on Machine Learning</source>
          , Beijing, China,
          <year>2014</year>
          . JMLR: W&amp;CP, volume
          <volume>32</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          . (
          <year>2012</year>
          )
          <article-title>Statistical Language Models Based on Neural Networks</article-title>
          .
          <source>PhD thesis</source>
          , Brno University of Technology,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, and
          <string-name>
            <surname>J. Dean.</surname>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>In Proceedings of Workshop at International Conference on Learning Representations</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Min</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shuicheng</given-names>
            <surname>Yan</surname>
          </string-name>
          .
          <article-title>Network in network</article-title>
          .
          <source>arXiv preprint arXiv:1312.4400</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Preslav</given-names>
            <surname>Nakov</surname>
          </string-name>
          , Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani,
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>SemEval-2016 Task 4: Sentiment Analysis in Twitter</article-title>
          .
          <source>Proceedings of SemEval-2016</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          , San Diego, California, June 16-17,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nesterov</surname>
          </string-name>
          (
          <year>1983</year>
          )
          <article-title>A method of solving a convex programming problem with convergence rate O(1/k^2)</article-title>
          .
          <source>In Soviet Mathematics Doklady</source>
          , volume
          <volume>27</volume>
          , pages
          <fpage>372</fpage>
          -
          <lpage>376</lpage>
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <surname>C. Manning.</surname>
          </string-name>
          (
          <year>2014</year>
          )
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          , Doha, Qatar, October. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Colin</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daniel P. W.</given-names>
            <surname>Ellis</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Feed-Forward Networks with Attention Can Solve Some Long-Term Memory Problems</article-title>
          . https://arxiv.org/abs/1512.08756
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Irene</given-names>
            <surname>Russo</surname>
          </string-name>
          ;
          <string-name>
            <given-names>Francesca</given-names>
            <surname>Frontini</surname>
          </string-name>
          and
          <string-name>
            <given-names>Valeria</given-names>
            <surname>Quochi</surname>
          </string-name>
          ,
          <year>2016</year>
          ,
          <article-title>OpeNER Sentiment Lexicon Italian - LMF</article-title>
          , ILC-CNR for CLARIN-IT repository hosted at Institute for Computational Linguistics "A. Zampolli", National Research Council, Pisa, http://hdl.handle.net/20.500.11752/ILC-73.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Sander</given-names>
            <surname>Dieleman</surname>
          </string-name>
          .
          <article-title>Recommending music on Spotify with deep learning</article-title>
          . http://benanne.github.io/2014/08/05/spotify-cnns.html,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <surname>Christopher D. Manning.</surname>
          </string-name>
          (
          <year>2011</year>
          )
          <article-title>Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions</article-title>
          .
          <source>In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP).</source>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perelygin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <surname>Christopher Potts.</surname>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>Recursive deep models for semantic compositionality over a sentiment treebank</article-title>
          .
          <source>In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1631</fpage>
          -
          <lpage>1642</lpage>
          , Stroudsburg, PA, October. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , James Martens, George Dahl, Geoffrey Hinton (2013)
          <article-title>On the importance of initialization and momentum in deep learning</article-title>
          .
          <source>Proceedings of the 30th International Conference on Machine Learning</source>
          , PMLR
          <volume>28</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1139</fpage>
          -
          <lpage>1147</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>