<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sentiment Analysis using Convolutional Neural Networks with Multi-Task Training and Distant Supervision on Italian Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Deriu</string-name>
          <email>deri@zhaw.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Cieliebak</string-name>
          <email>ciel@zhaw.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Zurich University of Applied Sciences</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper, we propose a classifier for predicting sentiments of Italian Twitter messages. This work builds upon a deep learning approach where we leverage large amounts of weakly labelled data to train a 2-layer convolutional neural network. To train our network we apply a form of multi-task training. Our system participated in the EvalItalia-2016 competition and outperformed all other approaches on the sentiment analysis task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Sentiment analysis is a fundamental problem aiming to give a machine the ability to understand the emotions and opinions expressed in a written text. This is an extremely challenging task due to the complexity and variety of human language. The sentiment polarity classification task of EvalItalia-2016 (sentipolc) consists of three subtasks which cover different aspects of sentiment detection: T1: subjectivity detection: is the tweet subjective or objective? T2: polarity detection: is the sentiment of the tweet neutral, positive, negative or mixed? T3: irony detection: is the tweet ironic? The classic approaches to sentiment analysis usually consist of manual feature engineering and applying some sort of classifier to these features
        <xref ref-type="bibr" rid="ref6">(Liu, 2015)</xref>
        . Deep neural networks have shown great promise at capturing salient features for these complex tasks
        <xref ref-type="bibr" rid="ref10 ref11 ref3 ref7 ref8">(Mikolov et al., 2013b; Severyn and Moschitti, 2015a)</xref>
        . Particularly successful for sentiment classification were Convolutional Neural Networks (CNNs)
        <xref ref-type="bibr" rid="ref10 ref10 ref11 ref11 ref3 ref3 ref4 ref5">(Kim, 2014; Kalchbrenner et al., 2014; Severyn and Moschitti, 2015b; Johnson and Zhang, 2015)</xref>
        , on which our work builds. These networks typically have a large number of parameters and are especially effective when trained on large amounts of data.
      </p>
      <p>In this work, we use a distant supervision approach to leverage large amounts of data in order to train a 2-layer CNN (we refer to a layer as the combination of one convolutional and one pooling layer). More specifically, we train the network using the following three-phase procedure: P1: creation of word embeddings for the initialization of the first layer, based on an unsupervised corpus of 300M Italian tweets; P2: a distant-supervised phase, where the network is pre-trained on a weakly labelled dataset of 40M tweets so that the network weights and word embeddings capture aspects related to sentiment; and P3: a supervised phase, where the network is trained on the provided supervised training data consisting of 7410 manually labelled tweets.</p>
      <p>
        As the three tasks of EvalItalia-2016 are closely related, we apply a form of multi-task training as proposed by
        <xref ref-type="bibr" rid="ref2">(Collobert et al., 2011)</xref>
        , i.e. we train one CNN for all tasks simultaneously. This has two advantages: i) we need to train only one model instead of three, and ii) the CNN has access to more information, which benefits the score. The experiments indicate that the multi-task CNN performs better than the single-task CNN. After a small bugfix in the data preprocessing, our system outperforms all other systems in the sentiment polarity task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Convolutional Neural Networks</title>
      <p>We train a 2-layer CNN using 9-fold cross-validation and combine the outputs of the 9 resulting classifiers to increase robustness. The 9 classifiers differ in the data used for the supervised phase, since cross-validation creates 9 different training and validation sets.</p>
      <p>The architecture of the CNN is shown in
Figure 1 and described in detail below.</p>
      <p>Sentence model. Each word in the input data is associated with a vector representation, which consists of a d-dimensional vector. A sentence (or tweet) is represented by the concatenation of the representations of its n constituent words. This yields a matrix S ∈ R^{d×n}, which is used as input to the convolutional neural network.</p>
      <p>The first layer of the network consists of a lookup table where the word embeddings are represented as a matrix X ∈ R^{d×|V|}, where V is the vocabulary. Thus the i-th column of X represents the i-th word in the vocabulary V.</p>
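<p>As an illustration of the sentence model, the lookup step can be sketched in a few lines of numpy (toy dimensions and vocabulary of our choosing; the paper uses d = 52):</p>

```python
import numpy as np

d = 4                                          # toy embedding dimension (the paper uses d = 52)
vocab = {"che": 0, "bella": 1, "giornata": 2}  # toy vocabulary V
X = np.random.randn(d, len(vocab))             # embedding matrix X of shape d x |V|

tweet = [0, 2, 1]                              # a tweet as a list of vocabulary indices

# The sentence matrix S (shape d x n) is the column-wise concatenation
# of the embeddings of the n constituent words.
S = X[:, tweet]
```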
      <p>Convolutional layer. In this layer, a set of m filters is applied to a sliding window of length h over each sentence. Let S[i:i+h] denote the concatenation of word vectors s_i to s_{i+h}. A feature c_i is generated for a given filter F by:

c_i := Σ_{k,j} (S[i:i+h])_{k,j} · F_{k,j}    (1)

A concatenation of all such features in a sentence produces a feature vector c ∈ R^{n−h+1}. The vectors c are then aggregated over all m filters into a feature map matrix C ∈ R^{m×(n−h+1)}. The filters are learned during the training phase of the neural network, using a procedure detailed in the next section.</p>
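<p>A minimal numpy sketch of Eq. (1) for a single filter (random values stand in for the learned parameters):</p>

```python
import numpy as np

d, n, h = 4, 10, 5            # embedding dim, sentence length, window length
S = np.random.randn(d, n)     # sentence matrix from the lookup table
F = np.random.randn(d, h)     # one filter F of shape d x h

# Eq. (1): c_i is the sum of the element-wise product of the filter
# with the window S[i:i+h] of h consecutive word vectors.
c = np.array([np.sum(S[:, i:i + h] * F) for i in range(n - h + 1)])
```

Repeating this for all m filters and stacking the resulting vectors row-wise yields the feature map matrix C.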
      <p>Max pooling. The output of the convolutional layer is passed through a non-linear activation function before entering a pooling layer. The latter aggregates vector elements by taking the maximum over a fixed set of non-overlapping intervals. The resulting pooled feature map matrix has the form C_pooled ∈ R^{m×((n−h+1)/s)}, where s is the length of each interval. In the case of overlapping intervals with a stride value s_t, the pooled feature map matrix has the form C_pooled ∈ R^{m×((n−h+1−s)/s_t)}. Depending on whether the borders are included or not, the result of the fraction is rounded up or down, respectively.</p>
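<p>The overlapping pooling step might be sketched as follows, under the convention that windows are only taken where they fit entirely inside the feature vector (the border handling is a choice, as noted above):</p>

```python
import numpy as np

m, L = 3, 6                    # number of filters, feature length n - h + 1
s, st = 3, 2                   # pooling interval length and stride
C = np.random.randn(m, L)      # feature map matrix

# Overlapping max pooling: maximum over windows of length s, moving by st.
# With borders excluded this yields floor((L - s) / st) + 1 output columns.
cols = [C[:, i:i + s].max(axis=1) for i in range(0, L - s + 1, st)]
C_pooled = np.stack(cols, axis=1)
```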
      <p>
        Hidden layer. A fully connected hidden layer computes the transformation f(W·x + b), where W ∈ R^{m×m} is the weight matrix, b ∈ R^m the bias, and f the rectified linear (ReLU) activation function
        <xref ref-type="bibr" rid="ref9">(Nair and Hinton, 2010)</xref>
        . The output vector of this layer, x ∈ R^m, corresponds to the sentence embedding of each tweet.
      </p>
      <p>Softmax. Finally, the outputs of the hidden layer x ∈ R^m are fully connected to a soft-max regression layer, which returns the class ŷ ∈ [1, K] with the largest probability,

ŷ := arg max_j  e^{x^T w_j + a_j} / Σ_{k=1}^{K} e^{x^T w_k + a_k},    (2)

where w_j denotes the weight vector of class j and a_j the bias of class j.</p>
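<p>Eq. (2) corresponds to the following numpy sketch (random stand-ins for the learned weights):</p>

```python
import numpy as np

m, K = 5, 9                    # hidden size, number of classes
x = np.random.randn(m)         # sentence embedding from the hidden layer
W = np.random.randn(K, m)      # row j is the weight vector w_j of class j
a = np.random.randn(K)         # a_j is the bias of class j

logits = W @ x + a
p = np.exp(logits) / np.exp(logits).sum()  # soft-max probabilities
y_hat = int(np.argmax(p))                  # predicted class, Eq. (2)
```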
      <p>Network parameters. Training the neural network consists in learning the set of parameters θ = {X, F_1, b_1, F_2, b_2, W, a}, where X is the embedding matrix, with each column containing the d-dimensional embedding vector for a specific word; F_i, b_i (i ∈ {1, 2}) the filter weights and biases of the first and second convolutional layers; W the concatenation of the weight vectors w_j for every output class in the soft-max layer; and a the bias of the soft-max layer.</p>
      <p>Hyperparameters. For both convolutional layers we set the length of the sliding window h to 5 and the number of filters m to 200. The size of the pooling interval s is set to 3 in both layers, and we use a stride of 2 in the first layer.</p>
      <p>
        Dropout. Dropout is an alternative technique used to reduce overfitting
        <xref ref-type="bibr" rid="ref12">(Srivastava et al., 2014)</xref>
        . In each training step, individual nodes are dropped with probability p, the reduced neural net is updated, and the dropped nodes are then reinserted. We apply dropout to the hidden layer and to the input layer, using p = 0.2 in both cases.
      </p>
      <p>
        Optimization. The network parameters are learned using AdaDelta
        <xref ref-type="bibr" rid="ref13">(Zeiler, 2012)</xref>
        , which adapts the learning rate for each dimension using only first-order information. We used the default hyper-parameters. We train the parameters of the CNN using the three-phase procedure described in the introduction. Figure 2 depicts the general flow of this procedure.
      </p>
      <sec id="sec-2-1">
        <title>3.1 Three-Phase Training</title>
        <p>Preprocessing. We apply standard preprocessing procedures: normalizing URLs, hashtags and usernames, and lowercasing the tweets. The tweets are converted into lists of indices, where each index corresponds to the word's position in the vocabulary V. This representation is used as input for the lookup table to assemble the sentence matrix S.</p>
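<p>A sketch of such a preprocessing step (the concrete replacement tokens are our choice; the paper does not specify them):</p>

```python
import re

def preprocess(tweet):
    """Normalize URLs, user mentions and hashtags, then lowercase and tokenize."""
    tweet = re.sub(r"https?://\S+", "urltoken", tweet)   # normalize URLs
    tweet = re.sub(r"@\w+", "usertoken", tweet)          # normalize usernames
    tweet = re.sub(r"#(\w+)", r"hashtag \1", tweet)      # split off hashtag marker
    return tweet.lower().split()

tokens = preprocess("Che bella giornata @mario #sole http://example.com")
# each token is then mapped to its index in the vocabulary V
```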
        <p>
          Word embeddings. We create the word embeddings in phase P1 using word2vec
          <xref ref-type="bibr" rid="ref7 ref8">(Mikolov et al., 2013a)</xref>
          , training a skip-gram model on a corpus of 300M unlabelled Italian tweets. The window size for the skip-gram model is 5, the threshold for the minimal word frequency is set to 20, and the number of dimensions is d = 52 (in the gensim implementation of word2vec, using a d divisible by 4 speeds up the process). The resulting vocabulary contains 890K unique words. The word embeddings account for the majority of the network parameters (42.2M out of 46.6M) and are updated during the next two phases to introduce sentiment-specific information into the word embeddings and create a good initialization for the CNN.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Distant-Supervised Phase</title>
        <p>We pre-train the CNN for 1 epoch on a weakly labelled dataset of 40M Italian tweets, where each tweet contains an emoticon. The label is inferred from the emoticons inside the tweet; we ignore tweets containing emoticons of opposite polarity. This results in 30M positive tweets and 10M negative tweets. Thus, the classifier is pre-trained on a binary classification task.</p>
        <p>Supervised Phase. During the supervised phase we train the pre-trained CNN on the provided annotated data. The CNN is trained jointly on all tasks of EvalItalia. There are four different binary labels as well as some restrictions, which result in 9 possible joint labels (for more details, see Section 3.2). The multi-task classifier is trained to predict the most likely joint label.</p>
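<p>The emoticon-based weak labelling might look as follows (the concrete emoticon sets are our assumption; the paper only states that labels are inferred from emoticons and that tweets with opposite emoticons are dropped):</p>

```python
POSITIVE = {":)", ":-)", ":D", ";)"}   # hypothetical positive emoticon set
NEGATIVE = {":(", ":-(", ":'("}        # hypothetical negative emoticon set

def distant_label(tweet):
    """Infer a weak sentiment label from emoticons, or None to drop the tweet."""
    tokens = set(tweet.split())
    has_pos = bool(tokens & POSITIVE)
    has_neg = bool(tokens & NEGATIVE)
    if has_pos and has_neg:
        return None                    # opposite emoticons: ignore the tweet
    if has_pos:
        return "positive"
    if has_neg:
        return "negative"
    return None                        # no emoticon: not part of the weak dataset

label = distant_label("che bella giornata :)")
```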
        <p>We apply 9-fold cross-validation on the dataset, generating 9 equally sized buckets. In each round we train the CNN using early stopping on the held-out set, i.e. we train it as long as the score improves on the held-out set. For the multi-task training we monitor the scores for all 4 subtasks simultaneously and store the best model for each subtask. The training stops if there is no improvement in any of the 4 monitored scores.</p>
        <p>Meta-classifier. We train the CNN using 9-fold cross-validation, which results in 9 different models. Each model outputs nine real-valued numbers ŷ, corresponding to the probabilities for each of the nine classes. To increase the robustness of the system, we train a random forest which takes the outputs of the 9 models as its input. The hyperparameters were found via grid search to obtain the best overall performance on a development set: the number of trees (100), the maximum depth of the forest (3), and the number of features used per random selection (5).</p>
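<p>A sketch of the meta-classifier with scikit-learn, using the grid-searched hyper-parameters; stacking the 9 models' nine class probabilities into 81 features per tweet is our illustration, and the data here is synthetic:</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in for the stacked outputs of the 9 CNNs: each CNN emits 9 class
# probabilities, giving 81 input features per tweet.
X_train = rng.random((200, 9 * 9))
y_train = rng.integers(0, 9, size=200)

# Hyper-parameters found via grid search in the paper:
# 100 trees, maximum depth 3, 5 features per random selection.
meta = RandomForestClassifier(n_estimators=100, max_depth=3, max_features=5)
meta.fit(X_train, y_train)
pred = meta.predict(X_train[:5])
```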
      </sec>
      <sec id="sec-2-3">
        <title>3.2 Data</title>
        <p>The supervised training and test data is provided by the EvalItalia-2016 competition. Each tweet has four labels: L1: is the tweet subjective or objective? L2: is the tweet positive? L3: is the tweet negative? L4: is the tweet ironic? Furthermore, an objective tweet is implied to be neither positive nor negative, as well as not ironic. There are 9 possible combinations of labels.</p>
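<p>The count of 9 joint labels follows directly from this constraint, as a quick enumeration shows:</p>

```python
from itertools import product

def valid(subj, pos, neg, irony):
    """Constraint from the task definition: an objective tweet (subj = 0)
    is neither positive nor negative and not ironic."""
    if subj == 0:
        return pos == 0 and neg == 0 and irony == 0
    return True

# All assignments of the four binary labels that satisfy the constraint:
# 1 objective combination plus 2^3 = 8 subjective ones.
joint_labels = [c for c in product([0, 1], repeat=4) if valid(*c)]
print(len(joint_labels))   # 9
```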
        <p>To jointly train the CNN for all three tasks T1, T2 and T3, we join the labels of each tweet into a single label. In contrast, the single-task training trains a separate model for each of the four labels.</p>
        <p>
          Table 1 shows an overview of the available data. We compare the performance of the multi-task CNN with the performance of the single-task CNNs. All experiments start at the third phase, i.e. the supervised phase. Since there was no predefined split into training and development sets, we generated a development set by sampling 10% uniformly at random from the provided training set. The development set is needed when assessing the generalization power of the CNNs and the meta-classifier. For each task we compute the averaged F1-score
          <xref ref-type="bibr" rid="ref1">(Barbieri et al., 2016)</xref>
          . We present the results achieved on the dev-set and on the test-set used for the competition. We refer to the set which was held out during a cross-validation iteration as the fold-set.
        </p>
        <p>In Table 2 we show the average results obtained by the 9 CNNs after cross-validation. The scores show that the CNN is tuned too much towards the held-out folds, since the scores on the held-out folds are significantly higher. For example, the average score of the positivity task is 0.733 on the held-out sets but only 0.6694 on the dev-set and 0.6601 on the test-set. Similar differences in scores can be observed for the other tasks as well. To mitigate this problem we apply a random forest to the outputs of the 9 classifiers obtained by cross-validation. The results are shown in Table 3. The meta-classifier outperforms the average scores obtained by the CNNs by up to 2 points on the dev-set. The scores on the test-set show a slightly lower increase. In particular, the single-task classifier did not benefit from the meta-classifier, as the scores on the test-set decreased in some cases.</p>
        <p>The results show that the multi-task classifier outperforms the single-task classifier in most cases. There is some variation in the magnitude of the difference: the multi-task classifier outperforms the single-task classifier by 0.06 points on the negativity task in the test-set, but only by 0.005 points on the subjectivity task.</p>
        <p>Set        Training       Subjective  Positive  Negative  Irony
Fold-Set   Single Task    0.723       0.738     0.721     0.646
           Multi Task     0.729       0.733     0.737     0.657
Dev-Set    Single Task    0.696       0.650     0.685     0.563
           Multi Task     0.710       0.669     0.699     0.595
Test-Set   Single Task    0.705       0.652     0.696     0.526
           Multi Task     0.681       0.660     0.700     0.540</p>
        <p>Conclusion. In this work we presented a deep-learning approach for sentiment analysis. We described the three-phase training approach to guarantee a high-quality initialization of the CNN and showed the effects of using a multi-task training approach. To increase the robustness of our system we applied a meta-classifier on top of the CNN. The system was evaluated in the EvalItalia-2016 competition, where it achieved 1st place in the polarity task and high positions on the other two subtasks.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Barbieri</surname>
          </string-name>
          , Valerio Basile, Danilo Croce, Malvina Nissim, Nicole Novielli, and
          <string-name>
            <given-names>Viviana</given-names>
            <surname>Patti</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Overview of the EVALITA 2016 SENTiment POLarity Classification Task</article-title>
          . In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors,
          <source>Proceedings of Third Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2016</year>
          ) &amp;
          <article-title>Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</article-title>
          .
          <source>Final Workshop (EVALITA</source>
          <year>2016</year>
          ).
          <article-title>Associazione Italiana di Linguistica Computazionale (AILC).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Collobert</surname>
          </string-name>
          , Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Natural Language Processing (Almost) from Scratch</article-title>
          .
          <source>JMLR</source>
          ,
          <volume>12</volume>
          :
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Rie</given-names>
            <surname>Johnson</surname>
          </string-name>
          and Tong Zhang.
          <year>2015</year>
          .
          <article-title>Semi-supervised Convolutional Neural Networks for Text Categorization via Region Embedding</article-title>
          .
          <source>In NIPS 2015 - Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          , pages
          <fpage>919</fpage>
          -
          <lpage>927</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Nal</given-names>
            <surname>Kalchbrenner</surname>
          </string-name>
          , Edward Grefenstette, and
          <string-name>
            <given-names>Phil</given-names>
            <surname>Blunsom</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A Convolutional Neural Network for Modelling Sentences</article-title>
          .
          <source>In ACL - Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>655</fpage>
          -
          <lpage>665</lpage>
          , Baltimore, Maryland, USA, April.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional Neural Networks for Sentence Classification</article-title>
          .
          <source>In EMNLP 2014 - Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>1746</fpage>
          -
          <lpage>1751</lpage>
          ,
          <year>August</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Sentiment Analysis: Mining Opinions, Sentiments, and Emotions</article-title>
          .
          <source>Cambridge University Press</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Quoc V. Le, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          . 2013a.
          <article-title>Exploiting Similarities among Languages for Machine Translation</article-title>
          . arXiv, September.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013b</year>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Vinod</given-names>
            <surname>Nair</surname>
          </string-name>
          and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Rectified linear units improve restricted boltzmann machines</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Machine Learning (ICML-10)</source>
          , pages
          <fpage>807</fpage>
          -
          <lpage>814</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Aliaksei</given-names>
            <surname>Severyn</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Moschitti</surname>
          </string-name>
          . 2015a.
          <article-title>Twitter Sentiment Analysis with Deep Convolutional Neural Networks</article-title>
          .
          <source>In 38th International ACM SIGIR Conference</source>
          , pages
          <fpage>959</fpage>
          -
          <lpage>962</lpage>
          , New York, USA,
          <year>August</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Aliaksei</given-names>
            <surname>Severyn</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Moschitti</surname>
          </string-name>
          .
          <year>2015b</year>
          .
          <article-title>UNITN: Training Deep Convolutional Neural Network for Twitter Sentiment Classification</article-title>
          .
          <source>In SemEval 2015 - Proceedings of the 9th International Workshop on Semantic Evaluation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Nitish</given-names>
            <surname>Srivastava</surname>
          </string-name>
          , Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dropout: A Simple Way to Prevent Neural Networks from Overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>15</volume>
          :
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Matthew D.</given-names>
            <surname>Zeiler</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ADADELTA: An Adaptive Learning Rate Method</article-title>
          . arXiv, page 6.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>