Sentiment Analysis using Convolutional Neural Networks with Multi-Task
          Training and Distant Supervision on Italian Tweets

                       Jan Deriu                                  Mark Cieliebak
          Zurich University of Applied Sciences         Zurich University of Applied Sciences
                      Switzerland                                   Switzerland
                   deri@zhaw.ch                                  ciel@zhaw.ch

Abstract

English. In this paper, we propose a classifier for predicting the sentiment
of Italian Twitter messages. This work builds upon a deep learning approach
in which we leverage large amounts of weakly labelled data to train a 2-layer
convolutional neural network. To train our network we apply a form of
multi-task training. Our system participated in the EVALITA 2016 competition
and outperformed all other approaches on the sentiment analysis task.

Italiano (translated). In this paper, we present a system for classifying the
subjectivity and polarity of tweets written in Italian. The approach is based
on neural networks. In particular, we use a dataset of 300M tweets to train a
convolutional neural network. The system was trained and evaluated on the
data provided by the organizers of Sentipolc, the Twitter sentiment analysis
task organized as part of Evalita 2016.

1    Introduction

Sentiment analysis is a fundamental problem aiming to give a machine the
ability to understand the emotions and opinions expressed in a written text.
This is an extremely challenging task due to the complexity and variety of
human language.

The sentiment polarity classification task of EVALITA 2016¹ (sentipolc)
consists of three subtasks which cover different aspects of sentiment
detection: T1: Subjectivity detection: is the tweet subjective or objective?
T2: Polarity detection: is the sentiment of the tweet neutral, positive,
negative or mixed? T3: Irony detection: is the tweet ironic?

    ¹ http://www.di.unito.it/~tutreeb/sentipolc-evalita16/index.html

The classic approaches to sentiment analysis usually consist of manual
feature engineering followed by applying a classifier to these features
(Liu, 2015). Deep neural networks have shown great promise at capturing
salient features for these complex tasks (Mikolov et al., 2013b; Severyn and
Moschitti, 2015a). Convolutional Neural Networks (CNN) have been particularly
successful for sentiment classification (Kim, 2014; Kalchbrenner et al.,
2014; Severyn and Moschitti, 2015b; Johnson and Zhang, 2015), and our work
builds on them. These networks typically have a large number of parameters
and are especially effective when trained on large amounts of data.

In this work, we use a distant supervision approach to leverage large amounts
of data in order to train a 2-layer CNN². More specifically, we train a
neural network using the following three-phase procedure: P1: creation of
word embeddings for the initialization of the first layer, based on an
unlabelled corpus of 300M Italian tweets; P2: distant supervised phase, where
the network is pre-trained on a weakly labelled dataset of 40M tweets so that
the network weights and word embeddings capture aspects related to sentiment;
and P3: supervised phase, where the network is trained on the provided
supervised training data consisting of 7410 manually labelled tweets.

    ² We here refer to a layer as one convolutional and one pooling layer.

As the three tasks of EVALITA 2016 are closely related, we apply a form of
multi-task training as proposed by Collobert et al. (2011), i.e. we train one
CNN for all the tasks simultaneously. This has two advantages: i) we need to
train only one model instead of three, and ii) the CNN has access to more
information, which benefits the score. The experiments indicate that the
multi-task CNN performs better than the single-
task CNN. After a small bugfix regarding the data preprocessing, our system
outperforms all the other systems in the sentiment polarity task.

2   Convolutional Neural Networks

We train a 2-layer CNN using 9-fold cross-validation and combine the outputs
of the 9 resulting classifiers to increase robustness. The 9 classifiers
differ in the data used for the supervised phase, since cross-validation
creates 9 different training and validation sets.

The architecture of the CNN is shown in Figure 1 and described in detail
below.

Sentence model. Each word in the input data is associated with a
d-dimensional vector representation. A sentence (or tweet) is represented by
the concatenation of the representations of its n constituent words. This
yields a matrix $S \in \mathbb{R}^{d \times n}$, which is used as input to
the convolutional neural network.
The first layer of the network consists of a lookup table where the word
embeddings are represented as a matrix $X \in \mathbb{R}^{d \times |V|}$,
where V is the vocabulary. Thus the i-th column of X represents the i-th word
in the vocabulary V.

Convolutional layer. In this layer, a set of m filters is applied to a
sliding window of length h over each sentence. Let $S_{[i:i+h]}$ denote the
concatenation of the word vectors $s_i$ to $s_{i+h-1}$. A feature $c_i$ is
generated for a given filter F by:

$$c_i := \sum_{k,j} (S_{[i:i+h]})_{k,j} \cdot F_{k,j} \tag{1}$$

Concatenating the features $c_i$ over all windows of a sentence produces a
feature vector $c \in \mathbb{R}^{n-h+1}$. The vectors c are then aggregated
over all m filters into a feature map matrix
$C \in \mathbb{R}^{m \times (n-h+1)}$. The filters are learned during the
training phase of the neural network using a procedure detailed in the next
section.

Max pooling. The output of the convolutional layer is passed through a
non-linear activation function before entering a pooling layer. The latter
aggregates vector elements by taking the maximum over a fixed set of
non-overlapping intervals. The resulting pooled feature map matrix has the
form $C_{pooled} \in \mathbb{R}^{m \times \frac{n-h+1}{s}}$, where s is the
length of each interval. In the case of overlapping intervals with a stride
value $s_t$, the pooled feature map matrix has the form
$C_{pooled} \in \mathbb{R}^{m \times \frac{n-h+1-s}{s_t}}$. Depending on
whether the borders are included or not, the result of the fraction is
rounded up or down respectively.

Hidden layer. A fully connected hidden layer computes the transformation
$\alpha(W x + b)$, where $W \in \mathbb{R}^{m \times m}$ is the weight
matrix, $b \in \mathbb{R}^{m}$ the bias, and $\alpha$ the rectified linear
(relu) activation function (Nair and Hinton, 2010). The output vector of this
layer, $x \in \mathbb{R}^{m}$, corresponds to the sentence embedding of each
tweet.

Softmax. Finally, the outputs of the hidden layer $x \in \mathbb{R}^{m}$ are
fully connected to a soft-max regression layer, which returns the class
$\hat{y} \in \{1, \dots, K\}$ with the largest probability,

$$\hat{y} := \arg\max_j \frac{e^{x^\top w_j + a_j}}{\sum_{k=1}^{K} e^{x^\top w_k + a_k}} \tag{2}$$

where $w_j$ denotes the weight vector of class j and $a_j$ the bias of
class j.

Network parameters. Training the neural network consists in learning the set
of parameters $\Theta = \{X, F_1, b_1, F_2, b_2, W, a\}$, where X is the
embedding matrix, with each column containing the d-dimensional embedding
vector for a specific word; $F_i$, $b_i$ ($i \in \{1, 2\}$) the filter
weights and biases of the first and second convolutional layers; W the
concatenation of the weights $w_j$ for every output class in the soft-max
layer; and a the bias of the soft-max layer.

Hyperparameters. For both convolutional layers we set the length of the
sliding window h to 5 and the number of filters m to 200. The size of the
pooling interval s is set to 3 in both layers, with a stride of 2 in the
first layer.

Dropout. Dropout is a technique used to reduce overfitting (Srivastava et
al., 2014). In each training stage individual nodes are dropped with
probability p, the reduced neural net is updated and then the dropped nodes
are reinserted. We apply Dropout to the hidden layer and to the input layer
using p = 0.2 in both cases.

Optimization. The network parameters are learned using AdaDelta (Zeiler,
2012), which adapts the learning rate for each dimension using only first
order information. We used the default hyper-parameters.
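As a rough illustration of the optimizer named above, the following sketch applies the AdaDelta update rule of Zeiler (2012) to a toy quadratic objective. The values rho = 0.95 and eps = 1e-6 are assumptions (commonly used defaults); the text only states that default hyper-parameters were used. The names `adadelta_step` and `minimize_quadratic` are hypothetical and are not part of the described system.

```python
import math


def adadelta_step(x, grad, state, rho=0.95, eps=1e-6):
    """One AdaDelta update for a list of parameters x (Zeiler, 2012).

    state keeps two decayed running averages per dimension:
      state["g2"]  - average of squared gradients
      state["dx2"] - average of squared parameter updates
    rho and eps are assumed defaults, not values confirmed by the paper.
    """
    new_x = []
    for i, (x_i, g_i) in enumerate(zip(x, grad)):
        # Accumulate the squared gradient.
        state["g2"][i] = rho * state["g2"][i] + (1 - rho) * g_i * g_i
        # Per-dimension step: ratio of the two RMS values times the gradient.
        step = (-math.sqrt(state["dx2"][i] + eps)
                / math.sqrt(state["g2"][i] + eps) * g_i)
        # Accumulate the squared update.
        state["dx2"][i] = rho * state["dx2"][i] + (1 - rho) * step * step
        new_x.append(x_i + step)
    return new_x


def minimize_quadratic(steps=100):
    """Toy objective f(x) = sum(x_i ** 2), whose gradient is 2 * x_i."""
    x = [3.0, -2.0]
    state = {"g2": [0.0, 0.0], "dx2": [0.0, 0.0]}
    for _ in range(steps):
        grad = [2.0 * x_i for x_i in x]
        x = adadelta_step(x, grad, state)
    return x
```

No global learning rate appears in the update: the step size of each dimension is derived from the two running averages, which is what "adapts the learning rate for each dimension using only first order information" refers to.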
[Figure 1: The architecture of the CNN used in our approach. Stages shown,
from left to right: sentence matrix, convolutional feature map, pooled
representation, convolutional feature map, pooled representation, hidden
layer, softmax.]
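The layers of Section 2 can be traced end to end with a small pure-Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the non-linearity after each convolution is assumed to be relu, the second pooling step is assumed to take the maximum over each whole feature vector so that the hidden layer receives a vector of length m, and class indices are zero-based. All function names are hypothetical.

```python
import math


def conv(S, filters, biases):
    """Eq. (1) plus a bias: slide each filter over the columns of S.

    S is a list of rows (rows x n); every filter is rows x h.
    Returns one feature vector of length n - h + 1 per filter.
    """
    n = len(S[0])
    feature_map = []
    for F, bias in zip(filters, biases):
        h = len(F[0])
        feature_map.append([
            bias + sum(S[k][i + j] * F[k][j]
                       for k in range(len(F)) for j in range(h))
            for i in range(n - h + 1)
        ])
    return feature_map


def relu(C):
    return [[max(0.0, v) for v in row] for row in C]


def max_pool(C, s, stride=None):
    """Row-wise max over windows of length s, borders excluded.

    stride=None gives non-overlapping intervals (stride equal to s).
    """
    stride = stride or s
    return [[max(row[i:i + s]) for i in range(0, len(row) - s + 1, stride)]
            for row in C]


def hidden(W, b, x):
    """Fully connected layer with relu: alpha(W x + b)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]


def softmax_predict(x, W_out, a):
    """Eq. (2): index of the most likely class (zero-based) and all probabilities."""
    scores = [sum(w * v for w, v in zip(w_j, x)) + a_j
              for w_j, a_j in zip(W_out, a)]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs.index(max(probs)), probs


def forward(S, F1, b1, F2, b2, W, b, W_out, a, s=3, stride1=2):
    """Two convolutional layers, hidden layer and soft-max, as in Figure 1."""
    C1 = max_pool(relu(conv(S, F1, b1)), s, stride1)
    C2 = relu(conv(C1, F2, b2))
    x = [max(row) for row in C2]  # assumption: global max per filter
    return softmax_predict(hidden(W, b, x), W_out, a)
```

With h = 5, s = 3 and a stride of 2 in the first layer, a tweet needs at least 15 tokens for the second convolution to produce one window, so shorter inputs would have to be padded in this sketch.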

[Figure 2: The overall architecture of our 3-phase approach. Labels
recoverable from the diagram: Raw Tweets (200M), word2vec / GloVe, Word
Embeddings, Smiley Tweets (90M), Distant Supervision, Adapted Word Emb.,
3-Phase Training, 2-Layer, Unknown Tweet, Predictive Model.]

3       Training

We train the parameters of the CNN using the three-phase procedure described
in the introduction. Figure 2 depicts the general flow of this procedure.

3.1       Three-Phase Training

Preprocessing. We apply standard preprocessing procedures: normalizing URLs,
hashtags and usernames, and lowercasing the tweets. The tweets are converted
into a list of indices where each index corresponds to the word position in
the vocabulary V. This representation is used as input for the lookup table
to assemble the sentence matrix S.

Word Embeddings. We create the word embeddings in phase P1 using word2vec
(Mikolov et al., 2013a) and train a skip-gram model on a corpus of 300M
unlabelled Italian tweets. The window size for the skip-gram model is 5, the
threshold for the minimal word frequency is set to 20 and the number of
dimensions is d = 52³. The resulting vocabulary contains 890K unique words.
The word embeddings account for the majority of network parameters (42.2M out
of 46.6M parameters) and are updated during the next two phases to introduce
sentiment-specific information into the word embeddings and create a good
initialization for the CNN.

    ³ According to the gensim implementation of word2vec, using d divisible
      by 4 speeds up the process.

Distant Supervised Phase. We pre-train the CNN for 1 epoch on a weakly
labelled dataset of 40M Italian tweets where each tweet contains an emoticon.
The label is inferred from the emoticons inside the tweet, where we ignore
tweets with opposite emoticons. This results in 30M positive tweets and 10M
negative tweets. Thus, the classifier is trained on a binary classification
task.

Supervised Phase. During the supervised phase we train the pre-trained CNN
with the provided annotated data. The CNN is trained jointly on all tasks of
EVALITA 2016. There are four different binary labels as well as some
restrictions which result in 9 possible joint labels (for more details, see
Section 3.2). The multi-task classifier is trained to predict the most likely
joint label.
We apply 9-fold cross-validation on the dataset, generating 9 equally sized
buckets. In each round we train the CNN using early stopping on the held-out
set, i.e. we train it as long as the score improves on the held-out set. For
the multi-task training we monitor the scores for all 4 subtasks
simultaneously and store the best model for each subtask. The training stops
if there is no improvement of any of the 4 monitored scores.

Meta Classifier. We train the CNN using 9-fold cross-validation, which
results in 9 different models. Each model outputs nine real-valued numbers
ŷ corresponding to the probabilities for each of the nine classes. To
increase the robustness of the system we train a random forest which takes
the outputs of the 9 models as its input. The hyper-parameters were found via
grid-search to obtain the best overall performance over a development set:
number of trees (100), maximum depth of the trees (3) and the number of
features used per random selection (5).

3.2   Data

The supervised training and test data are provided by the EVALITA 2016
competition. Each tweet carries four labels: L1: is the tweet subjective or
objective? L2: is the tweet positive? L3: is the tweet negative? L4: is the
tweet ironic? Furthermore, an objective tweet is by definition neither
positive nor negative nor ironic. This leaves 9 possible combinations of
labels.
To jointly train the CNN for all three tasks T1, T2 and T3 we join the labels
of each tweet into a single label. In contrast, the single-task training
trains a separate model for each of the four labels.
Table 1 shows an overview of the data available.

Table 1: Overview of datasets provided in EVALITA 2016.

        Label              Training Set   Test Set
        Total                      7410       2000
        Subjective                 5098       1305
        Overall Positive           2051        352
        Overall Negative           2983        770
        Irony                       868        235

3.3   Experiments & Results

We compare the performance of the multi-task CNN with the performance of the
single-task CNNs. All the experiments start at the third phase, i.e. the
supervised phase. Since there was no predefined split into training and
development sets, we generated a development set by sampling 10% uniformly at
random from the provided training set. The development set is needed when
assessing the generalization power of the CNNs and the meta-classifier. For
each task we compute the averaged F1-score (Barbieri et al., 2016). We
present the results achieved on the dev-set and the test-set used for the
competition. We refer to the set which was held out during a cross-validation
iteration as fold-set.
In Table 2 we show the average results obtained by the 9 CNNs after the
cross-validation. The scores show that the CNN is tuned too much towards the
held-out folds, since the scores on the held-out folds are significantly
higher. For example, the average score of the positivity task is 0.733 on the
held-out sets but only 0.6694 on the dev-set and 0.6601 on the test-set.
Similar differences in scores can be observed for the other tasks as well.
To mitigate this problem we apply a random forest on the outputs of the 9
classifiers obtained by cross-validation. The results are shown in Table 3.
The meta-classifier outperforms the average scores obtained by the CNNs by up
to 2 points on the dev-set. The scores on the test-set show a slightly lower
increase in score. The single-task classifier in particular did not benefit
from the meta-classifier, as its scores on the test set decreased in some
cases.
The results show that the multi-task classifier outperforms the single-task
classifier in most cases. There is some variation in the magnitude of the
difference: the multi-task classifier outperforms the single-task classifier
by 0.06 points in the negativity task in the test-set but only by 0.005
points in the subjectivity task.

        Set        Task          Subjective   Positive   Negative   Irony
        Fold-Set   Single Task        0.723      0.738      0.721   0.646
                   Multi Task         0.729      0.733      0.737   0.657
        Dev-Set    Single Task        0.696      0.650      0.685   0.563
                   Multi Task         0.710      0.669      0.699   0.595
        Test-Set   Single Task        0.705      0.652      0.696   0.526
                   Multi Task         0.681      0.660      0.700   0.540

Table 2: Average F1-score obtained after applying cross-validation.

        Set        Task          Subjective   Positive   Negative   Irony
        Dev-Set    Single Task        0.702      0.693      0.695   0.573
                   Multi Task         0.714      0.686      0.717   0.604
        Test-Set   Single Task        0.712      0.650      0.643   0.501
                   Multi Task         0.714      0.653      0.713   0.536

Table 3: F1-Score obtained by the meta classifier.

4   Conclusion

In this work we presented a deep-learning approach for sentiment analysis. We
described the three-phase training approach to guarantee a high-quality
initialization of the CNN and showed the effects of using a multi-task
training approach. To increase the robustness of our system we applied a
meta-classifier on top of the CNN. The system was evaluated in the EVALITA
2016 competition, where it achieved 1st place in the polarity task and high
positions on the other two subtasks.
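
The joint-label construction described in Section 3.2 can be sketched as follows. The sketch enumerates the four binary labels and applies only the restriction stated in the text (an objective tweet is neither positive, negative nor ironic), which yields the 9 joint labels. The assignment of class ids to combinations and the names `joint_labels`, `encode` and `decode` are assumptions made for illustration; the paper does not specify them.

```python
from itertools import product


def joint_labels():
    """All admissible (subjective, positive, negative, ironic) combinations."""
    combos = []
    for subj, pos, neg, iro in product((0, 1), repeat=4):
        if subj == 0 and (pos or neg or iro):
            continue  # an objective tweet carries no polarity and no irony
        combos.append((subj, pos, neg, iro))
    return combos


# Hypothetical id assignment: enumeration order of itertools.product.
LABEL_TO_ID = {combo: idx for idx, combo in enumerate(joint_labels())}
ID_TO_LABEL = {idx: combo for combo, idx in LABEL_TO_ID.items()}


def encode(subj, pos, neg, iro):
    """Map the four binary labels of a tweet to one joint class id."""
    return LABEL_TO_ID[(subj, pos, neg, iro)]


def decode(class_id):
    """Recover the four binary labels from a predicted joint class id."""
    return ID_TO_LABEL[class_id]
```

A multi-task prediction is then read back per task by decoding the predicted class id, for example `decode(class_id)[3]` for the irony label.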
