         ELiRF-UPV at MultiStanceCat 2018

            José-Ángel González[0000−0003−3812−5792] , Lluı́s-Felip
       Hurtado[0000−0002−1877−0455] , and Ferran Pla[0000−0003−4822−8808]

                Departament de Sistemes Informàtics i Computació
                       Universitat Politècnica de València
                      {jogonba2,lhurtado,fpla}@dsic.upv.es



      Abstract. This paper describes the participation of the ELiRF-UPV team
      in the Spanish subtask of the MultiModal Stance Detection in Tweets
      on the Catalan #1Oct Referendum (MultiStanceCat) workshop. Our best
      approach is based on Convolutional Neural Networks using word embeddings
      and polarity/emotion lexicons. We obtained competitive results on the
      Spanish subtask using only the text of the tweets, dispensing with
      contexts and images.

      Keywords: Deep Learning · Stance Detection · Convolutional Neural
      Networks.


1   Introduction
Stance detection consists of automatically determining from a text whether its
author is in favor of a given target, against it, or whether neither inference
is likely. Several international competitions have recently addressed this
problem: Detecting Stance in Tweets, Task 6 at SemEval-2016 [5], and Stance and
Gender Detection in Tweets on Catalan Independence (StanceCat-2017) [9].
    The MultiModal Stance Detection in Tweets on the Catalan #1Oct Referendum
(MultiStanceCat) workshop is one of the tracks proposed at the IberEval 2018
workshop [10]. The aim of this track is to detect the stance with respect to
the target "independence of Catalonia" in tweets written in Spanish and
Catalan. Furthermore, it is a multimodal task: both the text of the tweet and
up to ten pictures from the user's timeline can be taken into account to
determine the stance.


2   Task Dataset
The corpus is composed of tweets labeled with respect to the stance on the
Catalan 1st of October Referendum (2017). There are three classes: AGAINST
(AG), NEUTRAL (NE), and FAVOR (FA). The tweets are provided in Spanish and
Catalan; however, we worked only on the Spanish subtask. Moreover, although
the context of each tweet and images are also provided by the organizers, we
only used the text of the tweet.
Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018)





           From the official training corpus, we randomly selected 80% in order to train
        our models. The remaining 20% was used as development set. Table 1 shows the
        sample distribution per class in the Spanish corpus.


                 Table 1. Number of samples per class in the Spanish training corpus.

                                                     Train Dev
                                                  AG 1431 355
                                                  NE 758 213
                                                  FA 1360 320
                                                  Σ 3549 888



        3     System Description

In this section, we describe the two models used in the competition. Both
models share the same preprocessing of the tweets by means of the TweetMotif
[2] package. We applied a normalization step consisting of lowercasing the
words, removing some language-specific characters (accents, diereses, and
other special characters), and normalizing Twitter-specific tokens (hashtags,
user mentions, and URLs) by replacing them with a fixed word, e.g. #1octL6 → #hashtag.
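A minimal sketch of this normalization step (assuming a simple regex-based approach rather than the actual TweetMotif output, and placeholder words that are our own choices) could look like:

```python
import re

def normalize_tweet(text):
    """Lowercase, replace Twitter-specific tokens, and strip accents/diereses."""
    text = text.lower()
    # Replace Twitter-specific tokens with a fixed word each,
    # e.g. "#1octL6" -> "#hashtag" (URLs first, so "#"/"@" inside them are safe)
    text = re.sub(r"https?://\S+", "url", text)
    text = re.sub(r"@\w+", "@user", text)
    text = re.sub(r"#\w+", "#hashtag", text)
    # Remove language-specific characters: accents and diereses
    replacements = str.maketrans("áéíóúüàèìòùï", "aeiouuaeioui")
    return text.translate(replacements)

print(normalize_tweet("Visca #1octL6 según @usuario https://t.co/x"))
```

The order of the substitutions matters: URLs are replaced before hashtags and mentions so that characters inside a URL are not mistaken for tokens.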
    As the first model in our experimentation, we used a Support Vector
Machine (SVM) classifier with different representations of the tweets.
Specifically, we used bag-of-word-ngrams and bag-of-char-ngrams with several
values of n, including combinations of n-grams; e.g., bag-of-word-1-4grams
denotes the concatenation of all word n-grams with n ∈ [1, 4].
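As an illustration, the bag-of-word-1-4grams SVM baseline can be sketched with scikit-learn (the tweets and labels below are toy placeholders, not the MultiStanceCat data; the actual hyperparameters were tuned on the development set):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# bag-of-word-1-4grams: concatenation of all word n-grams with n in [1, 4]
word_ngrams = CountVectorizer(analyzer="word", ngram_range=(1, 4))
# a bag-of-char-ngrams variant would instead use analyzer="char", e.g. ngram_range=(1, 6)
clf = make_pipeline(word_ngrams, LinearSVC())

# Toy example with hypothetical normalized tweets
X = ["no al referendum", "viva la independencia", "hoy llueve en #hashtag"]
y = ["AG", "FA", "NE"]
clf.fit(X, y)
print(clf.predict(["viva la independencia"]))
```

`CountVectorizer(ngram_range=(1, 4))` extracts all n-grams for n from 1 to 4 in a single feature space, which matches the "concatenation of n-grams" described above.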
    As the second model, we used a Convolutional Neural Network (CNN)
architecture inspired by the work presented in [12], with the aim of obtaining
representations of the tweets similar to continuous versions of the
bag-of-ngrams. We represented the tweets using Word2Vec distributed
representations of words [3, 4]. Moreover, to enrich the system, we combined
several polarity/emotion lexicons with the word embeddings.
    We used ELHPolar [8], iSOL [7], ML-SentiCon [1], and the Spanish version
of NRC [6] as lexicons. As word embeddings, we trained a skip-gram model with
300 dimensions per word on 87 million Spanish tweets collected in previous
experimental work.
    We represent each tweet x as a matrix S ∈ R^{n×(d+v)}, where n is the
maximum number of words per tweet, d is the dimensionality of the word
embeddings, and v is the dimensionality of the polarity/emotion features, that
is, the number of polarity/emotion lexicons. To obtain this representation, we
use an embedding model h(w) ∈ R^d and a set of lexicons
h′(w) = [h′_1(w), h′_2(w), ..., h′_v(w)] ∈ R^v, where h′_k(w) is the polarity
value of the word w in lexicon k.
    Therefore, given a tweet x with n tokens, x = w_1, w_2, ..., w_n, we
represent it as a matrix S in which each row i is the concatenation of the
embedding of w_i (h(w_i)) and a vector with the polarity values of w_i in each
lexicon (h′(w_i)),





S = [h(w_1)|h′(w_1), h(w_2)|h′(w_2), ..., h(w_n)|h′(w_n)]. If a word w_i is
out of vocabulary for the embedding model, we replace its embedding with that
of the word "unknown", h(w_i) = h("unknown"). Similarly, if w_i is not
included in any lexicon, h′(w_i) = [0, 0, ..., 0] ∈ R^v.
    Due to the variable length of the tweets, we apply zero padding at the
start of a tweet if it does not reach the maximum specified length;
conversely, if the length of a tweet exceeds the maximum, we only consider its
first n words. In this task, the average number of words per tweet is
n_avg = 18.5 and the maximum length is n_max = 34. We set the length to
n = 26, approximately the mean of n_avg and n_max.
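Under hypothetical embedding and lexicon lookups (the dictionaries below are toy placeholders for the trained skip-gram model and the four lexicons), building the padded sentence matrix S ∈ R^{n×(d+v)} might be sketched as:

```python
import numpy as np

def tweet_matrix(tokens, emb, lexicons, n=26, d=300):
    """Build S in R^{n x (d+v)}: each row is [h(w) | h'(w)], zero-padded at the start."""
    v = len(lexicons)
    rows = []
    for w in tokens[:n]:                                 # truncate tweets longer than n
        h = emb.get(w, emb["unknown"])                   # OOV words map to h("unknown")
        h_prime = [lex.get(w, 0.0) for lex in lexicons]  # 0 if absent from a lexicon
        rows.append(np.concatenate([h, h_prime]))
    S = np.zeros((n, d + v))
    if rows:
        S[n - len(rows):] = np.stack(rows)               # zero padding at the start
    return S

# Toy lookups with hypothetical values
emb = {"referendum": np.ones(300), "unknown": np.zeros(300)}
lexicons = [{"referendum": 0.5}, {}]                     # v = 2 lexicons here
S = tweet_matrix(["referendum", "oov_word"], emb, lexicons)
print(S.shape)
```

Padding at the start (rather than the end) follows the description above; short tweets occupy the last rows of S.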
    Regarding the CNN architecture, we applied one-dimensional convolutions
with variable-height filters in order to extract the temporal structure of the
tweet over several region sizes. Figure 1 summarizes the model architecture
and its hyperparameters.


[Figure 1 pipeline: sentence matrix (R^{n×(d+v)}) → Batch Normalization →
Convolution with 4 region sizes ([1, 4]) and 256 filters per region size
(1024 filters in total) → Batch Normalization + ReLU (256 feature maps per
region size, R^{4×n×256}) → 1D Global Max Pooling (256 salient features per
region size, R^{4×256}) → Concatenation + Batch Normalization (R^{1024}) →
softmax fully-connected layer (3 classes).]
                                      Fig. 1. CNN architecture for multi-class classification.



As can be seen in Figure 1, we used 4 different region sizes (the filter
height ranges from 1 to 4) and 256 filters for each region size. We chose this
range of region sizes because, in the development phase, the best baseline was
the SVM using bag-of-word-1-4grams. After the filters were applied, we
obtained 256 output feature maps for each region size.
    In order to extract the most salient features for each region size, we
applied 1D Global Max Pooling to the feature maps of each region size. We thus
obtained 4 vectors of 256 components, which were concatenated and used as





input to a fully-connected layer that performs the classification. We used a
softmax activation function at the output layer to model the posterior
distribution over the classes.
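The forward pass described above can be sketched framework-free in NumPy (batch normalization is omitted for brevity, and the weights here are random placeholders, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
n, dv = 26, 304                       # words per tweet, d + v input channels

def conv1d(S, W):
    """Valid 1-D convolution: S is (n, dv), W is (h, dv, f) -> (n-h+1, f)."""
    h, _, f = W.shape
    return np.stack([np.tensordot(S[i:i + h], W, axes=([0, 1], [0, 1]))
                     for i in range(S.shape[0] - h + 1)])

S = rng.standard_normal((n, dv))      # a (already normalized) sentence matrix
branches = []
for h in (1, 2, 3, 4):                # 4 region sizes, 256 filters each
    W = rng.standard_normal((h, dv, 256)) * 0.01
    fmap = np.maximum(conv1d(S, W), 0)          # ReLU feature maps
    branches.append(fmap.max(axis=0))           # 1D global max pooling -> (256,)
feats = np.concatenate(branches)                # (1024,) concatenated features

Wo = rng.standard_normal((1024, 3)) * 0.01      # softmax output layer, 3 classes
logits = feats @ Wo
probs = np.exp(logits) / np.exp(logits).sum()   # posterior over AG / NE / FA
print(feats.shape, probs.shape)
```

The shapes trace Figure 1: four pooled vectors of 256 components are concatenated into a 1024-dimensional representation before the 3-class softmax.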

4   Experimental Results

In this section, we describe the experimental work conducted by the ELiRF team
for the MultiStanceCat task. In addition, we present a study of the
performance of our best system in the competition.
    Table 2 summarizes the results obtained in the development phase. Three
different classifiers were considered: a linear SVM with bag-of-word-ngrams, a
linear SVM with bag-of-char-ngrams, and CNNs.
    For the SVM approaches, we tested different values of n. With respect to
the CNN, we explored three loss functions: Categorical Cross Entropy (CCE),
Mean Squared Error (MSE), and a differentiable approximation of the F1 measure
(SMF1).
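The paper does not give the exact form of SMF1; one common differentiable approximation (an assumption here, not necessarily the authors' exact formulation) replaces the counts in precision and recall with sums of predicted probabilities, so that a soft macro-F1 can be optimized by gradient descent:

```python
import numpy as np

def soft_macro_f1_loss(y_true, y_prob, eps=1e-8):
    """Differentiable macro-F1 approximation.
    y_true: one-hot labels (N, C); y_prob: softmax outputs (N, C)."""
    tp = (y_true * y_prob).sum(axis=0)          # soft true positives per class
    fp = ((1 - y_true) * y_prob).sum(axis=0)    # soft false positives
    fn = (y_true * (1 - y_prob)).sum(axis=0)    # soft false negatives
    soft_f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return 1.0 - soft_f1.mean()                 # minimize 1 - macro soft-F1

y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
perfect = soft_macro_f1_loss(y_true, y_true)    # near 0 for perfect predictions
print(perfect)
```

Because every term is a smooth function of the predicted probabilities, this surrogate can replace CCE or MSE as the training loss of the CNN.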


        Table 2. Results obtained with the different approaches considered in the development
        phase.

        Experiments                               F1(AG)  F1(NE)  F1(FA)  (F1(AG)+F1(FA))/2
        Linear SVM   Word Ngrams    1-1-grams      59.48   46.19   57.06        58.27
                                    1-2-grams      62.95   49.73   56.07        59.51
                                    1-3-grams      64.09   52.25   59.80        61.94
                                    1-4-grams      64.87   49.86   59.30        62.09
                     Char Ngrams    1-1-grams      47.46   39.90   53.32        50.39
                                    1-2-grams      57.25   48.49   37.40        47.32
                                    1-3-grams      55.91   45.54   51.52        53.72
                                    1-4-grams      58.63   48.02   51.52        53.72
                                    1-5-grams      59.55   48.76   55.59        57.57
                                    1-6-grams      60.76   49.12   57.36        59.06
        CNN          Embeddings     MSE            65.27   50.91   60.56        62.91
                     + Lexicons     CCE            64.37   50.56   61.12        62.75
                                    SMF1           67.13   50.15   63.51        65.32



As can be observed in Table 2, the bag-of-char representations generally
perform worse than the bag-of-word ones, and the CNN models outperform the
results achieved by the SVM classifiers. Moreover, the CNN classifier with the
SMF1 loss function outperforms all the other classifiers. However, a deeper
study of which factors, such as the embeddings or the lexicons, contribute
most to these results would be of interest.





    It can also be observed that the F1 measure for the NEUTRAL class
(F1(NE)) is generally lower than the F1 measures for the AGAINST and FAVOR
classes (F1(AG), F1(FA)). We hypothesize that this is because the NEUTRAL
class has fewer samples in the corpus. However, low values of F1(NE) do not
affect the official evaluation measure, which is defined as the average of
F1(AG) and F1(FA).
    For the Spanish subtask competition, we selected the best CNN and SVM
models according to the results obtained in the development phase.
Specifically, our first run (ELiRF-1) was the CNN model trained with the SMF1
loss function; as a second run (ELiRF-2), we selected a linear SVM with
bag-of-word-1-4grams.
    Table 3 shows the confusion matrices of the two submitted systems. It can
be observed that both systems confuse the NEUTRAL and AGAINST classes in a
similar way; the better performance of the ELiRF-1 run is due to its better
prediction of the FAVOR class.


                     Table 3. Confusion matrices for ELiRF-1 and ELiRF-2 systems.

                                        ELiRF-1              ELiRF-2
                                        Predicted            Predicted
                                        AG    NE    FA       AG    NE    FA
                           Truth  AG   240    18    97      241    21    93
                                  NE    54    86    73       56    86    71
                                  FA    66    26   228       91    25   204
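Per-class F1 and the official measure can be recomputed directly from the ELiRF-1 confusion matrix in Table 3; the resulting values reproduce the SMF1 row of Table 2 (67.13, 50.15, 63.51; official measure 65.32), which suggests these matrices were computed on the development set:

```python
import numpy as np

def per_class_f1(cm):
    """F1 per class from a confusion matrix with true classes as rows."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # column sums = predicted counts
    recall = tp / cm.sum(axis=1)      # row sums = true counts
    return 2 * precision * recall / (precision + recall)

# ELiRF-1 confusion matrix from Table 3 (rows: truth AG, NE, FA)
cm = [[240, 18, 97],
      [54, 86, 73],
      [66, 26, 228]]
f1_ag, f1_ne, f1_fa = per_class_f1(cm)
official = (f1_ag + f1_fa) / 2        # official measure: mean of F1(AG), F1(FA)
print(round(100 * f1_ag, 2), round(100 * f1_ne, 2),
      round(100 * f1_fa, 2), round(100 * official, 2))
```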




We also studied the samples that the ELiRF-1 system misclassified with high
confidence; some of them are shown in Table 4. We think that in some cases the
errors could have been avoided by considering hashtags (sample 5, #4gatos) or
user mentions (sample 2, @CatalunyaPlural). Unfortunately, we did not include
this information in our models.


                   Table 4. Examples of misclassifications with maximum confidence.

 (1) Class FA, Predicted AG with 100% confidence: #1octL6 No puede haber referéndum pactado por
 políticos Estáis vendiendo una falacia La soberanía es nuestra [EN: There can be no referendum
 negotiated by politicians. You are selling a fallacy. Sovereignty is ours.]

 (2) Class FA, Predicted NE with 99.66% confidence: La policía nacional cita a declarar al líder del Partido
 Pirata de Catalunya por el referéndum https://t.co/dOq4igEDKf @CatalunyaPlural #1O [EN: The national
 police summon the leader of the Pirate Party of Catalonia to testify over the referendum]

 (3) Class NE, Predicted FA with 99.99% confidence: Gran vídeo para aquellos que sois normales. #his-
 panoMola #A3dias1OctARV #1octubreARV https://t.co/e8618WfUmA [EN: Great video for those of you
 who are normal.]

 (4) Class NE, Predicted AG with 99.99% confidence: #1octL6 yo pensaba que en la jornada de reflexión no
 se puede debatir ???????? [EN: I thought that debating was not allowed during the day of reflection]

 (5) Class AG, Predicted FA with 100% confidence: @InesArrimadas @carrizosacarlos ya era hora que vuestra
 mayoría silenciosa saliera a la calle. #4gatos #1O. . . https://t.co/RdRdn8eWnH [EN: it was about time your
 silent majority took to the streets. #4gatos ("four cats", i.e. hardly anyone)]

 (6) Class AG, Predicted NE with 99.99% confidence: De la Revolución de las Sonrisas al Conflicto Civil
 https://t.co/HwVbPBmEKM #1octL6 https://t.co/URtNC6YOXv [EN: From the Revolution of Smiles to
 Civil Conflict]





Table 5 shows the results on the test set for all the teams participating in
the Spanish task. The ELiRF-1 run obtained competitive results without using
the text of previous and next tweets or the images in the user timeline.
Moreover, we can observe that the context seems to be useful for this task,
since all the best-performing teams used this information. Finally, we would
like to highlight the large difference between the results obtained on the
development set and on the test set. We have no explanation for this, but we
think this aspect should be studied when the labeled test set becomes
available.


                               Table 5. Test results on the Spanish subtask.

                               Team       Run                 Macro F1
                               uc3m       text+context        28.02
                               CriCa      context             27.15
                               Casacufans text+context+images 27.09
                               Casacufans text+context        26.98
                               ELiRF-1    text                22.74
                               uc3m       text                22.47
                               CriCa      text                22.06
                               Casacufans text                21.94
                               ELiRF-2    text                21.32



        5     Conclusions and Future Work

In this paper, we have presented the participation of the ELiRF team in the
MultiStanceCat track of the IberEval workshop. Our team participated in the
Spanish subtask of this track and achieved competitive results using only the
text of the tweets. Our best approach is based on a CNN with a sequential
representation of the tweets using word embeddings and polarity/emotion
lexicons.
    As future work, we plan to include the context of the tweet in our deep
learning system, in a similar way as Hierarchical Attention Networks [11] do.
Moreover, we think that data augmentation could help to improve the
performance of the models.
    We have also observed that hashtags and user mentions contain relevant
information for this task. For this reason, we want to explore the inclusion
of this information in the tweet representation.


        6     Acknowledgements

This work has been partially supported by the Spanish MINECO and FEDER funds
under the project AMIC (TIN2017-85854-C4-2-R). The work of José-Ángel González
is also financed by the Universitat Politècnica de València under grant
PAID-01-17.





        References
         1. Cruz, F.L., Troyano, J.A., Pontes, B., Ortega, F.J.: Building layered, multilingual
            sentiment lexicons at synset and lemma levels. Expert Systems with Applications
            41(13), 5984 – 5994 (2014)
 2. Krieger, M., Ahn, D.: TweetMotif: exploratory search and topic summarization for
    Twitter. In: Proceedings of the International AAAI Conference on Weblogs and
    Social Media (ICWSM) (2010)
         3. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of
            word representations in vector space. CoRR abs/1301.3781 (2013),
            http://arxiv.org/abs/1301.3781
         4. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed represen-
            tations of words and phrases and their compositionality. CoRR abs/1310.4546
            (2013), http://arxiv.org/abs/1310.4546
         5. Mohammad, S., Kiritchenko, S., Sobhani, P., Zhu, X., Cherry, C.: Semeval-2016
            task 6: Detecting stance in tweets. In: Proceedings of the 10th International Work-
            shop on Semantic Evaluation (SemEval-2016). pp. 31–41 (2016)
         6. Mohammad, S.M., Turney, P.D.: Crowdsourcing a word-emotion association lexi-
            con. Computational Intelligence 29(3), 436–465 (2013)
         7. Molina-González, M.D., Martı́nez-Cámara, E., Martı́n-Valdivia, M.T., Perea-
            Ortega, J.M.: Semantic orientation for polarity classification in spanish reviews.
            Expert Systems with Applications 40(18), 7250 – 7257 (2013)
 8. Saralegi, X., San Vicente, I.: Elhuyar at TASS 2013. In: XXIX Congreso de la So-
    ciedad Española de Procesamiento del Lenguaje Natural, Workshop on Sentiment
    Analysis at SEPLN (TASS 2013). pp. 143–150 (2013)
         9. Taulé, M., Martı́, M.A., Rangel, F.M., Rosso, P., Bosco, C., Patti, V., et al.:
            Overview of the task on stance and gender detection in tweets on catalan inde-
            pendence at ibereval 2017. In: 2nd Workshop on Evaluation of Human Language
            Technologies for Iberian Languages, IberEval 2017. vol. 1881, pp. 157–177. CEUR-
            WS (2017)
10. Taulé, M., Rangel, F.M., Martí, M.A., Rosso, P.: Overview of the task on multi-
    modal stance detection in tweets on Catalan #1Oct referendum. In: Third Work-
    shop on Evaluation of Human Language Technologies for Iberian Languages,
    IberEval 2018 (2018)
        11. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention
            networks for document classification. In: Proceedings of the 2016 Conference of
            the North American Chapter of the Association for Computational Linguistics:
            Human Language Technologies. pp. 1480–1489 (2016)
        12. Zhang, Y., Wallace, B.: A sensitivity analysis of (and practitioners’ guide to) con-
            volutional neural networks for sentence classification. In: Proceedings of the Eighth
            International Joint Conference on Natural Language Processing (Volume 1: Long
            Papers). pp. 253–263. Asian Federation of Natural Language Processing (2017),
            http://aclweb.org/anthology/I17-1026



