ELiRF-UPV at IberEval 2017: Classification Of
      Spanish Election Tweets (COSET)

               José-Ángel González, Ferran Pla, Lluı́s-F. Hurtado

                Departament de Sistemes Informàtics i Computació
                       Universitat Politècnica de València
                    {jogonba2,fpla,lhurtado}@dsic.upv.es


      Abstract. This paper describes the participation of ELiRF-UPV team
      at Classification Of Spanish Election Tweets (COSET) task. We tested
      several approaches based on different classifiers and features representa-
      tions. Our main approach is based on neural networks, concretely, Multi-
      layer Perceptrons (MLP) with bag-of-words representation of the tweets.
      Our system achieved the best score on the test set of the COSET task
      with 64.82 of macro-F1 .

      Keywords: Neural Networks, MLP, bag-of-words


1   Introduction
Text Categorization (TC) is a well-known and widely applied machine-learning
technique for classifying textual documents into categories [12]. The goal of TC
is the classification of documents into a fixed number of predefined categories.
    Recently, there have been several works that adapt traditional TC methods
for classifying Twitter posts into a predefined set of generic classes [3] [8].
    The aim of COSET shared task at Ibereval 2017 [4] is to classify a corpus of
political tweets in five categories: political issues related to electoral confronta-
tion, sectoral policies, personal questions about the candidates, issues of the
electoral campaign and other issues. This data has been collected during the
Spanish electoral campaign of 2015.
    The present paper describes the participation of the ELiRF-UPV team at
COSET shared task, the way in which the task has been addressed and the
results obtained using different approaches.


2   Corpus Description
The corpus is composed by tweets belonging to five categories: political issues,
related to the most abstract electoral confrontation; policy issues, about sectorial
policies; personal issues, on the life and activities of the candidates; campaign
issues, related with the evolution of the campaign; and other issues.
    These tweets are written in Spanish and were collected during the general
Spanish elections of 2015. The organization provided this corpus divided in three
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


           parts. The training set (2242 tweets), the development set (250 tweets) and the
           test set (624 tweets).


               Table 1. Topic distribution of the tweets in the Training and Development sets.

                            Topic            #tweets Training #tweets Development
                            Political issues       530 (23%)             57 (23%)
                            Policy issues          786 (35%)             89 (35%)
                            Campaign issues        511 (23%)             71 (28%)
                            Personal issues        152 ( 7%)               9 ( 4%)
                            Other issues           263 (12%)             25 (10%)
                            Total                2242 (100%)           250 (100%)


               Note that the topic distribution of the corpus is not uniform (see Table 1).
           The three majority classes (political issues, policy issues and personal issues)
           represent more than the 80% of the samples. Moreover, both training and de-
           velopment sets have similar distributions of tweets per category.


           3     System Description
           In this section we describe the main characteristics of the system developed to the
           COSET task competition. This description includes the preprocessing used, the
           feature selection process and the different models that were taken into account
           during the tuning phase.

           3.1     Preprocessing
           In the preprocessing conducted all the tweets were converted to lowercase and the
           accents were removed. In all cases, Url’s, user’s mentions, numbers, exclamations
           and interrogations are replaced by a specific label. On the contrary, hashtags were
           not replaced because we observed that hashtags were relevant in the classification
           process.

           3.2     Models and Feature Selection
           Different classifiers and tweets representations were considered. Three models
           were tested: Support Vector Machines (SVM) with linear kernel, Multilayer Per-
           ceptron (MLP), and deep learning models (CNN+LSTM), that we have recently
           used in Sentiment Analysis tasks [5].
               We selected the most appropriate tweets representations for each considered
           model. This way, we tested SVM and MLP models with bag-of-ngrams (n ≥ 1) of
           words and characters and TF-IDF weighting; and CNN+LSTM with sequences
           of Spanish Wikipedia embeddings [9] [10] [13] [11] and sequences of one-hot
           vectors.


                                                                                                                        56
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


              Table 2 shows the results of the different combinations considered in the
           development phase.


            Table 2. Results on the development set with different models and representations.

                                  System     Features            Macro-F1
                                  MLP        BOW                   68.23
                                  MLP        TF-IDF                 59.31
                                  MLP        Collapse embeddings    46.73
                                  SVM-Linear BOW                    58.66
                                  SVM-Linear TF-IDF                 55.16
                                  SVM-Linear Collapse embeddings    48.85
                                  CNN+LSTM One-Hot sequence         40.68
                                  CNN+LSTM Embedding sequence       55.43


               The best results were obtained using Multilayer Perceptron (MLP) models
           with bag-of-words (BOW) as representation of tweets. Support Vector Machines
           also presented good results, in particular by using BOW representations. Con-
           trary to what we expected, deep learning models obtained the worst results.
               Specifically, our best model was a Multilayer Perceptron of 3 hidden layers
           with 128 neurons and ReLU as activation function in all layers (except Softmax
           in the output layer).
               The optimization method was Adagrad [2], which has provided the best re-
           sults in similar TC tasks, and categorical cross-entropy as a loss function. In
           addition, a scaling of the loss function is used to deal with the problem of unbal-
           anced classes. This scaling method has the following form: floss (x) · log(µ · nnrc )
           where nr is the number of samples of the majority class and nc is the number
           of samples of the class of sample x.


           4     Results
           In view of the results obtained during the tuning phase, we decided to send four
           systems with the same base architecture (Multilayer Perceptron with bag-of-
           words representation), but with some minor changes:

             – System-1. A MLP trained using all data (training and development sets)
               with the parameters adjusted on the development set.
             – System-2. A MLP trained using only the training set.
             – System-3. A MLP trained using with all data, but without scaling the loss
               function.
             – System-4. A majority voting schema of the three previous systems.

              The official results achieved by our four systems are shown in Table 3. We
           have also included, in parenthesis, the position of each system in the ranking of
           the COSET competition.


                                                                                                                        57
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


                                    Table 3. Results of ELiRF-UPV systems.

                                                System   Macro-F1
                                                System-1 64.82 (1)
                                                System-2 62.33 (5)
                                                System-3 63.33 (4)
                                                System-4 64.00 (2)


              The best results of the competition were obtained by our System-1, which
           achieved 64.82 of macro-F1 on the test set. The scaling of the loss function
           adjusted on the development set also behaved well when the system is trained
           with all samples.
               System-2 obtained the worst results of our submissions (5th position over 39
           submissions). Note that, this was the system learned with less data. System-3 was
           learned with the same samples of System-1 but without scaling the loss function.
           From the results achieved by System-3, we want to highlight the importance of
           dealing with the problem of unbalanced classes.
              The majority voting proposed (System-4) could not outperform the results of
           the best model (System-1). Perhaps, a more sophisticated combination method
           could obtain better results.
               Finally, our team has participated in the second evaluation phase proposed
           by the COSET organization. In this second phase, a new corpus was automat-
           ically labeled with the agreement of four of the five participant runs (80% of
           agreement). A 65.91% of the considered tweets achieved the agreement crite-
           rion. Therefore, the final corpus size was 10,417,058 tweets. Table 4 shows the
           results of the second evaluation phase, sorted by macro-F1 measure. We achieved
           the best two results in terms of macro-F1 in this second phase. In this case, the
           best system, ELiRF-UPV.1, was trained with the whole COSET corpus and the
           second system, ELiRF-UPV.2, was learned using only the training set of the
           COSET corpus.


                          Table 4. Results of ELiRF-UPV systems with large corpus.

                                               Team        macro-F1
                                               ELiRF-UPV.1 95.86
                                               ELiRF-UPV.2 95.23
                                               Team 17     94.82
                                               atoppe      89.60
                                               LTRC IIITH 85.09


                                                                                                                        58
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


           5     Conclusions and Future Work

           We have presented the participation of ELiRF-UPV team at Classification Of
           Spanish Election Tweets (COSET) task. We tested several approaches. The best
           results were obtained by Multilayer Perceptrons models with bag-of-words rep-
           resentation. Our system achieved the best score on the competition with 64.82
           of macro-F1 .
               In this task, the scaling of the loss function seems to be a key factor to
           improve the results, as it can be observed in the final ranking, where the system
           that uses this scaling performs better than the systems that do not use it.
               As future work, we plan to study the generation of synthetic samples to tackle
           with the unbalance problem. We want to test data-augmentation techniques and
           generative models as SMOTE [1], GAN [6], or VAE [7].


           Acknowledgements

           This work has been partially supported by the Spanish MINECO and FEDER
           founds under project ASLP-MULAN: Audio, Speech and Language Processing
           for Multimedia Analytics, TIN2014-54288-C4-3-R.


           References

            1. Bowyer, K.W., Chawla, N.V., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic
               minority over-sampling technique. CoRR abs/1106.1813 (2011), http://arxiv.
               org/abs/1106.1813
            2. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning
               and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (Jul 2011), http:
               //dl.acm.org/citation.cfm?id=1953048.2021068
            3. Fiaidhi, J., Mohammed, S., Islam, A., Fong, S., Kim, T.h.: Developing a hierarchi-
               cal multi-label classifier for twitter trending topics. International Journal of U-&
               E-Service, Science & Technology 6(3) (2013)
            4. Giménez, M., Baviera, T., Llorca, G., Gámir, J., Calvo, D., Rosso, P., Rangel, F.:
               Overview of the 1st Classification of Spanish Election Tweets Task at IberEval
               2017. In: Proceedings of the Second Workshop on Evaluation of Human Language
               Technologies for Iberian Languages (IberEval 2017). CEUR Workshop Proceedings.
               CEUR-WS.org, Murcia, Spain (2017)
            5. González, J.A., Pla, F., Hurtado, L.F.: ELiRF-UPV at SemEval-2017 Task 4: Sen-
               timent Analysis using Deep Learning. In: Proceedings of the 11th International
               Workshop on Semantic Evaluation. pp. 722–726. SemEval ’17, Association for Com-
               putational Linguistics, Vancouver, Canada (August 2017)
            6. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
               Courville, A.C., Bengio, Y.: Generative adversarial networks. CoRR abs/1406.2661
               (2014), http://arxiv.org/abs/1406.2661
            7. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. CoRR abs/1312.6114
               (2013), http://arxiv.org/abs/1312.6114


                                                                                                                        59
Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017)


            8. Lee, K., Palsetia, D., Narayanan, R., Patwary, M.M.A., Agrawal, A., Choudhary,
               A.: Twitter trending topic classification. In: Proceedings of the 2011 IEEE 11th
               International Conference on Data Mining Workshops. pp. 251–258. ICDMW ’11,
               IEEE Computer Society, Washington, DC, USA (2011), http://dx.doi.org/10.
               1109/ICDMW.2011.171
            9. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
               sentations in vector space. CoRR abs/1301.3781 (2013), http://arxiv.org/abs/
               1301.3781
           10. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed repre-
               sentations of words and phrases and their compositionality. CoRR abs/1310.4546
               (2013), http://arxiv.org/abs/1310.4546
           11. Řehůřek, R., Sojka, P.: Software Framework for Topic Modelling with Large Cor-
               pora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP
               Frameworks. pp. 45–50. ELRA, Valletta, Malta (May 2010), http://is.muni.cz/
               publication/884893/en
           12. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput.
               Surv. 34(1), 1–47 (Mar 2002), http://doi.acm.org/10.1145/505282.505283
           13. Wikipedia: Wikipedia spanish dumps (2017), https://dumps.wikimedia.org/
               eswiki/, [Online; accessed 18-May-2017]


                                                                                                                        60