<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural models for StanceCat shared task at IberEval 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Ambrosini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giancarlo Nicolo</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>210</fpage>
      <lpage>216</lpage>
      <abstract>
        <p>This paper describes our participation in the Stance and Gender Detection in Tweets on Catalan Independence (StanceCat) task at IberEval 2017. Our approach focused on neural models: we first used classical, task-specific models from the state of the art, and then introduced a new convolutional network topology for text classification. The rise of social networks as a worldwide means of communication and expression is attracting a lot of interest from companies and academia, due to the huge amount of content published daily by users. From the academic perspective, especially in the Natural Language Processing field, the content available in the form of written text is very useful for the study of specific open problems; stance detection related to political events is one example, and the Stance and Gender Detection in Tweets on Catalan Independence (StanceCat) task at IberEval 2017 is a concrete application. In StanceCat, the principal aim is to automatically detect whether the author of a text is in favor of, against, or neutral towards Catalan independence. As a secondary aim, participants are asked to infer the author's gender. To tackle this problem we built a stance-and-gender detection system decomposed into two main modules: text pre-processing and classification model. During the system's tuning process, different design choices were explored to find the best combination of modules, and some interesting insights can be drawn from their analysis. In the following sections we first describe the StanceCat task (Section 2), then we illustrate the design of the modules of the developed stance-and-gender detection system (Section 3), after that we analyse the evaluation of the tuning process for the submitted systems (Section 4), and finally we outline conclusions over the whole work (Section 5).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Regarding the text pre-processing, it has to be mentioned that the corpus under observation cannot be treated as proper written language: computer-mediated communication (CMC) is highly informal, affecting diamesic variation (i.e., variation in a language across media of communication, e.g., Spanish over the phone versus Spanish over email) and creating new items that pertain to the lexical and graphematic domains [
        <xref ref-type="bibr" rid="ref7 ref8">7,8</xref>
        ]. Therefore, our pre-processing follows two approaches: classic and microblogging-related. As classic approach we used stemming (ST), stopword removal (SW) and punctuation removal (PR). For the microblogging approach we focused on the following items: (i) mentions (MT), (ii) smileys (SM), (iii) emoji (EM), (iv) hashtags (HT), (v) numbers (NUM), (vi) URLs (URL), and (vii) Twitter reserved words such as RT and FAV (RW). Each of these items can either be removed or substituted by a constant string. We implemented both approaches using the following tools: (i) NLTK [
(i) NLTK [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and (ii) Preprocessor, a pre-processing library for tweet data written in Python (https://github.com/s/preprocessor).
      </p>
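      <p>To make the microblogging approach concrete, the following sketch shows how such a module can be implemented with simple regular expressions. This is only an assumption of how the step could look, not the exact implementation; the item names follow the abbreviations above, and the regexes and the placeholder format are hypothetical.</p>
      <preformat>
```python
import re

# Hypothetical patterns for the microblogging items (the exact
# regexes and the placeholder format are assumptions).
PATTERNS = {
    "MT": re.compile(r"@\w+"),             # mentions
    "HT": re.compile(r"#\w+"),             # hashtags
    "URL": re.compile(r"https?://\S+"),    # URLs
    "NUM": re.compile(r"\b\d+\b"),         # numbers
    "RW": re.compile(r"\b(?:RT|FAV)\b"),   # Twitter reserved words
}

def preprocess(tweet, substitute=("MT", "HT", "NUM"), remove=("URL", "RW")):
    """Remove some items, substitute the others by a constant string."""
    for item in remove:
        tweet = PATTERNS[item].sub("", tweet)
    for item in substitute:
        tweet = PATTERNS[item].sub(f"_{item}_", tweet)
    return re.sub(r"\s+", " ", tweet).strip()
```
      </preformat>
      <p>For example, preprocess("RT @user check https://t.co/x #indy 123") yields "_MT_ check _HT_ _NUM_": the reserved word and the URL are removed, while the mention, hashtag and number are replaced by constant strings.</p>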
      <sec id="sec-1-1">
        <title>Classification models</title>
        <p>In the following we describe the neural models used for the classification module. Before introducing the models, we describe the specific text representation used as input layer (i.e., the sentence-matrix).</p>
        <p>
          Text representation. To represent the text we used word embeddings as described in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
          ], where words are represented as vectors of real numbers with fixed dimension |v|. In this way a whole sentence s, whose length |s| is its number of words, is represented as a sentence-matrix M of dimension |M| = |s| × |v|. |M| has to be fixed a priori, therefore |s| and |v| have to be estimated. |v| was fixed to 300 following [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. |s| was estimated by analyzing Table 2; in detail, we decided to fix it as the average length plus one standard deviation (i.e., |s| = 17 for both languages). With this choice, input sentences longer than |s| are truncated, while shorter ones are padded with null vectors (i.e., vectors of all zeros).
        </p>
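        <p>The truncation-and-padding step above can be sketched as follows. This is a minimal illustration under the stated choices (|s| = 17, |v| = 300); the handling of out-of-vocabulary words as null vectors is an assumption.</p>
        <preformat>
```python
import numpy as np

EMB_DIM = 300   # |v|, following the pre-trained embeddings
MAX_LEN = 17    # |s|, average tweet length plus one standard deviation

def sentence_matrix(tokens, embeddings):
    """Build the |s| x |v| sentence-matrix: truncate long sentences,
    pad short ones with null (all-zero) vectors."""
    M = np.zeros((MAX_LEN, EMB_DIM))
    for i, tok in enumerate(tokens[:MAX_LEN]):
        # out-of-vocabulary words stay as null vectors (an assumption)
        if tok in embeddings:
            M[i] = embeddings[tok]
    return M
```
        </preformat>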
        <p>Choosing words as the elements to be mapped by the embedding function raises some challenges for the function estimation, related to data availability. In our case the available corpus is very small, and embeddings estimated on it could lead to low performance. To solve this problem, we decided to use pre-trained embeddings estimated over Wikipedia with a particular approach called fastText [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
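        <p>The pre-trained fastText vectors are distributed in a plain-text format: a header line with the vocabulary size and dimension, then one word per line followed by its components. A minimal loader might look like the following sketch (the function name and the limit parameter are our own; real vocabularies are large, so limiting the number of loaded rows is a practical assumption).</p>
        <preformat>
```python
import numpy as np

def load_vec(path, limit=None):
    """Parse the textual .vec format: a "count dim" header line,
    then one word per line followed by its vector components."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings
```
        </preformat>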
        <p>Convolutional Neural Network. Convolutional Neural Networks (CNNs) are considered state of the art in many text classification problems. Therefore, we decided to use them in a simple architecture composed of a convolutional layer, followed by a Global Max Pooling layer and two dense layers.</p>
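        <p>The two core operations of this architecture can be illustrated in plain numpy. This is only a sketch of the computation, not the trained model: the filter weights here are arbitrary, and in practice standard deep learning layers are used.</p>
        <preformat>
```python
import numpy as np

def conv1d(M, W):
    """Valid 1-D convolution over the sentence axis: M is the |s| x |v|
    sentence-matrix, W a k x |v| filter spanning the full embedding
    width, so the filter slides over word positions only."""
    k = W.shape[0]
    return np.array([np.sum(M[i:i + k] * W)
                     for i in range(M.shape[0] - k + 1)])

def global_max_pool(feature_map):
    """Global Max Pooling keeps only the strongest activation of a
    filter, independent of where in the sentence it occurred."""
    return feature_map.max()
```
        </preformat>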
        <p>
          Dilated KIM. This model is our new CNN topology. It can be seen as an extension of Kim's model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] using the dilation idea from the computer graphics field [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>The original Kim's model is a particular CNN whose convolutional layer has multiple filter widths and feature maps. The complete architecture is illustrated in Figure 1: the input layer (i.e., the sentence-matrix) is processed by a convolutional layer with multiple filters of different widths, the result of each filter is fed into a Max Pooling layer, and finally their concatenation (previously flattened to be dimensionally coherent) is projected into a dense layer. Our extension is to use dilated filters in combination with normal ones: the intuition is that normal filters capture features of adjacent words, while dilated ones are able to capture relations between non-adjacent words. This behaviour cannot be achieved by the original Kim's model because, even though the filter sizes can be changed, the filters will only capture features from adjacent words.</p>
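        <p>The effect of dilation can be sketched in numpy as follows: the k taps of the filter are spaced d positions apart, so a small filter relates non-adjacent words. This is an illustration of the mechanism only, under our own naming; the actual layers come from a deep learning framework.</p>
        <preformat>
```python
import numpy as np

def dilated_conv1d(M, W, d=1):
    """1-D convolution whose k taps are spaced d positions apart
    (d = 1 is an ordinary convolution).  With d greater than 1 a small
    filter captures relations between non-adjacent words."""
    k = W.shape[0]
    span = (k - 1) * d + 1           # receptive field over the sentence
    return np.array([
        np.sum(M[i:i + span:d] * W)
        for i in range(M.shape[0] - span + 1)
    ])
```
        </preformat>
        <p>With a filter of width 2 and dilation 2, each output sums a word with the word two positions away, skipping the one in between.</p>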
        <p>
          Regarding the architectural references in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], the number of filters |f| and their dimensions (k, d), where k is the kernel size and d the dilation unit, were optimized, leading to the following results: |f| = 5; f1 = (2 × |v|, 0); f2 = (2 × |v|, 3); f3 = (3 × |v|, 1); f4 = (5 × |v|, 1); f5 = (7 × |v|, 1).
        </p>
        <p>
          Recurrent neural networks. Long Short-Term Memory (LSTM) and Bidirectional LSTM networks are types of Recurrent Neural Network (RNN) that aim at capturing dynamic temporal behaviour. This suggested using them for stance detection; in particular, we used straightforward architectures made of an embedded input layer followed by an LSTM layer of 128 units and terminated by a dense layer, for both the normal and the bidirectional models.
        </p>
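        <p>For reference, one step of the LSTM recurrence can be written out in numpy as below. This is a minimal textbook sketch of the cell update, not our training code (which used standard framework layers); the stacking convention for the gate weights is an assumption.</p>
        <preformat>
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step with hidden size n: W stacks the input, forget,
    cell and output projections as a (4n, input_dim) matrix, U the
    recurrent ones as (4n, n), b the biases as (4n,)."""
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # updated cell state
    h_new = sigmoid(o) * np.tanh(c_new)                # emitted hidden state
    return h_new, c_new
```
        </preformat>
        <p>A bidirectional model runs a second LSTM over the reversed word sequence and concatenates the two resulting representations; in our architectures the hidden size is 128 and x is a row of the sentence-matrix.</p>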
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Evaluation</title>
      <p>In this section we illustrate the evaluation of the developed systems with respect to the module design reported in Section 3. First we present the metric proposed by the organizers for system evaluation (Section 4.1), then we outline the empirical results produced by a 10-fold cross-validation over the given data set (Section 4.2), and finally we report our performance at the shared task (Section 4.3).</p>
      <sec id="sec-2-1">
        <title>Metrics</title>
        <p>System evaluation metrics were given by the organizers and are reported here in the following equations. The organizers chose a macro F1 measure for stance detection, due to class imbalance, and categorical accuracy for gender detection.</p>
        <p>Gender = accuracy = (Σ TP + Σ TN) / Σ samples (1)</p>
        <p>Stance = [F1_macro(Favor) + F1_macro(Against)] / 2 (2)</p>
        <p>F1 = 2 · (precision · recall) / (precision + recall) (3)</p>
        <p>precision = (1 / |L|) Σ_{l ∈ L} Pr(y_l, ŷ_l) (4)</p>
        <p>recall = (1 / |L|) Σ_{l ∈ L} R(y_l, ŷ_l) (5)</p>
        <p>where L is the set of classes, y_l is the set of correct labels and ŷ_l is the set of predicted labels for class l, and Pr and R denote per-class precision and recall.</p>
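        <p>The stance metric can be sketched as a plain-Python re-implementation, assuming the usual per-class F1 and a FAVOR/AGAINST/NONE label set; the official evaluation script remains authoritative. Note that the NONE class contributes only through false positives of the other two classes.</p>
        <preformat>
```python
def f1(labels, preds, cls):
    """Per-class F1 from true positives, false positives, false negatives."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def stance_score(labels, preds):
    """Stance metric: average of the F1 of FAVOR and AGAINST only."""
    return (f1(labels, preds, "FAVOR") + f1(labels, preds, "AGAINST")) / 2
```
        </preformat>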
      </sec>
      <sec id="sec-2-2">
        <title>Fine tuning process</title>
        <p>In the following, we describe the fine-tuning process of our proposed model over the possible combinations of pre-processing (Table 3); then we compare Kim's model against our extension (Table 4), and finally we report the improvement obtained with a data augmentation technique (Table 5). For brevity, only the evaluation of the Dilated Kim model on Spanish stance detection is reported; in detail, the results are calculated by averaging three runs of a 10-fold cross-validation over the complete data set. The results obtained after the fine-tuning process for all the models are nevertheless reported in Section 4.3, where their development performance is compared against that obtained in the StanceCat task.</p>
        <p>The notation used in Table 3 follows the one introduced in Section 3.1, where listing a notation means it was applied for the reported result. Regarding the tweet-specific pre-processing, all items were substituted by constant strings, with the exception of URL and RW, which were removed. We report the contribution of each analysed pre-processing step alone:</p>
        <p>- Most of the commonly used pre-processing steps decrease the model's performance, meaning that their information can be directly exploited by the model.</p>
        <p>- Only stemming and mention removal brought small improvements; they are therefore used as the best tuning for our proposed model.</p>
        <p>Table 4 reports, for Kim's model and our Dilated Kim, the scores on stance (ES, CA) and gender (ES, CA): Kim obtains 0.624 (±0.017), 0.630 (±0.022), 0.634 (±0.011) and 0.655 (±0.017), while Dilated Kim obtains 0.658 (±0.039), 0.659 (±0.028), 0.652 (±0.013) and 0.715 (±0.015).</p>
        <p>From the analysis of Table 4 we can see a significant performance improvement of our proposed model with respect to the original Kim's model.</p>
        <p>Since our development data set has few samples, to train our models we decided to apply a data augmentation technique that does not rely on external data but rather exploits the word-embedding text representation. In detail, we applied Gaussian noise to the word embeddings and after every convolutional layer and, to further improve performance, we took advantage of batch normalization. The best results were achieved with additive zero-centered noise with a standard deviation of 0.2. Regarding the batch normalization layers, the default Keras parameters were used. The results of this technique with respect to the Dilated Kim model are reported in Table 5.</p>
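        <p>The embedding-level part of this augmentation can be sketched as follows: additive zero-centred Gaussian noise with the standard deviation reported above. This is a minimal numpy illustration of the idea (in the actual models the noise and batch normalization are framework layers applied during training only).</p>
        <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def augment(sentence_matrix, std=0.2):
    """Additive zero-centred Gaussian noise on the word embeddings:
    every training pass sees a slightly perturbed version of each
    sentence, augmenting the data without any external resource."""
    return sentence_matrix + rng.normal(0.0, std, sentence_matrix.shape)
```
        </preformat>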
      </sec>
      <sec id="sec-2-3">
        <title>Competition results</title>
        <p>For the system submission, participants were allowed to send more than one model, up to a maximum of 5 runs; therefore in Tables 6 and 7 we report our best performing systems (tuned following the process in Section 4.2) for the StanceCat shared task.</p>
        <p>Unfortunately, due to a submission error caught only after the official results were published, our runs were not properly evaluated (in Tables 6 and 7 the submitted models are shown with their names in parentheses and an asterisk). After the task closed we therefore asked the organizers to evaluate some of our models to see how they would have performed (models with test results in bold).</p>
        <p>In this paper we have presented our participation in the Stance and Gender Detection in Tweets on Catalan Independence (StanceCat) task at IberEval 2017. Five distinct neural models were explored, in combination with different types of pre-processing. From the fine-tuning process we found that most well-known pre-processing techniques are strongly model dependent, meaning that the pre-processing pipeline has to be optimized for each classifier. Finally, our proposed dilation technique for NLP tasks, the Dilated Kim model, seems to improve the performance of CNN-based classifiers.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Taule</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mart</surname>
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bosco</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            <given-names>V</given-names>
          </string-name>
          .
          <article-title>Overview of the task of Stance and Gender Detection in Tweets on Catalan Independence at IBEREVAL 2017</article-title>
          .
          <source>In: Notebook Papers of Workshop on SEPLN 2nd Workshop on Evaluation of Human Language Technologies for Iberian Languages, (IBEREVAL)</source>
          , Murcia, Spain,
          <source>September 19, CEUR Workshop Proceedings. CEUR-WS.org</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Kim</surname>
          </string-name>
          , Yoon.
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <surname>Armand</surname>
          </string-name>
          , et al.
          <article-title>Bag of tricks for efficient text classification</article-title>
          .
          <source>arXiv preprint arXiv:1607.01759</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>NLTK: the Natural Language Toolkit</article-title>
          .
          <source>In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics - Volume 1 (ETMTNLP '02)</source>
          , Vol.
          <volume>1</volume>
          . Association for Computational Linguistics, Stroudsburg, PA, USA,
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          . DOI=http://dx.doi.org/10.3115/1118108.1118117
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bojanowski</surname>
          </string-name>
          , Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas.
          <source>Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Harris</surname>
          </string-name>
          , Zellig S. \
          <source>Distributional structure." Word 10</source>
          .
          <fpage>2</fpage>
          -
          <lpage>3</lpage>
          (
          <year>1954</year>
          ):
          <fpage>146</fpage>
          -
          <lpage>162</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bazzanella</surname>
          </string-name>
          , Carla.
          <article-title>Oscillazioni di informalità e formalità: scritto, parlato e rete. In: Formale e informale. La variazione di registro nella comunicazione elettronica</article-title>
          . Roma: Carocci (
          <year>2011</year>
          ):
          <fpage>68</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cerruti</surname>
          </string-name>
          , Massimo, and Cristina Onesti.
          <article-title>Netspeak: a language variety? Some remarks from an Italian sociolinguistic perspective. In: Languages go web: Standard and non-standard languages on the Internet</article-title>
          (
          <year>2013</year>
          ):
          <fpage>23</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Zhang, Ye, and Byron Wallace.
          <article-title>A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1510.03820</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Bosco</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rangel</surname>
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Tweeting in the Debate about Catalan Elections</article-title>
          .
          <source>In: Proc. LREC workshop on Emotion and Sentiment Analysis Workshop (ESA)</source>
          , LREC2016, Portoroz, Slovenia, May
          <volume>23</volume>
          -
          <issue>28</issue>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Verhoeven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Daelemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Overview of the 4th Author Profiling Task at PAN 2016: Cross-Genre Evaluations</article-title>
          . In:
          <string-name>
            <surname>Balog</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cappellato</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            <given-names>C</given-names>
          </string-name>
          . (Eds.)
          <article-title>CLEF 2016 Labs and Workshops, Notebook Papers</article-title>
          .
          <source>CEUR Workshop Proceedings. CEUR-WS.org</source>
          , vol.
          <volume>1609</volume>
          , pp.
          <fpage>750</fpage>
          -
          <lpage>784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Mohammad</surname>
          </string-name>
          ,
          <string-name>
            <surname>Saif</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Parinaz</given-names>
            <surname>Sobhani</surname>
          </string-name>
          , and Svetlana Kiritchenko.
          <source>Stance and sentiment in tweets. arXiv preprint arXiv:1605.01655</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mohammad</surname>
          </string-name>
          ,
          <string-name>
            <surname>Saif</surname>
            <given-names>M.</given-names>
          </string-name>
          , et al.
          <article-title>Semeval-2016 task 6: Detecting stance in tweets.</article-title>
          <source>Proceedings of SemEval 16</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Yu</surname>
            <given-names>Fisher</given-names>
          </string-name>
          and Vladlen Koltun.
          <article-title>Multi-Scale Context Aggregation by Dilated Convolutions</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>