J. Hlaváčová (Ed.): ITAT 2017 Proceedings, pp. 176–180
CEUR Workshop Proceedings Vol. 1885, ISSN 1613-0073, © 2017 T. Hercig, P. Krejzl, B. Hourová, J. Steinberger, L. Lenc

Detecting Stance in Czech News Commentaries

Tomáš Hercig (1,2), Peter Krejzl (1), Barbora Hourová (1), Josef Steinberger (1), Ladislav Lenc (1,2)

(1) Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia, Univerzitní 8, 306 14 Plzeň, Czech Republic
(2) NTIS – New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Technická 8, 306 14 Plzeň, Czech Republic
nlp.kiv.zcu.cz
{tigi,krejzl,hourova,steinberger,llenc}@kiv.zcu.cz

Abstract: This paper describes our system created to detect stance in online discussions. The goal is to identify whether the author of a comment is in favor of or against a given target. We created an extended corpus of Czech news comments and evaluated a support vector machines classifier, a maximum entropy classifier, and a convolutional neural network.

Keywords: Stance Detection, Opinion Mining, Sentiment Analysis

1 Introduction

Stance detection has been defined as automatically determining from text whether the author is in favor of the given target entity (person, movement, topic, proposition, etc.), against it, or whether neither inference is likely.

Stance detection can be viewed as a subtask of opinion mining, similar to sentiment analysis. In sentiment analysis, systems determine whether a piece of text is positive, negative, or neutral. In stance detection, by contrast, systems predict the author's favorability towards a given target, which may not even be explicitly mentioned in the text. Moreover, a text may express a positive opinion about an entity it mentions while still allowing one to infer that the author is against the defined target (an entity or a topic). Inferring stance towards a target of interest from tweets that express opinion towards another entity has been found difficult [10].

Many applications could benefit from automatic stance detection, including information retrieval, textual entailment, and text summarization, in particular opinion summarization.

We created an extended corpus for stance detection in Czech, evaluated standard top-performing models on this dataset, and report the results.

The rest of this paper is organized as follows. We summarise the related work in Section 2. The creation of the corpus is covered in Section 3. Our approach is described in Section 4. The convolutional neural network architecture is described in Section 5. The evaluation and the discussion of the results are in Section 6, and future work is proposed in Section 7.

2 Related Work

The SemEval-2016 task Detecting Stance in Tweets (http://alt.qcri.org/semeval2016/task6/) [10] had two subtasks: supervised and weakly supervised stance identification. The goal of both subtasks was to classify tweets into three classes (In favor, Against, and Neither). Performance was measured by the macro-averaged F1-score of two classes (In favor and Against). This evaluation measure does not disregard the Neither class, because falsely labelling Neither instances as In favor or Against still affects the scores. We use the same evaluation metric (F1_2), together with accuracy and the macro-averaged F1-score of all three classes (F1_3).

The supervised task (subtask A) tested stance towards five targets: Atheism, Climate Change is a Real Concern, Feminist Movement, Hillary Clinton, and Legalization of Abortion. Participants were provided with 2,814 labeled training tweets for the five targets.

A detailed distribution of stances for each target is given in Table 1. The distribution is not uniform, and there is always a preference towards a certain stance (e.g., 63% of tweets about Atheism are labeled as Against). The distribution reflects the real-world scenario, in which a majority of people tend to take a similar stance. It also depends on the source of the data. For example, in the case of Legalization of Abortion, we can assume that the distribution will be significantly different in religious communities than in atheistic communities.

Table 1: Statistics of the SemEval-2016 task corpora in terms of the number of tweets and stance labels.

Target Entity                 Total  In favor      Against       Neither
Atheism                         733  124 (17%)     464 (63%)     145 (20%)
Climate Change is Concern       564  335 (59%)     26 (5%)       203 (36%)
Feminist Movement               949  268 (28%)     511 (54%)     170 (18%)
Hillary Clinton                 934  157 (17%)     533 (57%)     244 (26%)
Legalization of Abortion        883  151 (17%)     523 (59%)     209 (24%)
All                           4,063  1,035 (25%)   2,057 (51%)   971 (24%)

For the weakly supervised task (subtask B), there were no labeled training data, but participants could use a large number of tweets related to the single target: Donald Trump.

The best results for subtask A were achieved by an advanced baseline using an SVM classifier with word unigrams, bigrams, and trigrams along with character n-grams (2-, 3-, 4-, and 5-grams) as features.

Wei et al. [12] achieved the best result in subtask B and were a close second in subtask A of the SemEval stance detection task. They used a convolutional neural network (CNN) designed according to Kim [4], with the same kernel widths and numbers of filters as proposed by Kim. Pre-trained word2vec embeddings are used to initialize the embedding layer. The main difference from Kim's network is the voting scheme. During each training epoch, several iterations are selected to predict the test set. At the end of each epoch, majority voting is applied to determine the label for each sentence. This is done over a specified number of epochs, and finally the same voting is applied to the results of the individual epochs. The training and test data are separated according to the stance targets.

Initial research on Czech data was done in [7]. The authors collected 1,460 comments from a Czech news server (www.idnes.cz) related to two topics: the Czech president “Miloš Zeman” (181 In favor, 165 Against, and 301 Neither) and “Smoking Ban in Restaurants” (168 In favor, 252 Against, and 393 Neither). The results with the maximum entropy classifier were F1_2 = 0.435 and F1_3 = 0.52 for “Miloš Zeman”, and F1_2 = 0.456 and F1_3 = 0.54 for “Smoking Ban in Restaurants”.

3 Dataset

We extended the dataset from [7], nearly quadrupling its size. The annotation procedure is described in detail (in Czech) in the master's thesis [3]. The whole corpus was annotated by three native speakers. The distribution of stances for each target is given in Table 2.

The “Miloš Zeman” part of the dataset was annotated by one annotator, and 302 of its comments were also labeled by a second annotator to measure inter-annotator agreement. The “Smoking Ban in Restaurants” part of the dataset was independently annotated by two annotators. Conflicts were resolved by a third annotator, and the gold label was then selected by majority voting. Inter-annotator agreement (Cohen's κ) was calculated between two annotators on 2,203 comments. The final κ is 0.579 for “Miloš Zeman” (2,638 comments) and 0.423 for “Smoking Ban in Restaurants” (2,785 comments).

Because the inter-annotator agreement for the target “Smoking Ban in Restaurants” was rather low, we selected as the gold dataset the subset of the “Smoking Ban in Restaurants” part to which the two original annotators assigned the same label (1,388 comments).

The corpus is available for research purposes at http://nlp.kiv.zcu.cz/research/sentiment#stance.

Table 2: Statistics of the Czech corpora in terms of the number of news comments and stance labels.

Target Entity                        Total  In favor     Against       Neither
“Miloš Zeman” – Czech president      2,638  691 (26%)    1,263 (48%)   684 (26%)
“Smoking Ban in Restaurants” – Gold  1,388  272 (20%)    485 (35%)     631 (45%)
“Smoking Ban in Restaurants” – All   2,785  744 (27%)    1,280 (46%)   761 (27%)

4 The Approach Overview

We evaluate common supervised classifiers, namely the maximum entropy classifier and the support vector machines (SVM) classifier from Brainy [6]. We also experimented with models that perform well in sentiment analysis and stance detection, in particular convolutional neural networks. The models were trained separately for each target entity.

4.1 Preprocessing

The same preprocessing was applied to all datasets. We use UDPipe [11] with the Czech Universal Dependencies 1.2 models for tokenization, POS tagging, and lemmatization. Stemming was done with the HPS stemmer [2]. Preliminary experiments showed that lower-casing the data achieves slightly better results, so all experiments are performed on lower-cased data.

4.2 Features

We selected features commonly used in similar natural language processing tasks, e.g., sentiment analysis. The following baseline features were used:

Character n-gram – A separate binary feature for each character n-gram in the text, computed separately for the orders n ∈ {3, 5, 7}. (Note that tokenization separates the characters of an emoticon by spaces; e.g., “:-)” becomes “: - )”.)

Bag of words – Word occurrences in the text.

Bag of adverbs – Bag of adverbs from the text.

Bag of adjectives – Bag of adjectives from the text.

Negative emoticons – We used a list of negative emoticons specific to the news commentaries source (":-(", ";-(", ":-/", "8-o", ";-e", ";-O", "Rv"). The feature captures the presence of such an emoticon in the text.

Word shape – We assign each word to one of 24 classes, similar to the function specified in [1]. We use edu.stanford.nlp.process.WordShapeClassifier [9] with the WORDSHAPECHRIS1 setting.

We experimented with additional features such as n-grams, text length, etc., but these features did not lead to better results. The bag of words, adjectives, and adverbs features use the word lemma or stem. We report results for various feature combinations and perform an ablation study of the best feature set.

5 Convolutional Neural Network

The architecture of the proposed CNN is depicted in Figure 1. We use an architecture similar to the one proposed in [8].

The input layer of the network receives a sequence of word indices from a dictionary. The input vector must have a fixed length, so we pad the input sequence to the maximum text length occurring in the training data, denoted M. A special “PADDING” token is used for this purpose. The embedding layer maps the word indices to real-valued embedding vectors of length L. The convolutional layer consists of N_C kernels containing k × 1 units and uses the rectified linear unit (ReLU) activation function. The convolutional layer is followed by a max-pooling layer and dropout for regularization. The max-pooling layer takes maxima from patches of size (M − k + 1) × 1. The output of the max-pooling layer is fed into a fully-connected layer. It is followed by the output layer, which has 3 neurons (one per class) and a softmax activation function.

In our experimental setup, we use the embedding dimensionality L = 300 and N_C = 40 convolutional kernels with 5 × 1 units. The penultimate fully-connected layer contains 256 neurons. We train the network using the adaptive moment estimation (Adam) optimization algorithm [5] with cross-entropy as the loss function.

Figure 1: Neural network architecture.

6 Results

We used 20-fold cross-validation to evaluate the models, to compensate for the small size of the datasets and to prevent overfitting.

For all experiments, we report the macro-averaged F1-score of two classes, F1_2 (In favor and Against), which is the official metric of the SemEval-2016 stance detection task [10], accuracy, and the macro-averaged F1-score of all three classes (F1_3).

Table 3 shows the results for each dataset. CNN-1 is described in Section 5, and CNN-2 is the architecture proposed in [4]. On average, we achieved the best results with the maximum entropy classifier using the feature set consisting of lemma unigrams, word shape, bag of adjectives, bag of adverbs, and character n-grams (n ∈ {3, 5, 7}). We further performed an ablation study of this feature combination. In Table 3, the bold numbers denote the five best results for the given column, and in the ablation study they denote features with no gain in the given column (i.e., feature sets with no loss).

Both CNNs achieved good results. CNN-2 was slightly better, which is not surprising, as it was designed for sentiment analysis, while CNN-1 was previously used for document classification. Surprisingly, the stem worked better than the lemma as the word input for both neural networks. The ablation study shows that the word shape, bag of adjectives, and bag of adverbs features present little to no information gain for the classifier, so these features should be discarded or readjusted to better capture the stance in comments. However, the selected feature combination still performed reasonably well.

The best results for the target “Miloš Zeman” were achieved by CNN-2 in terms of accuracy and F1_3, while F1_2 was highest for the maximum entropy classifier with lemma unigrams, word shape, bag of adjectives, and character n-grams. The entity “Smoking Ban in Restaurants” was best assessed by SVM with the selected feature set for all data and by the maximum entropy classifier with the same feature set for the gold dataset.

Character n-grams alone present a strong baseline for this task.

Table 3: Results on Czech stance detection datasets in %. We report accuracy (Acc), the macro-averaged F1-score of two classes (F1_2), and the macro-averaged F1-score of all three classes (F1_3). The feature set consists of lemma unigrams, word shape, bag of adjectives, bag of adverbs, and character n-grams (n ∈ {3, 5, 7}). The bold numbers denote the five best results for the given column, and in the ablation study they denote features with no gain in the given column (i.e., feature sets with no loss).

                                                Zeman               Smoking All         Smoking Gold
Classifier  Features                            F1_3  F1_2  Acc     F1_3  F1_2  Acc     F1_3  F1_2  Acc
SVM         Random Class                        32.7  34.6  33.4    32.4  34.4  33.0    31.2  27.2  32.2
SVM         Majority Class                      21.6  32.4  47.9    21.0  31.5  46.0    20.8   0.0  45.5
CNN-1       lemma                               48.6  52.1  51.9    51.4  54.2  54.2    61.2  55.6  65.1
CNN-1       stem                                50.7  55.3  54.5    51.7  54.6  54.5    60.6  54.8  64.8
CNN-2       lemma                               48.3  51.7  51.3    51.8  54.9  54.5    61.2  55.9  64.8
CNN-2       stem                                51.3  55.7  54.9    52.1  54.9  54.6    61.7  56.4  65.5
MaxEnt      lemma                               47.7  51.8  50.2    48.8  52.3  50.9    58.1  52.2  61.6
SVM         lemma                               46.7  52.0  50.7    50.4  55.3  53.8    60.1  54.5  63.5
MaxEnt      stem                                47.2  50.9  49.5    49.5  52.5  51.8    58.3  52.2  62.2
SVM         stem                                48.3  52.8  51.8    51.5  55.3  54.2    57.3  52.4  60.6
MaxEnt      char. n-gram 3,5,7                  50.4  55.7  53.7    50.3  54.9  53.1    61.6  56.8  65.0
SVM         char. n-gram 3,5,7                  47.4  53.4  52.2    51.3  57.2  54.9    57.6  53.2  60.8
MaxEnt      shape                               45.0  50.2  48.4    45.7  50.2  47.9    53.9  48.6  57.0
SVM         shape                               45.5  50.3  49.7    48.1  52.0  50.6    56.5  50.8  60.7
MaxEnt      feature set                         50.6  56.0  53.9    51.9  55.8  54.7    62.6  57.5  66.5
SVM         feature set                         47.9  54.3  52.7    52.6  58.2  56.0    59.8  55.3  62.9
MaxEnt      feature set + emoticons             50.5  56.0  53.9    51.6  55.7  54.2    62.7  57.7  66.4
SVM         feature set + emoticons             47.3  53.3  51.9    52.3  58.1  55.6    61.0  56.8  63.5
MaxEnt      feature set + emoticons + stem      50.7  56.0  53.9    51.9  55.5  54.5    62.6  57.6  66.3
SVM         feature set + emoticons + stem      47.7  53.5  52.2    51.6  57.4  55.0    60.6  55.6  64.1
MaxEnt      feature set - shape                 50.8  56.0  54.0    51.6  56.1  54.4    63.0  58.3  66.5
MaxEnt      feature set - bag of adj.           50.7  56.1  54.0    51.8  55.4  54.4    62.7  57.7  66.4
MaxEnt      feature set - bag of adv.           50.9  56.4  54.3    51.8  55.4  54.6    62.6  57.4  66.5
MaxEnt      feature set - lemma                 50.2  55.6  53.6    50.8  55.1  53.6    62.4  57.3  66.2
MaxEnt      feature set - char. n-gram 3,5,7    46.9  51.7  49.9    48.6  52.3  50.9    58.1  52.3  61.8

7 Conclusion

This paper describes our system created to detect stance in online discussions. We evaluated top-performing models used for sentiment analysis and stance detection. We conducted a feature ablation study and concluded that several features still need to be readjusted for this task.

The features we used are very common in natural language processing; however, even in the SemEval-2016 stance detection task, the best results were achieved with commonly used features. This suggests that stance detection is still in its infancy and that more gain can be expected in the future as researchers better understand this new task.

In future work, we plan to extend the dataset to other domains and to include more target entities and comments, which will let us draw stronger conclusions and move the task closer to industrial expectations. Given the vast amounts of news comments related to highly discussed topics, we will study stance summarization, which should aim at identifying the most important arguments. Another interesting experiment would be supplementing the dataset with sentiment annotation.

Acknowledgments

This publication was supported by the project LO1506 of the Czech Ministry of Education, Youth and Sports under the program NPU I, by Grant No. SGS-2016-018 Data and Software Engineering for Advanced Applications, and by the project MediaGist, EU's FP7 People Programme (Marie Curie Actions), no. 630786.

References

[1] Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 194–201. Association for Computational Linguistics, 1997.

[2] Tomáš Brychcín and Miloslav Konopík. HPS: High precision stemmer. Information Processing & Management, 51(1):68–91, 2015.

[3] Barbora Hourová. Automatic detection of argumentation. Master's thesis, University of West Bohemia, Faculty of Applied Sciences, 2017.

[4] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/D14-1181.

[5] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[6] Michal Konkol. Brainy: A machine learning library. In Leszek Rutkowski, Marcin Korytkowski, Rafal Scherer, Ryszard Tadeusiewicz, Lotfi Zadeh, and Jacek Zurada, editors, Artificial Intelligence and Soft Computing, volume 8468 of Lecture Notes in Computer Science, pages 490–499. Springer International Publishing, 2014. ISBN 978-3-319-07175-6.

[7] Peter Krejzl, Barbora Hourová, and Josef Steinberger. Stance detection in online discussions. In Mária Bieliková and Ivan Srba, editors, WIKT & DaZ 2016 11th Workshop on Intelligent and Knowledge Oriented Technologies 35th Conference on Data and Knowledge, pages 211–214. Vydatel'stvo STU, Vazovova 5, Bratislava, Slovakia, November 2016. ISBN 978-80-227-4619-9.

[8] Ladislav Lenc and Pavel Král. Deep neural networks for Czech multi-label document classification. CoRR, abs/1701.03849, 2017. URL http://arxiv.org/abs/1701.03849.

[9] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014. URL http://www.aclweb.org/anthology/P/P14/P14-5010.

[10] Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. SemEval-2016 task 6: Detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 31–41, San Diego, California, June 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S16-1003.

[11] Milan Straka, Jan Hajič, and Jana Straková. UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Paris, France, May 2016. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1.

[12] Wan Wei, Xiao Zhang, Xuqin Liu, Wei Chen, and Tengjiao Wang. pkudblab at SemEval-2016 task 6: A specific convolutional neural network system for effective stance detection. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 384–388, San Diego, California, June 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S16-1062.
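The evaluation measures of Sections 2 and 6 (F1_2, F1_3, and accuracy) can be illustrated with a minimal Python sketch. It is not the evaluation code used for the experiments. It assumes the standard per-class definition F1 = 2TP / (2TP + FP + FN), with F1 = 0 for a class that never occurs and is never predicted, and uses hypothetical label names "favor", "against", and "neither". As a sanity check, scoring an always-predict-the-majority-class baseline against the full label counts of Table 2 gives values that agree, after rounding, with the Majority Class rows of Table 3. Note that those rows were obtained with 20-fold cross-validation, not with this whole-corpus computation.

```python
from typing import Dict, List, Sequence, Tuple

LABELS = ("favor", "against", "neither")  # hypothetical label names


def class_f1(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """Standard F1 for one class: 2TP / (2TP + FP + FN)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def stance_scores(gold: Sequence[str], pred: Sequence[str]) -> Tuple[float, float, float]:
    """Return (F1_2, F1_3, accuracy) for three-way stance labels."""
    f1 = {label: class_f1(gold, pred, label) for label in LABELS}
    # Neither is not averaged into F1_2, but Neither items predicted as
    # In favor/Against still count as false positives of those classes.
    f1_2 = (f1["favor"] + f1["against"]) / 2
    f1_3 = sum(f1.values()) / 3
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    return f1_2, f1_3, accuracy


def majority_baseline(counts: Dict[str, int]) -> Tuple[float, float, float]:
    """Score a classifier that always predicts the most frequent label."""
    gold: List[str] = [label for label in LABELS for _ in range(counts[label])]
    majority = max(counts, key=counts.get)
    return stance_scores(gold, [majority] * len(gold))


# Label counts for "Miloš Zeman" taken from Table 2
zeman = {"favor": 691, "against": 1263, "neither": 684}
f1_2, f1_3, acc = majority_baseline(zeman)
print(f"F1_2 = {100 * f1_2:.1f}, F1_3 = {100 * f1_3:.1f}, Acc = {100 * acc:.1f}")
# prints: F1_2 = 32.4, F1_3 = 21.6, Acc = 47.9
```

For the gold “Smoking Ban in Restaurants” data the majority class is Neither, so the same sketch yields F1_2 = 0.0, which is why that cell of Table 3 is zero.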
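The agreement figures of Section 3 use Cohen's κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement between two annotators and p_e is the agreement expected by chance from each annotator's label distribution. The minimal Python sketch below computes κ and also extracts the items on which both annotators agree, mirroring how the “Smoking Ban in Restaurants” gold dataset was selected. The function names and toy labels are hypothetical, and the choice to raise an error in the degenerate case p_e = 1 is an assumption of this sketch.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # chance agreement: product of the two annotators' label frequencies
    p_e = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label everywhere; kappa is 0/0.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


def agreed_subset(
    items: Sequence[str], a: Sequence[Hashable], b: Sequence[Hashable]
) -> List[Tuple[str, Hashable]]:
    """Keep only the items on which both annotators chose the same label."""
    return [(item, x) for item, x, y in zip(items, a, b) if x == y]


ann1 = ["favor", "against", "neither", "against"]
ann2 = ["favor", "against", "against", "against"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.556
print(agreed_subset(["c1", "c2", "c3", "c4"], ann1, ann2))
# [('c1', 'favor'), ('c2', 'against'), ('c4', 'against')]
```

In the toy example the annotators agree on three of four items (p_o = 0.75), but chance agreement is already 7/16, so κ = 5/9 ≈ 0.556 is well below the raw agreement rate.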
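The character n-gram feature of Section 4.2 (a binary feature for each distinct character n-gram, for n ∈ {3, 5, 7}) can be sketched in Python as follows. This is an illustration, not the Brainy [6] implementation. It assumes the tokens are joined with single spaces and lower-cased, so an n-gram may cross a token boundary; the "n:" prefix on feature names is a hypothetical naming choice that only records which order produced the feature.

```python
from typing import Iterable, Sequence, Set


def char_ngram_features(tokens: Iterable[str], orders: Sequence[int] = (3, 5, 7)) -> Set[str]:
    """Binary character n-gram features: the set of n-grams present in the text.

    Assumption: tokens are joined with single spaces and lower-cased, so an
    n-gram may span a token boundary. Each feature name is prefixed with its
    order n (hypothetical naming).
    """
    text = " ".join(tokens).lower()
    features: Set[str] = set()
    for n in orders:
        # range() is empty when the text is shorter than n
        for i in range(len(text) - n + 1):
            features.add(f"{n}:{text[i:i + n]}")
    return features


print(sorted(char_ngram_features(["Ahoj"])))  # ['3:aho', '3:hoj']
print(sorted(char_ngram_features([":", "-", ")"])))
```

The second call mimics the tokenized emoticon “: - )” mentioned in Section 4.2: because the spaces become part of the text, the emoticon yields the 3-grams “: -”, “ - ”, “- )” and the single 5-gram “: - )”, rather than the n-grams of “:-)”.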
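The two-level voting scheme of Wei et al. [12], summarized in Section 2, can be sketched as follows: majority voting over the selected iterations within each epoch, then the same voting over the per-epoch results. This is an illustration under stated assumptions, not their implementation. The nested-list input layout is hypothetical, and ties are broken in favor of the label encountered first, which is how collections.Counter.most_common orders equal counts.

```python
from collections import Counter
from typing import List, Sequence


def majority(labels: Sequence[str]) -> str:
    """Most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]


def two_level_vote(predictions: List[List[List[str]]]) -> List[str]:
    """Vote within each epoch over its selected iterations, then across epochs.

    predictions[e][i][s] is the label predicted for test sentence s by the
    i-th selected iteration of epoch e (hypothetical layout).
    """
    per_epoch = [
        [majority([iteration[s] for iteration in epoch]) for s in range(len(epoch[0]))]
        for epoch in predictions
    ]
    return [majority([epoch[s] for epoch in per_epoch]) for s in range(len(per_epoch[0]))]


# Three epochs, three selected iterations each, two test sentences
preds = [
    [["favor", "against"], ["favor", "neither"], ["against", "neither"]],
    [["against", "against"], ["against", "against"], ["favor", "neither"]],
    [["favor", "neither"], ["favor", "neither"], ["favor", "neither"]],
]
print(two_level_vote(preds))  # ['favor', 'neither']
```

Here the second epoch votes Against for both sentences, but the other two epochs outvote it, so the final labels follow the majority of per-epoch decisions rather than any single snapshot of the network.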