Overview of CLEF 2019 Lab ProtestNews: Extracting Protests from News in a Cross-context Setting

Ali Hürriyetoğlu, Erdem Yörük, Deniz Yüret, Çağrı Yoltar, Burak Gürel, Fırat Duruşan, Osman Mutlu, and Arda Akdemir
Koc University, Istanbul 34450, Turkey
{ahurriyetoglu,eryoruk,dyuret,cyoltar,bgurel,fdurusan,omutlu,aakdemir}@ku.edu.tr
http://www.ku.edu.tr

Abstract. We present an overview of the CLEF-2019 Lab ProtestNews on Extracting Protests from News in the context of generalizable natural language processing. The lab consists of document, sentence, and token level information classification and extraction tasks, which are referred to as task 1, task 2, and task 3 respectively in the scope of this lab. The tasks required the participants to identify protest-relevant information from English local news at one or more of the aforementioned levels in a cross-context setting, which is cross-country in the scope of this lab. The training and development data were collected from India, and the test data were collected from India and China. The lab attracted 58 teams; 12 of these teams submitted results and 9 submitted working notes. We observed that neural networks yield the best results and that performance drops significantly for the majority of the submissions in the cross-country setting, i.e., on data from China.

Keywords: natural language processing · information retrieval · machine learning · text classification · information extraction · event extraction · computational social science · generalizability

1 Introduction

We describe a realization of our task set proposal [4] in the scope of the CLEF-2019 Lab ProtestNews.1,2 The task set aims at facilitating the development of generalizable natural language processing (NLP) tools that are robust in a cross-context setting, which is cross-country in this lab. Since the performance of NLP tools drops significantly in a context different from the one in which they are created and validated [1, 2, 6], measuring and improving the state-of-the-art NLP tool development methodology is the primary aim of our efforts.

Comparative social and political science studies rely on protest information to analyze cross-country similarities and differences and the effects of these actions. Therefore our lab focuses on classifying and extracting protest event information in English local news articles from India and China. We believe our efforts will contribute to enhancing the methodologies applied to collect data for these studies. This need is motivated by recent results showing that NLP tools for text classification and information extraction have not been satisfactory against the requirements of longer time coverage and of working on data from multiple countries [7, 3].

This first iteration of our lab attracted 58 teams from all around the world. 12 of these teams submitted their results to one or more tasks on the CodaLab page of the lab.3 9 teams described their approach in a working note. We introduce the task set we tackle, the corpus we have been creating, and the evaluation methodology in Sections 2, 3, and 4 respectively. We report the results in Section 5 and conclude our report in Section 6.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.
1 http://clef2019.clef-initiative.eu/
2 https://emw.ku.edu.tr/clef-protestnews-2019/
2 Task Set

The lab consists of the tasks document classification, event sentence detection, and event extraction, which are referred to as task 1, task 2, and task 3 respectively, as demonstrated in Figure 1. The document classification task, task 1, requires predicting whether a news article reports at least one protest event that has happened or is happening. It is a binary classification task in which each news article should be labeled as either 1 (positive) or 0 (negative). The sentences that contain any event trigger should be identified in task 2, the event sentence detection task. Sentence labels are 0 and 1 as well. This task can be handled either as a classification or as an extraction task, since we provide the order of the sentences in their respective articles. Finally, the event triggers and the event information, which are place, facility, time, organizer, participant, and target, should be extracted in task 3. This ordering of the tasks provides a controlled setting that enables error analysis and optimization during the annotation and tool development efforts. Moreover, this design enables analyzing the individual steps of the analysis, which contributes to the explainability of the automated tool predictions.

Fig. 1. The lab consists of a) Task 1: Document classification, b) Task 2: Event sentence detection, and c) Task 3: Event extraction.

3 Data

We provide the number of instances for each task in Table 1 in terms of training, development, test 1, and test 2 data. The training and development data were collected from online local English news from India. Test 1 and test 2 data refer to data from India and China respectively.

3 https://competitions.codalab.org/competitions/22349

Table 1. Number of instances for each task

          Training  Development  Test 1 (India)  Test 2 (China)
Task 1    3,429     456          686             1,800
Task 2    5,884     662          1,106           1,234
Task 3    250       36           80              39

A sample from task 1 contains the news article's text, its URL, and its label as assigned by the annotation team. For task 2, a sample contains the sentences, their labels, their order in the article they belong to, and the URL of their article. The release format of the data for task 2 enables participants to treat this task either as classification of individual sentences or as extraction of event-relevant sentences from a document. The data for task 3 consists of snippets that contain one or more event triggers that refer to the same event. Multiple sentences may occur in a snippet in case these sentences refer to the same event.4 The tokens in these snippets are annotated using the IOB (inside, outside, beginning) scheme. Examples of the data are provided in Figure 2.

There is no overlap of news articles across tasks. This separation was required in order to prevent data from one task being used to infer the labels of another task without any effort.
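To make the release formats concrete, the following minimal sketch shows one way of reading the task 1 and task 2 samples and of grouping the task 2 sentences back into their articles, so that task 2 can be treated either as sentence classification or as document-level sentence extraction. The file names and the url field are illustrative assumptions (only the text, sentence, sentence_number, and label fields appear in Figure 2), and a JSON-lines layout with one object per line is assumed; if the release is a single JSON array, json.load can be used instead.

import json
from collections import defaultdict

def read_json_lines(path):
    """Read a file with one JSON object per line, as in the samples of Figure 2."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical file names; the actual files are produced by the release toolbox.
task1_samples = read_json_lines("task1_train.json")  # {"text": ..., "url": ..., "label": 0 or 1}
task2_samples = read_json_lines("task2_train.json")  # {"sentence": ..., "sentence_number": ..., "url": ..., "label": 0 or 1}

# Task 2 as classification of individual sentences ...
sentences = [s["sentence"] for s in task2_samples]
labels = [s["label"] for s in task2_samples]

# ... or as extraction of event sentences from whole articles:
# group the sentences by the URL of their article and restore their order.
articles = defaultdict(list)
for s in task2_samples:
    articles[s["url"]].append(s)
for url in articles:
    articles[url].sort(key=lambda s: s["sentence_number"])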
3.1 Distribution

We distributed the data set in a way that does not violate the copyright of the news sources. This involves sharing only the information that is needed to reproduce the corpus from the source for task 1 and task 2, and only the relevant snippets for task 3. We released a Docker image that contains the toolbox5 required to reproduce the news articles on the computer of a participant. The toolbox generates a log of the process that reproduces the data set, and we have requested these log files from the participants.

The toolbox is a pipeline that scrapes HTMLs, converts HTMLs to text, and finally performs a specific filling operation for each of task 1 and task 2. To the best of our knowledge, the toolbox succeeded in enabling participants to create the data set on their computers. Only one participant, from Iran, was not able to download the news articles due to restrictions on accessing online content that are specific to his geolocation.

4 Snippets we share contain information about only a single event.
5 https://github.com/emerging-welfare/ProtestNews-2019

{"text": "... Police suspect that the panchayat members, including the Salwa Judum leader, were abducted and killed by Maoist rebels, who had left the bodies near the village. Meanwhile, security forces and Naxalites had an encounter near village Belgaon. ...", "label": 1}

A sample from the data set for task 1.

{"sentence": "Police suspect that the panchayat members, including the Salwa Judum leader, were abducted and killed by Maoist rebels, who had left the bodies near the village.", "sentence_number": 3, "label": 1},
{"sentence": "Meanwhile, security forces and Naxalites had an encounter near village Belgaon.", "sentence_number": 4, "label": 1}

Corresponding sentence samples for the article above.

including O
the O
Salwa B-target
Judum I-target
leader I-target
, O
were O
abducted B-trigger
and I-trigger
killed I-trigger
by O
Maoist B-participant
rebels I-participant

Annotations in the IOB scheme.

Fig. 2. Data samples for task 1, task 2, and task 3

4 Evaluation Setting

We use macro-averaged F1 for evaluating task 1 and task 2 due to the class imbalance present in our data. The event extraction task, task 3, was evaluated on the average F1 score over all information types, based on the ratio of full matches between the predictions and the annotations in the test sets, using a python implementation6 of the CoNLL 2003 shared task [5] evaluation script.

We performed two levels of evaluation, on data from the source country (Test 1) and from the target country (Test 2). The participants were informed only about the labels of the training and development data from the source country. They did not see the labels of any test set. The number of allowed submissions and the period in which the participants could submit their predictions for Test 1 and Test 2 were restricted in order to limit the possibility of over-fitting on the test data.

We applied three cycles of evaluation. Participants could submit an unlimited number of results without being able to see their score. The scores of their last submitted results were announced at the end of each evaluation cycle. The first and second cycles aimed at providing feedback to the participants. The third and final cycle was the deadline for submitting results.

Finally, we provided a baseline submission for task 1 and task 2 in order to guide the participants. This baseline was based on the predictions of the best scoring machine learning model among Support Vector Machines, Naive Bayes, Rocchio classifier, Ridge Classifier, Perceptron, Passive Aggressive Classifier, Random Forest, K Nearest Neighbors, and Elastic Net on the development set. The best scoring model was a linear support vector machine classifier that was trained using stochastic gradient descent.

6 https://github.com/sighsmile/conlleval
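As an illustration of this kind of baseline, the sketch below trains a linear SVM with stochastic gradient descent on bag-of-words features using scikit-learn and reports the macro-averaged F1 score on the development set. It is only a minimal approximation of the setup described above; the feature extraction and the hyperparameters of the official baseline are not specified here and are assumptions.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def run_baseline(train_texts, train_labels, dev_texts, dev_labels):
    """Train a bag-of-words linear SVM with SGD and score it with macro F1."""
    # hinge loss makes SGDClassifier a linear SVM trained with stochastic gradient descent
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        SGDClassifier(loss="hinge", alpha=1e-4, max_iter=1000, random_state=0),
    )
    model.fit(train_texts, train_labels)
    predictions = model.predict(dev_texts)
    # macro-averaged F1 is the official metric for task 1 and task 2
    return f1_score(dev_labels, predictions, average="macro")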
5 Results

We used the CodaLab platform for managing the submissions and maintaining a leaderboard.7 The leaderboard for task 1 and task 2 is presented in Table 2. The column names have the format Test <test number>-<task number>; e.g., Test 1-1 stands for Test 1 of task 1. The row ProtestLab Baseline is the aforementioned baseline that was submitted by us.

Table 2. Results that are ranked based on average F1 scores of Test 1 and Test 2 for task 1 and task 2

                      Test 1-1  Test 1-2  Test 2-1  Test 2-2  Avg
ASafaya               .81       .63       .70       .60       .69
PrettyCrocodile       .79       .60       .65       .64       .67
LevelUp Research      .83       .65       .66       .45       .65
Provos RUG            .80       .59       .63       .55       .64
GS                    .78       .56       .64       .58       .64
Be-LISI               .76       .50       .58       .30       .54
ProtestLab Baseline   .83       .49       .58       .20       .52
CIC-NLP               .59       .50       .52       .34       .49
SSNCSE1               .38       .15       .56       .35       .36
iAmirSoltani          .69       .36       -         -         .26
Sayeed Salam          .55       .28       -         -         .20
SEDA lab              .58       -         .15       .02       .19

The summary of the approaches and results of each team that participated in task 1 and/or task 2 is provided below.8

ASafaya (Sakarya University) submitted the best results for task 2 and for the average of task 1 and task 2 using a Bidirectional Gated Recurrent Unit (GRU) based model. Although this model performs the best on average, its performance drops significantly across contexts.

PrettyCrocodile (National Research University HSE) submitted the second best average results, which were predicted using Embeddings from Language Models (ELMo). The performance of the model is comparable in the cross-context setting for task 2.

LevelUp Research (University of North Carolina at Charlotte) applied multi-task learning based on LSTM units using word embeddings from a pre-trained FastText model. This method yielded the best results for task 1.

Provos RUG (University of Groningen) implemented a feature-based stacked ensemble model based on FastText embeddings and a set of different basic Logistic Regression classifiers, which enabled their predictions to rank fourth among the participating teams.

GS (University of Bremen) stacked word embeddings such as GloVe and FastText together with contextualized embeddings generated from Flair language models (LM). This approach was ranked fourth in general and third for task 2.

Be-LISI (Université de Carthage) combined logistic regression with linguistic processing and expansion of the text with related terms using word embedding similarity. This approach marks a significant drop in overall performance, from .64 to .54, in comparison to the higher ranked submissions.

SSNCSE1 (Sri Sivasubramaniya College of Engineering) reported results of their bi-directional LSTM that applies Bahdanau, Normed-Bahdanau, Luong, and Scaled-Luong attentions. The submission that uses Bahdanau attention yielded the results reported in Table 2.

SEDA lab (University of Exeter) applied support vector machines and XGBoost classifiers combined with various word embedding approaches. Results of this submission showed promising performance in terms of precision on both the document and sentence classification tasks.

We analyze the task 3 results separately from task 1 and task 2 as it differs from them. The F1 scores for task 3 are presented in Table 3.

7 https://competitions.codalab.org/competitions/22349#results
8 We have not received details of the submissions from CIC-NLP, iAmirSoltani, and Sayeed Salam. The details of the other approaches can be found in the respective working notes published in the proceedings of CLEF 2019 Lab ProtestNews.

Table 3. Results that are ranked based on the average score of Test 1 and Test 2 for task 3

                   Test 3-1  Test 3-2  Avg
GS                 .604      .532      .568
DeepNEAT           .601      .513      .557
Provos RUG         .600      .456      .528
PrettyCrocodile    .524      .425      .474
LevelUp Research   .516      .394      .455
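The task 3 scores in Table 3 are F1 values over full matches between predicted and annotated spans, as described in Section 4. The sketch below is a simplified illustration of this style of scoring, not the conlleval implementation used for the official evaluation: IOB tag sequences are converted into typed spans and F1 is computed over exact span matches, here on the task 3 example from Figure 2.

def iob_to_spans(tags):
    """Convert an IOB tag sequence into (type, start, end) spans, end exclusive."""
    spans, start, current = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != current):
            if current is not None:
                spans.append((current, start, i))
            start, current = (i, tag[2:]) if tag != "O" else (None, None)
    return spans

def span_f1(gold_tags, pred_tags):
    """Exact-match F1 over spans, a simplified version of CoNLL-style scoring."""
    gold = set(iob_to_spans(gold_tags))
    pred = set(iob_to_spans(pred_tags))
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# The IOB snippet from Figure 2: a target, a trigger, and a participant span.
gold = ["O", "O", "B-target", "I-target", "I-target", "O", "O",
        "B-trigger", "I-trigger", "I-trigger", "O", "B-participant", "I-participant"]
print(span_f1(gold, gold))  # 1.0

A full CoNLL-style scorer additionally averages the F1 over information types and handles malformed tag sequences; the conlleval implementation referenced in Section 4 covers these cases.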
The approaches and results of the teams that participated in task 3 are summarized below.

GS (University of Bremen) submitted the best results for task 3 using a BiLSTM-CRF model incorporating pooled contextualized Flair embeddings, and their model was the best in generalizing.

DeepNEAT (FloodTags & Radboud University) compares the submitted ELMo+BiLSTM model to a traditional CRF and shows that the former is better and more generalizable.

Provos RUG (University of Groningen) divides task 3 into two subtasks, event trigger detection and event argument detection, using a BiLSTM-CRF model with word embeddings, POS embeddings, and character-level embeddings for both subtasks. He further extends the features for the latter subtask with learned embeddings for dependency relations and event triggers.

PrettyCrocodile (National Research University HSE) makes use of ELMo embeddings with different architectures, achieving her best score for task 3 using a BiLSTM.

LevelUp Research (University of North Carolina at Charlotte) implemented a multi-task neural model that requires a time-ordered sequence of word vectors representing a document or sentence. The LSTM layer has been replaced by a layer of bidirectional gated recurrent units (GRUs).

6 Conclusion

The results show how the performance of text classification and information extraction tools drops between two contexts. The scores on data from the target country are significantly lower than on data from the source country. Only the PrettyCrocodile team performed comparatively well across contexts for task 2. Although it is not the best scoring system for either task 1 or task 2, the PrettyCrocodile team's approach shows some promise toward tackling the generalizability of NLP tools.

The generalization of automated tools is an issue that has recently attracted much attention.9 However, as we have determined in our lab, generalizability is still a challenge for state-of-the-art methodology. Consequently, we will continue our efforts by repeating this practice and extending the data, adding data from new countries and languages to our setting. The next iteration will run in the scope of the Workshop on Challenges and Opportunities in Automated Coding of COntentious Political Events (Cope 2019) at the European Symposium Series on Societal Challenges in Computational Social Science (Euro CSS 2019).10,11,12

Acknowledgments

This work is funded by the European Research Council (ERC) Starting Grant 714868 awarded to Dr. Erdem Yörük for his project Emerging Welfare. We are grateful to our steering committee members for the CLEF 2019 lab: Sophia Ananiadou, Antal van den Bosch, Kemal Oflazer, Arzucan Özgür, Aline Villavicencio, and Hristo Tanev. Finally, we thank Theresa Gessler and Peter Makarov for their contributions to organizing the CLEF lab by reviewing the annotation manuals and sharing their work with us, respectively.

9 https://sites.google.com/view/icml2019-generalization/cfp
10 https://competitions.codalab.org/competitions/22842
11 https://emw.ku.edu.tr/?event=challenges-and-opportunities-in-automated-coding-of-contentious-political-events&event date=2019-09-02
12 http://symposium.computationalsocialscience.eu/2019/

References
1. Akdemir, A., Hürriyetoğlu, A., Yörük, E., Gürel, B., Yoltar, Ç., Yüret, D.: Towards Generalizable Place Name Recognition Systems: Analysis and Enhancement of NER Systems on English News from India. In: Proceedings of the 12th Workshop on Geographic Information Retrieval. pp. 8:1–8:10. GIR'18, ACM, New York, NY, USA (2018). https://doi.org/10.1145/3281354.3281363
2. Ettinger, A., Rao, S., Daumé III, H., Bender, E.M.: Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task. In: Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems. pp. 1–10. Association for Computational Linguistics (2017), http://aclweb.org/anthology/W17-5401
3. Hammond, J., Weidmann, N.B.: Using machine-coded event data for the micro-level study of political violence. Research & Politics 1(2), 2053168014539924 (2014). https://doi.org/10.1177/2053168014539924
4. Hürriyetoğlu, A., Yörük, E., Yüret, D., Yoltar, Ç., Gürel, B., Duruşan, F., Mutlu, O.: A task set proposal for automatic protest information collection across multiple countries. In: Azzopardi, L., Stein, B., Fuhr, N., Mayr, P., Hauff, C., Hiemstra, D. (eds.) Advances in Information Retrieval. pp. 316–323. Springer International Publishing, Cham (2019)
5. Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050 (2003)
6. Soboroff, I., Ferro, N., Fuhr, N.: Report on GLARE 2018: 1st Workshop on Generalization in Information Retrieval: Can We Predict Performance in New Domains? SIGIR Forum 52(2), 132–137 (2018), http://sigir.org/wp-content/uploads/2019/01/p132.pdf
7. Wang, W., Kennedy, R., Lazer, D., Ramakrishnan, N.: Growing pains for global monitoring of societal events. Science 353(6307), 1502–1503 (2016). https://doi.org/10.1126/science.aaf6758