Find Problems before They Find You with AnnotatorPro's Monitoring Functionalities

Mohammed R. H. Qwaider, Anne-Lyse Minard, Manuela Speranza, Bernardo Magnini
Fondazione Bruno Kessler, Trento, Italy
{qwaider,minard,manspera,magnini}@fbk.eu

Abstract

English. We present a tool for the annotation of linguistic data. AnnotatorPro offers both complete monitoring functionalities (e.g. inter-annotator agreement and agreement with respect to a gold standard) and highly flexible task design (e.g. token- and document-level annotation, adjudication and reconciliation procedures). We tested AnnotatorPro in several industrial annotation scenarios, coupled with Active Learning techniques.

Italiano. We present a tool for the annotation of linguistic data. AnnotatorPro offers both complete monitoring functionalities (e.g. inter-annotator agreement, agreement with respect to a gold standard) and high flexibility in defining annotation tasks (for example, word- or document-level annotation, adjudication and reconciliation procedures). AnnotatorPro has been tested in several industrial annotation scenarios, coupled with Active Learning techniques.

1 Introduction

Driven by the popularity of machine learning approaches, there has been in recent years an increasing need to produce human-annotated data for a large number of linguistic tasks (e.g. named entity recognition, semantic role labeling, sentiment analysis, word sense disambiguation, and discourse relations, just to mention a few). Datasets (development, training and test data) are being developed for different languages and different domains, both for research and industrial purposes.

A relevant consequence of this is the increasing demand for annotated datasets, both in terms of quantity and quality. This in turn calls for tools with a rich apparatus of functionalities (e.g. annotation, visualization, monitoring and reporting), able to support and monitor a large variety of annotators (i.e. from linguists to mechanical turkers), flexible enough to serve a large spectrum of annotation scenarios (e.g. crowdsourcing and paid professional annotators), and open to the integration of NLP tools (e.g. for automatic pre-annotation and for instance selection based on Active Learning).

Although there is a large supply of annotation tools, such as brat (Stenetorp et al., 2012), GATE (Cunningham et al., 2011), CAT (Bartalesi Lenzi et al., 2012), and WebAnno (Yimam et al., 2013), and several functions are included in common crowdsourcing platforms (e.g. CrowdFlower¹), we believe that none of the available tools possesses the full range of functionalities needed for real and intensive industrial use. As an example, none of the aforementioned tools allows one to implement adjudication rules (i.e. under what condition an item annotated by more than one annotator is assigned to a certain category) or to visualize the items on which annotators disagree.

This paper introduces AnnotatorPro, a new annotation tool conceived mainly to fulfill the above-mentioned needs. We highlight two main aspects of the tool: (i) a high level of flexibility in designing the annotation task, including the possibility to define adjudication and reconciliation procedures; (ii) a rich set of functionalities allowing for constant monitoring of the quality of the data being annotated.

The paper is organized as follows. In Section 2 we compare AnnotatorPro with some state-of-the-art annotation tools. Section 3 provides a general description of the tool. Sections 4 and 5 focus on the task design and on the monitoring functionalities, while Section 6 provides a brief overview of the tool's applications and future extensions.

¹ https://www.crowdflower.com
2 Related Work

Many annotation tools are available to the community. However, some of them are limited by license, e.g. CAT (Bartalesi Lenzi et al., 2012) and GATE (Cunningham et al., 2011) are available for research use only, while others have open licenses, e.g. brat (Stenetorp et al., 2012), but offer limited features.

The brat rapid annotation tool (brat) is an open-license annotation tool that supports different annotation levels, in particular annotation at the token level and annotation of relations between marked tokens. It supports multiple annotators, in the sense that many annotators can collaborate on annotating the same corpus, but it requires an in-house installation. Despite all these advantages, brat supports neither annotation monitoring nor annotator/task reports.

Other tools (e.g. CAT) provide advanced functionalities to perform annotation at different levels (e.g. token and relation level) through a user-friendly interface, although they do not support annotation monitoring.

CrowdFlower is an outsourcing annotation service that provides a platform for annotation (focusing on annotation at the document level) employing non-expert contributors. It uses gold standard tests to evaluate the annotators and supports automatic adjudication features, but no inter-annotator agreement metrics are available. In addition, an important issue which could limit the use of outsourcing is that the data are not stored in-house, in particular when sensitive data covered by privacy regulations are concerned.

GATE is a powerful tool that implements most of the features needed to facilitate annotation production in all its phases (e.g. task creation, annotator assignment, annotation monitoring and multi-layer annotation of the same corpus). However, visualization of disagreement is not available and no automatic adjudication is provided.

3 Overall Description

AnnotatorPro is a web-based annotation tool built on top of the open source tool MT-EQuAl (Machine Translation Error Quality Alignment), a toolkit for the manual assessment of Machine Translation output that implements three different tasks in an integrated environment: annotation of translation errors, translation quality rating (e.g. adequacy and fluency, relative ranking of alternative translations), and word alignment (Girardi et al., 2014).

AnnotatorPro inherits from MT-EQuAl the capability of scaling over big data in an optimized platform that is able to save annotations in real time. It also makes use of the MT-EQuAl web-based interface, which is multi-user and user-friendly.

The tool performs simple tokenization based on spaces, punctuation, and other language-dependent rules, but the user can also upload already tokenized files.

We designed new functionalities to fulfill the requirements of high-quality corpus annotation performed by multiple annotators. AnnotatorPro's main novel features are:

• The interface includes different options to design the annotation task (Section 4.1), which are set by the project manager.

• The tool enables annotation at two levels (Section 4.2): annotation at the token level (e.g. part-of-speech tagging and named entity recognition) and annotation at the document level (e.g. sentiment analysis).

• AnnotatorPro's interface offers functionalities for annotation monitoring (Section 5), which include inter-annotator agreement (IAA) monitoring and quality monitoring.

AnnotatorPro has been implemented in PHP and JavaScript, and uses MySQL to manage its database. It takes as input several UTF-8 encoded formats: TXT (raw text), IOB2² and TSV (tab separated values). It also accepts ZIP archives containing the source files.

As regards data storage, the annotations of each document are saved in a MySQL database in real time (i.e. while the data are being annotated). The annotated data can be exported in the following formats: IOB2 and TSV.

² The IOB2 tagging format is a common format for text chunking. B- is used to tag the beginning of a chunk, I- to tag tokens inside the chunk and O to indicate tokens not belonging to a chunk.
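To make the token-level input format concrete, here is a small hand-made IOB2 fragment together with a minimal Python sketch that reads it back into labelled chunks. The example sentence, the PER/ORG tags and the helper functions are illustrative only and are not taken from AnnotatorPro's code base.

```python
# Minimal sketch: reading an IOB2 file (one "token<TAB>tag" pair per line,
# blank line between sentences) and grouping B-/I- sequences into chunks.
# Example data and helper names are hypothetical.

EXAMPLE = """\
Mohammed\tB-PER
Qwaider\tI-PER
works\tO
at\tO
Fondazione\tB-ORG
Bruno\tI-ORG
Kessler\tI-ORG
"""

def read_iob2(text):
    """Yield (token, tag) pairs from IOB2-formatted text."""
    for line in text.splitlines():
        if line.strip():                     # skip sentence separators
            token, tag = line.split("\t")
            yield token, tag

def extract_chunks(pairs):
    """Group B-/I- sequences into (label, [tokens]) chunks."""
    chunks, current = [], None
    for token, tag in pairs:
        if tag.startswith("B-"):             # a new chunk begins
            current = (tag[2:], [token])
            chunks.append(current)
        elif tag.startswith("I-") and current:
            current[1].append(token)         # continue the open chunk
        else:                                # O tag closes any open chunk
            current = None
    return chunks

print(extract_chunks(read_iob2(EXAMPLE)))
# [('PER', ['Mohammed', 'Qwaider']), ('ORG', ['Fondazione', 'Bruno', 'Kessler'])]
```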
4 Annotation Task Design

AnnotatorPro distinguishes two types of users, i.e. managers and annotators. Managers take care of designing the annotation task at hand; in particular, they define (i) the annotation procedure, which depends on the number of annotators, their level of expertise (for example, non-expert annotators might not be allowed to see or modify each other's work) and the use the dataset is intended for (e.g. evaluation, training, etc.), and (ii) the annotator's task, which includes selecting the most appropriate annotation level and creating the annotation categories/labels (Figure 1). As opposed to managers, annotators are basic users, who only have access to a limited number of (annotation) functionalities (Figure 2).

Figure 1: Annotator's task definition: annotation level, task's name, task description, and annotation categories.

Figure 2: An example annotation interface: sentiment annotation of tweets.

4.1 Annotation Procedure

One of the main tasks of the manager is to define the annotation procedure, which consists mainly of:

• Defining the number of annotators (one or more) who can collaborate on annotating the same corpus.

• In the case of multiple annotators, defining the type of collaboration among them, i.e. whether each item is to be annotated by only one of them or by several (document level only).

• Defining the automatic adjudication rules in the case where multiple annotations of the same data are collected (document level only). The two basic options, illustrated by the sketch given after this list, are:
  – considering an annotation as solved if the majority of annotators agreed on a certain annotation;
  – considering an annotation as solved if a minimum number of concordant annotations is reached.

• Deciding whether to make the metadata of the documents (e.g. document id, document title) visible to the annotators during the annotation phase.

• Deciding whether to allow for a revision phase after the annotation has been concluded, i.e. giving the annotators the possibility to modify their annotations, for example after a reconciliation step has taken place. By default, document metadata are visible during the revision phase to facilitate the work.

• Deciding the modality for the selection of the data to be presented to the annotators:
  – propose to the annotator preselected ordered documents (default option);
  – randomly select documents from a large dataset;
  – select documents from a large dataset through an Active Learning process.³

³ The Active Learning process is not provided in the distribution of AnnotatorPro, but the tool can select the data to be annotated if they are associated with a confidence value (in this case the tool can either select those with the highest score or those with the lowest score).
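The two adjudication options can be pictured with a short sketch. The function below is a hypothetical illustration rather than AnnotatorPro's actual implementation: given the labels collected for one document, it returns the adjudicated label under either the majority rule or the minimum-number-of-concordant-annotations rule, and None if the item remains unsolved.

```python
from collections import Counter

def adjudicate(labels, rule="majority", min_concordant=2):
    """Hypothetical adjudication of document-level annotations.

    labels         -- labels assigned to one document by its annotators
    rule           -- "majority" or "min_concordant"
    min_concordant -- threshold used by the "min_concordant" rule
    Returns the winning label, or None if the item is not solved
    (e.g. a tie, or too few concordant annotations).
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]

    if rule == "majority":
        # solved only if one label has a strict majority of the annotations
        return label if votes > len(labels) / 2 else None
    if rule == "min_concordant":
        # solved as soon as enough annotators agree on the same label
        return label if votes >= min_concordant else None
    raise ValueError(f"unknown rule: {rule}")

# Example: three annotators labelled a tweet for sentiment.
print(adjudicate(["positive", "positive", "neutral"]))                       # positive
print(adjudicate(["positive", "neutral", "negative"], "min_concordant", 2))  # None
```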
4.2 Annotator's Task

AnnotatorPro supports two different annotation levels, i.e. one where annotation is performed at the document level and one where smaller units, typically tokens, are annotated. It is the manager's task to select the most appropriate annotation level for the task at hand; for example, named entity recognition needs data annotated at the token level, whereas for sentiment analysis a corpus is generally annotated at the document level.

Finally, the task manager defines the set of categories or labels to be used by the annotator, respectively, to classify the documents (in the case of document-level annotation) or to mark portions of text.

5 Annotation Monitoring

In AnnotatorPro we have implemented several monitoring functionalities aimed at guaranteeing high-quality annotation, as described below.

5.1 Progress Monitoring

From the manager interface, two tabs display information about the annotations already performed. The Annotation tab presents the progress of the annotation task, i.e. the annotations done by each annotator. This is real-time information, which means that the manager can follow the progress of the work underway. Moreover, the manager can visualize the annotations of each user in read-only mode.

The Overall stats panel displays a table which summarizes the overall statistics about the annotation. The following information is given: total number of annotated documents; number of non-annotated documents; number of partially annotated documents (i.e. documents not yet annotated by the required number of annotators); number of completely annotated documents (i.e. documents annotated by the required number of annotators, independently of whether the annotators did or did not reach an agreement).

5.2 Inter-Annotator Agreement Monitoring

IAA monitoring, which measures the level of agreement between the annotators at regular intervals, is activated every time two or more annotators annotate the same data.

IAA is computed in terms of the Dice coefficient (Lin, 1998) and Cohen's Kappa (Viera and Garrett, 2005); the latter represents the agreement as a continuous value from -1 to 1, where -1 means total disagreement and 1 means total agreement.

The project manager has access to different types of information to constantly monitor the level of agreement between annotators, focusing both on single annotators and on the overall picture:

• the level of agreement each annotator obtains with every other annotator, and the average of the IAA values obtained by each annotator;

• the overall average IAA.

AnnotatorPro also provides a visualization of the annotations made by each annotator for each document, where a different color is used to present each tag from the tagset (see Figure 3). This enables the manager to have quick and easy access to the cases of disagreement and, if needed, to give feedback to the annotators.
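As a rough illustration of the two measures, the sketch below computes Cohen's Kappa between two annotators from their label sequences, and a per-label Dice coefficient over the sets of items each annotator assigned to that label. The function names, the data layout and the per-label use of Dice are assumptions made for the example; they do not describe AnnotatorPro's internal code.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's Kappa between two equal-length label sequences a and b."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement: product of the two annotators' label proportions
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] / n * cb[l] / n for l in set(a) | set(b))
    if expected == 1.0:                      # degenerate case: a single label
        return 1.0
    return (observed - expected) / (1 - expected)

def dice(a, b, label):
    """Dice coefficient over the items the two annotators gave `label` to."""
    sa = {i for i, x in enumerate(a) if x == label}
    sb = {i for i, x in enumerate(b) if x == label}
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

# Example: two annotators, five tweets, sentiment labels.
ann1 = ["pos", "pos", "neg", "neu", "neg"]
ann2 = ["pos", "neg", "neg", "neu", "neg"]
print(cohen_kappa(ann1, ann2))   # ~0.69
print(dice(ann1, ann2, "neg"))   # 0.8
```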
Figure 3: Visualization of the annotations made for two documents. The first example is a case of disagreement and the second a case of agreement. At the top of the page, the number of annotations for each tag is given.

5.3 Quality Monitoring

Quality monitoring makes use of a gold standard dataset previously annotated by an expert. Each annotator is asked to provide an annotation for those samples. The annotators do not know whether they are annotating a gold sample or not, which ensures a non-biased evaluation. This enables the project manager to assess the quality of the annotations of each annotator by comparing them against a dataset considered correct. The same quantitative information and visualizations as those available for IAA monitoring (see Section 5.2) are provided.
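One way to picture this mechanism is sketched below: gold items are mixed, unmarked, into the annotation queue, and each annotator is later scored only on that hidden gold subset. The queue construction, the accuracy measure and all identifiers are illustrative assumptions; AnnotatorPro itself reports the same measures used for IAA monitoring (Section 5.2).

```python
import random

def build_queue(regular_ids, gold, ratio=0.1, seed=0):
    """Mix a fraction of gold items (unmarked) into the annotation queue."""
    rng = random.Random(seed)
    n_gold = max(1, int(len(regular_ids) * ratio))
    queue = list(regular_ids) + rng.sample(sorted(gold), n_gold)
    rng.shuffle(queue)                       # annotators cannot tell gold items apart
    return queue

def score_against_gold(annotations, gold):
    """Accuracy of one annotator on the hidden gold items only."""
    scored = [doc for doc in annotations if doc in gold]
    if not scored:
        return None
    correct = sum(annotations[doc] == gold[doc] for doc in scored)
    return correct / len(scored)

# Example: gold labels prepared by an expert, and one annotator's output.
gold = {"d7": "neg", "d9": "pos"}
annotations = {"d1": "pos", "d7": "neg", "d9": "neu"}
print(score_against_gold(annotations, gold))   # 0.5
```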
6 Applications and Further Extensions

We have used AnnotatorPro in multiple projects, on different tasks, including named entity recognition (Minard et al., 2016a), event detection (Minard et al., 2016b) and sentiment analysis. The tool has been successfully exploited both in situations with a few experienced annotators and in situations with more than 20 non-expert annotators (i.e. high school students) working in parallel. AnnotatorPro has been fully integrated within an Active Learning platform (Magnini et al., 2016) and successfully employed in two industrial projects, resulting in high-quality data.

As for our next steps, we are working to extend AnnotatorPro to include relations among annotated entities, such as the relation between a verb and its argument(s) in semantic role labeling.

AnnotatorPro is distributed as open source software under the terms of the Apache License 2.0⁴ from the web page: http://hlt-nlp.fbk.eu/technologies/annotatorpro.

⁴ https://www.apache.org/licenses/LICENSE-2.0

Acknowledgments

This work has been partially funded by the Euclip-Res project, under the program Bando Innovazione 2016 of the autonomous Province of Bolzano.

References

Valentina Bartalesi Lenzi, Giovanni Moretti, and Rachele Sprugnoli. 2012. CAT: the CELCT annotation tool. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 333–338, Istanbul, Turkey, May 23-25, 2012.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science.

Christian Girardi, Luisa Bentivogli, Mohammad Amin Farajian, and Marcello Federico. 2014. MT-EQuAl: A toolkit for human assessment of machine translation output. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 120–123, Dublin, Ireland, August 23-29, 2014. ACL.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), pages 296–304, Madison, Wisconsin, USA. Morgan Kaufmann Publishers Inc.

Bernardo Magnini, Anne-Lyse Minard, Mohammed R. H. Qwaider, and Manuela Speranza. 2016. TextPro-AL: An active learning platform for flexible and efficient production of training data for NLP tasks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 131–135, Osaka, Japan, December.

Anne-Lyse Minard, Mohammed R. H. Qwaider, and Bernardo Magnini. 2016a. FBK-NLP at NEEL-IT: Active learning for domain adaptation. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), volume 1749, Napoli, Italy, December 5-7, 2016.

Anne-Lyse Minard, Manuela Speranza, Bernardo Magnini, and Mohammed R. H. Qwaider. 2016b. Semantic interpretation of events in live soccer commentaries. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. brat: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12), pages 102–107, Avignon, France. Association for Computational Linguistics.

Anthony J. Viera and Joanne M. Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5):360–363.

Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Chris Biemann. 2013. WebAnno: A flexible, web-based and visually supported system for distributed annotations. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 1–6, Sofia, Bulgaria, August. Association for Computational Linguistics.