Find Problems before They Find You with AnnotatorPro's Monitoring Functionalities

Mohammed R. H. Qwaider, Anne-Lyse Minard, Manuela Speranza, Bernardo Magnini
Fondazione Bruno Kessler, Trento, Italy
{qwaider,minard,manspera,magnini}@fbk.eu

Abstract

English. We present a tool for the annotation of linguistic data. AnnotatorPro offers both complete monitoring functionalities (e.g. inter-annotator agreement and agreement with respect to a gold standard) and highly flexible task design (e.g. token- and document-level annotation, adjudication and reconciliation procedures). We tested AnnotatorPro in several industrial annotation scenarios, coupled with Active Learning techniques.

Italiano. We present a tool for the annotation of linguistic data. AnnotatorPro offers both complete monitoring functionalities (e.g. inter-annotator agreement, agreement with respect to a gold standard) and high flexibility in defining annotation tasks (for example, word- or document-level annotation, adjudication and reconciliation procedures). AnnotatorPro has been tested in several industrial annotation scenarios, coupled with Active Learning techniques.

1 Introduction

Driven by the popularity of machine learning approaches, there has been in recent years an increasing need to produce human-annotated data for a large number of linguistic tasks (e.g. named entity recognition, semantic role labeling, sentiment analysis, word sense disambiguation, and discourse relations, just to mention a few). Datasets (development, training and test data) are being developed for different languages and different domains, both for research and industrial purposes.

A relevant consequence of this is the increasing demand for annotated datasets, both in terms of quantity and quality. This in turn calls for tools with a rich apparatus of functionalities (e.g. annotation, visualization, monitoring and reporting), able to support and monitor a large variety of annotators (i.e. from linguists to mechanical turkers), flexible enough to serve a large spectrum of annotation scenarios (e.g. crowdsourcing and paid professional annotators), and open to the integration of NLP tools (e.g. for automatic pre-annotation and for instance selection based on Active Learning).

Although there is a large supply of annotation tools, such as brat (Stenetorp et al., 2012), GATE (Cunningham et al., 2011), CAT (Bartalesi Lenzi et al., 2012), and WebAnno (Yimam et al., 2013), and several functions are included in common crowdsourcing platforms (e.g. CrowdFlower¹), we believe that none of the available tools possesses the full range of functionalities needed for real and intensive industrial use. As an example, none of the aforementioned tools allows one to implement adjudication rules (i.e. under what condition an item annotated by more than one annotator is assigned to a certain category) or to visualize the items on which annotators disagree.

This paper introduces AnnotatorPro, a new annotation tool conceived mainly to fulfill the above-mentioned needs. We highlight two main aspects of the tool: (i) a high level of flexibility in designing the annotation task, including the possibility to define adjudication and reconciliation procedures; (ii) a rich set of functionalities allowing for constant monitoring of the quality of the data being annotated.

The paper is organized as follows. In Section 2 we compare AnnotatorPro with some state-of-the-art annotation tools. Section 3 provides a general description of the tool. Sections 4 and 5 focus on the task design and on the monitoring functionalities, while Section 6 provides a brief overview of the tool's applications and future extensions.

¹ https://www.crowdflower.com
2 Related Work

Many annotation tools are available to the community. However, some of them are limited by license, e.g. CAT (Bartalesi Lenzi et al., 2012) and GATE (Cunningham et al., 2011) are available for research use only, while others have open licenses, e.g. brat (Stenetorp et al., 2012), but offer limited features.

The brat rapid annotation tool (brat) is an open-license annotation tool that supports different annotation levels, in particular annotation at the token level and annotation of relations between marked tokens. It supports multiple annotators, in the sense that many annotators can collaborate on annotating the same corpus, but it requires an in-house installation. Despite all these advantages, brat supports neither annotation monitoring nor annotator/task reports.

Other tools (e.g. CAT) provide advanced functionalities to perform annotation at different levels (e.g. token and relation level) through a user-friendly interface, although they do not support annotation monitoring.

CrowdFlower is an outsourcing annotation service that provides a platform for annotation (focusing on annotation at the document level) employing non-expert contributors. It uses gold standard tests to evaluate the annotators and supports automatic adjudication features, but no inter-annotator agreement metrics are available. In addition, an important issue which could limit the use of outsourcing is that the data are not stored in-house, in particular when sensitive data covered by privacy regulations are concerned.

GATE is a powerful tool that implements most of the features needed to facilitate annotation production in all its phases (e.g. task creation, annotator assignment, annotation monitoring and multi-layer annotation of the same corpus). However, visualization of disagreement is not available and no automatic adjudication is provided.

3 Overall Description

AnnotatorPro is a web-based annotation tool built on top of the open source tool MT-EQuAl (Machine Translation Error Quality Alignment), a toolkit for the manual assessment of Machine Translation output that implements three different tasks in an integrated environment: annotation of translation errors, translation quality rating (e.g. adequacy and fluency, relative ranking of alternative translations), and word alignment (Girardi et al., 2014).

AnnotatorPro inherits from MT-EQuAl the capability of scaling over big data in an optimized platform that is able to save annotations in real time. It also makes use of the MT-EQuAl web-based interface, which is multi-user and user-friendly.

The tool performs simple tokenization based on spaces, punctuation, and other language-dependent rules, but the user can also upload already tokenized files.

We designed new functionalities to fulfill the requirements of high-quality corpus annotation performed by multiple annotators. AnnotatorPro's main novel features are:

• The interface includes different options to design the annotation task (Section 4.1), which are set by the project manager.

• The tool enables annotation at two levels (Section 4.2): annotation at the token level (e.g. part-of-speech tagging and named entity recognition) and annotation at the document level (e.g. sentiment analysis).

• AnnotatorPro's interface offers functionalities for annotation monitoring (Section 5), which include inter-annotator agreement (IAA) monitoring and quality monitoring.

AnnotatorPro has been implemented in PHP and JavaScript, and uses MySQL to manage its database. It takes as input several UTF-8 encoded formats: TXT (raw text), IOB2² and TSV (tab separated values). It also accepts ZIP archives containing the source files.

As regards data storage, the annotations of each document are saved in a MySQL database in real time (i.e. while the data are being annotated). The annotated data can be exported in the following formats: IOB2 and TSV.

² The IOB2 tagging format is a common format for text chunking. B- is used to tag the beginning of a chunk, I- to tag tokens inside the chunk and O to indicate tokens not belonging to a chunk.
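To make the token-level input format concrete, here is a small hand-made IOB2 fragment together with a minimal Python sketch that reads it back into labelled chunks. The example sentence, the PER/ORG tags and the helper functions are illustrative only and are not taken from AnnotatorPro's code base.

```python
# Minimal sketch: reading an IOB2 file (one "token<TAB>tag" pair per line,
# blank line between sentences) and grouping B-/I- sequences into chunks.
# Example data and helper names are hypothetical.

EXAMPLE = """\
Mohammed\tB-PER
Qwaider\tI-PER
works\tO
at\tO
Fondazione\tB-ORG
Bruno\tI-ORG
Kessler\tI-ORG
"""

def read_iob2(text):
    """Yield (token, tag) pairs from IOB2-formatted text."""
    for line in text.splitlines():
        if line.strip():                     # skip sentence separators
            token, tag = line.split("\t")
            yield token, tag

def extract_chunks(pairs):
    """Group B-/I- sequences into (label, [tokens]) chunks."""
    chunks, current = [], None
    for token, tag in pairs:
        if tag.startswith("B-"):             # a new chunk begins
            current = (tag[2:], [token])
            chunks.append(current)
        elif tag.startswith("I-") and current:
            current[1].append(token)         # continue the open chunk
        else:                                # O tag closes any open chunk
            current = None
    return chunks

print(extract_chunks(read_iob2(EXAMPLE)))
# [('PER', ['Mohammed', 'Qwaider']), ('ORG', ['Fondazione', 'Bruno', 'Kessler'])]
```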
4 Annotation Task Design

AnnotatorPro distinguishes two types of users, i.e. managers and annotators. Managers take care of designing the annotation task at hand; in particular, they define (i) the annotation procedure, which depends on the number of annotators, their level of expertise (for example, non-expert annotators might not be allowed to see or modify each other's work) and the use the dataset is intended for (e.g. evaluation, training, etc.), and (ii) the annotator's task, which includes selecting the most appropriate annotation level and creating the annotation categories/labels (Figure 1). As opposed to managers, annotators are basic users, who only have access to a limited number of (annotation) functionalities (Figure 2).

Figure 1: Annotator's task definition: annotation level, task's name, task description, and annotation categories.

Figure 2: An example annotation interface: sentiment annotation of tweets.

4.1 Annotation Procedure

One of the main tasks of the manager is to define the annotation procedure, which consists mainly of:

• Defining the number of annotators (one or more) who can collaborate on annotating the same corpus.

• In the case of multiple annotators, defining the type of collaboration among them, i.e. whether each item is to be annotated by only one of them or by several (document level only).

• Defining the automatic adjudication rules in the case where multiple annotations of the same data are collected (document level only). The two basic options, illustrated by the sketch given after this list, are:
  – considering an annotation as solved if the majority of annotators agreed on a certain annotation;
  – considering an annotation as solved if a minimum number of concordant annotations is reached.

• Deciding whether to make the metadata of the documents (e.g. document id, document title) visible to the annotators during the annotation phase.

• Deciding whether to allow for a revision phase after the annotation has been concluded, i.e. giving the annotators the possibility to modify their annotations, for example after a reconciliation step has taken place. By default, document metadata are visible during the revision phase to facilitate the work.

• Deciding the modality for the selection of the data to be presented to the annotators:
  – propose to the annotator preselected ordered documents (default option);
  – randomly select documents from a large dataset;
  – select documents from a large dataset through an Active Learning process.³

³ The Active Learning process is not provided in the distribution of AnnotatorPro, but the tool can select the data to be annotated if they are associated with a confidence value (in this case the tool can either select those with the highest score or those with the lowest score).
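The two adjudication options can be pictured with a short sketch. The function below is a hypothetical illustration rather than AnnotatorPro's actual implementation: given the labels collected for one document, it returns the adjudicated label under either the majority rule or the minimum-number-of-concordant-annotations rule, and None if the item remains unsolved.

```python
from collections import Counter

def adjudicate(labels, rule="majority", min_concordant=2):
    """Hypothetical adjudication of document-level annotations.

    labels         -- labels assigned to one document by its annotators
    rule           -- "majority" or "min_concordant"
    min_concordant -- threshold used by the "min_concordant" rule
    Returns the winning label, or None if the item is not solved
    (e.g. a tie, or too few concordant annotations).
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]

    if rule == "majority":
        # solved only if one label has a strict majority of the annotations
        return label if votes > len(labels) / 2 else None
    if rule == "min_concordant":
        # solved as soon as enough annotators agree on the same label
        return label if votes >= min_concordant else None
    raise ValueError(f"unknown rule: {rule}")

# Example: three annotators labelled a tweet for sentiment.
print(adjudicate(["positive", "positive", "neutral"]))                       # positive
print(adjudicate(["positive", "neutral", "negative"], "min_concordant", 2))  # None
```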
4.2 Annotator's Task

AnnotatorPro supports two different annotation levels, i.e. one where annotation is performed at the document level and one where smaller units, typically tokens, are annotated. It is the manager's task to select the most appropriate annotation level for the task at hand; for example, named entity recognition needs data annotated at the token level, whereas for sentiment analysis a corpus is generally annotated at the document level.

Finally, the task manager defines the set of categories or labels to be used by the annotator, respectively, to classify the documents (in the case of document-level annotation) or to mark portions of text.

5 Annotation Monitoring

In AnnotatorPro we have implemented several monitoring functionalities aimed at guaranteeing high-quality annotation, as described below.

5.1 Progress Monitoring

From the manager interface, two tabs display information about the annotations already performed. The Annotation tab presents the progress of the annotation task, i.e. the annotations done by each annotator. This is real-time information, which means that the manager can follow the progress of the work underway. Moreover, the manager can visualize the annotations of each user in read-only mode.

The Overall stats panel displays a table which summarizes the overall statistics about the annotation. The following information is given: total number of annotated documents; number of non-annotated documents; number of partially annotated documents (i.e. documents not yet annotated by the required number of annotators); number of completely annotated documents (i.e. documents annotated by the required number of annotators, independently of whether the annotators did or did not reach an agreement).

5.2 Inter-Annotator Agreement Monitoring

IAA monitoring, which measures the level of agreement between the annotators at regular intervals, is activated every time two or more annotators annotate the same data.

IAA is computed in terms of the Dice coefficient (Lin, 1998) and Cohen's Kappa (Viera and Garrett, 2005); the latter represents the agreement as a continuous value from -1 to 1, where -1 means total disagreement and 1 means total agreement.

The project manager has access to different types of information to constantly monitor the level of agreement between annotators, focusing both on single annotators and on the overall picture:

• the level of agreement each annotator obtains with every other annotator, and the average of the IAA values obtained by each annotator;

• the overall average IAA.

AnnotatorPro also provides a visualization of the annotations made by each annotator for each document, where a different color is used to present each tag from the tagset (see Figure 3). This enables the manager to have quick and easy access to the cases of disagreement and, if needed, to give feedback to the annotators.
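As a rough illustration of the two measures, the sketch below computes Cohen's Kappa between two annotators from their label sequences, and a per-label Dice coefficient over the sets of items each annotator assigned to that label. The function names, the data layout and the per-label use of Dice are assumptions made for the example; they do not describe AnnotatorPro's internal code.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's Kappa between two equal-length label sequences a and b."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement: product of the two annotators' label proportions
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] / n * cb[l] / n for l in set(a) | set(b))
    if expected == 1.0:                      # degenerate case: a single label
        return 1.0
    return (observed - expected) / (1 - expected)

def dice(a, b, label):
    """Dice coefficient over the items the two annotators gave `label` to."""
    sa = {i for i, x in enumerate(a) if x == label}
    sb = {i for i, x in enumerate(b) if x == label}
    if not sa and not sb:
        return 1.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))

# Example: two annotators, five tweets, sentiment labels.
ann1 = ["pos", "pos", "neg", "neu", "neg"]
ann2 = ["pos", "neg", "neg", "neu", "neg"]
print(cohen_kappa(ann1, ann2))   # ~0.69
print(dice(ann1, ann2, "neg"))   # 0.8
```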
Figure 3: Visualization of the annotations made for two documents. The first example is a case of disagreement and the second a case of agreement. At the top of the page, the number of annotations for each tag is given.

5.3 Quality Monitoring

Quality monitoring makes use of a gold standard dataset previously annotated by an expert. Each annotator is asked to provide an annotation for those samples. The annotators do not know whether they are annotating a gold sample or not, which ensures a non-biased evaluation. This enables the project manager to assess the quality of the annotations of each annotator by comparing them against a dataset considered correct. The same quantitative information and visualizations as those available for IAA monitoring (see Section 5.2) are provided.
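One way to picture this mechanism is sketched below: gold items are mixed, unmarked, into the annotation queue, and each annotator is later scored only on that hidden gold subset. The queue construction, the accuracy measure and all identifiers are illustrative assumptions; AnnotatorPro itself reports the same measures used for IAA monitoring (Section 5.2).

```python
import random

def build_queue(regular_ids, gold, ratio=0.1, seed=0):
    """Mix a fraction of gold items (unmarked) into the annotation queue."""
    rng = random.Random(seed)
    n_gold = max(1, int(len(regular_ids) * ratio))
    queue = list(regular_ids) + rng.sample(sorted(gold), n_gold)
    rng.shuffle(queue)                       # annotators cannot tell gold items apart
    return queue

def score_against_gold(annotations, gold):
    """Accuracy of one annotator on the hidden gold items only."""
    scored = [doc for doc in annotations if doc in gold]
    if not scored:
        return None
    correct = sum(annotations[doc] == gold[doc] for doc in scored)
    return correct / len(scored)

# Example: gold labels prepared by an expert, and one annotator's output.
gold = {"d7": "neg", "d9": "pos"}
annotations = {"d1": "pos", "d7": "neg", "d9": "neu"}
print(score_against_gold(annotations, gold))   # 0.5
```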
6 Applications and Further Extensions

We have used AnnotatorPro in multiple projects, on different tasks, including named entity recognition (Minard et al., 2016a), event detection (Minard et al., 2016b) and sentiment analysis. The tool has been successfully exploited both in situations with a few experienced annotators and in situations with more than 20 non-expert annotators (i.e. high school students) working in parallel. AnnotatorPro has been fully integrated within an Active Learning platform (Magnini et al., 2016) and successfully employed in two industrial projects, resulting in high-quality data.

As for our next steps, we are working to extend AnnotatorPro to include relations among annotated entities, such as the relation between a verb and its argument(s) in semantic role labeling.

AnnotatorPro is distributed as open source software under the terms of the Apache License 2.0⁴ from the web page: http://hlt-nlp.fbk.eu/technologies/annotatorpro.

⁴ https://www.apache.org/licenses/LICENSE-2.0

Acknowledgments

This work has been partially funded by the Euclip-Res project, under the program Bando Innovazione 2016 of the autonomous Province of Bolzano.

References

Valentina Bartalesi Lenzi, Giovanni Moretti, and Rachele Sprugnoli. 2012. CAT: the CELCT annotation tool. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 333–338, Istanbul, Turkey, May 23-25, 2012.

Hamish Cunningham, Diana Maynard, Kalina Bontcheva, Valentin Tablan, Niraj Aswani, Ian Roberts, Genevieve Gorrell, Adam Funk, Angus Roberts, Danica Damljanovic, Thomas Heitz, Mark A. Greenwood, Horacio Saggion, Johann Petrak, Yaoyong Li, and Wim Peters. 2011. Text Processing with GATE (Version 6). University of Sheffield Department of Computer Science.

Christian Girardi, Luisa Bentivogli, Mohammad Amin Farajian, and Marcello Federico. 2014. MT-EQuAl: A toolkit for human assessment of machine translation output. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: System Demonstrations, pages 120–123, Dublin, Ireland, August 23-29, 2014. ACL.

Dekang Lin. 1998. An information-theoretic definition of similarity. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML '98), pages 296–304, Madison, Wisconsin, USA. Morgan Kaufmann Publishers Inc.

Bernardo Magnini, Anne-Lyse Minard, Mohammed R. H. Qwaider, and Manuela Speranza. 2016. TextPro-AL: An active learning platform for flexible and efficient production of training data for NLP tasks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, pages 131–135, Osaka, Japan, December.

Anne-Lyse Minard, Mohammed R. H. Qwaider, and Bernardo Magnini. 2016a. FBK-NLP at NEEL-IT: Active learning for domain adaptation. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), volume 1749, Napoli, Italy, December 5-7, 2016.

Anne-Lyse Minard, Manuela Speranza, Bernardo Magnini, and Mohammed R. H. Qwaider. 2016b. Semantic interpretation of events in live soccer commentaries. In Proceedings of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016), Napoli, Italy, December 5-7, 2016.

Pontus Stenetorp, Sampo Pyysalo, Goran Topić, Tomoko Ohta, Sophia Ananiadou, and Jun'ichi Tsujii. 2012. brat: A web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL '12), pages 102–107, Avignon, France. Association for Computational Linguistics.

Anthony J. Viera and Joanne M. Garrett. 2005. Understanding interobserver agreement: The kappa statistic. Family Medicine, 37(5):360–363.

Seid Muhie Yimam, Iryna Gurevych, Richard Eckart de Castilho, and Chris Biemann. 2013. WebAnno: A flexible, web-based and visually supported system for distributed annotations. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 1–6, Sofia, Bulgaria, August. Association for Computational Linguistics.