Harnessing Disagreement for Event Semantics

Lora Aroyo (1,2), Chris Welty (2)
1 VU University Amsterdam, lora.aroyo@vu.nl
2 IBM Watson Research Center, cawelty@gmail.com

Abstract. This paper focuses on how events can be detected and extracted from natural language text, and how the extracted events can be represented for use on the Semantic Web. We draw inspiration from the similarity between crowdsourcing approaches for tagging and text annotation tasks that establish a ground truth for events. We propose a novel approach that harnesses the disagreement between human annotators, defining a framework to capture and analyze the nature of that disagreement. We expect two novel results from this approach: on the one hand, a new way of measuring ground truth (performance); on the other, a new set of semantic features for learning in event extraction.

1 Introduction

Events play an important role in human communication. Our understanding of the world is transferred to others through stories, in which objects and abstract notions are grounded in space and time through their participation in events. In conventional narrative, these events unfold sequentially in a timeline. Upon inspection, however, our understanding of events is quite difficult to pin down. This can be seen in metaphysics, where theories range from events as the most basic kind of entity in the universe to events as an unreal fiction [1], and in Natural Language Processing (NLP), where the few annotation tasks for events that have been performed have shown very low inter-annotator agreement.

One of the simplest and most prevalent ontological views of the universe is that there are two basic kinds of entities: objects and events. They are distinguished in that events perdure (their parts exist at different time points) and objects endure (they have all their parts at all points in time) [2]. The distinction is sometimes phrased "objects are wholly present at any point in time, events unfold over time." This definition and distinction is not universally held, but it serves us here as a convenient reference point; we believe our conclusion holds regardless of the ontological status of events.

The importance of events and their interpretation is widely recognized in NLP, but solutions remain elusive, whereas NLP technology for detecting objects (such as people, places, and organizations) in text has reached "off the shelf" levels of maturity. In addition, there is comparatively little annotated data for training and evaluating event detection systems, and the bulk of what is available is difficult to reproduce. Annotator disagreement is quite high in most cases, and since many believe this is a sign of a poorly defined problem, guidelines for event annotation tasks are made very precise in order to address and resolve specific kinds of disagreement. This leads to brittleness or over-generality, making it difficult to transfer annotated data across domains or to use the results for anything practical.

One of the reasons for annotator disagreement is that events are highly compositional in the way they are described in language. Objects are compositional too, but only in reality: in language we rarely refer to the parts of an object, only to the object itself. For events, we often describe where and when they take place, who or what the participants were, what the causes or results of the event were, and what type of event it was. More importantly, events are usually referred to through their parts; for example, we might talk about a terrorist event by using the word "explosion", which literally refers to only a small part of the overall event, making it sometimes difficult to determine whether two parts of one event refer to the same thing.

This highly compositional nature means that there are more potential ways in which two human annotators can disagree about a single event. Since agreement is never perfect for any annotation task, the agreement for a composite annotation task will necessarily degrade as the product of the agreement on the sub-tasks. In other words, if events are taken to be a time, place, actor, patient, and type, the agreement for the event task will be the product of the agreement on the five sub-tasks, which would be low since agreement for any task is between 0 and 1.
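To make the arithmetic explicit, here is a worked illustration with invented numbers (the paper reports no such figures, and the independence of sub-tasks is an idealization):

```latex
Let $a_i \in (0,1]$ be the inter-annotator agreement on sub-task $i$.
Treating the five sub-tasks as judged independently,
\[
  a_{\mathrm{event}} \;\approx\; \prod_{i=1}^{5} a_i ,
  \qquad\text{e.g.}\qquad
  a_i = 0.8 \;\Rightarrow\; a_{\mathrm{event}} \approx 0.8^{5} \approx 0.33 ,
\]
so the composite task appears far less reliable than any of its parts.
```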
In our efforts to study the annotator disagreement problem for events, we began to realize that the disagreement didn't really change people's understanding of a news story or historical description. People seem to live with the vagueness of events perfectly well; the lack of precision and identity in event detection began to seem like artificial problems. This led us to the hypothesis of this paper: that the kind of annotator disagreement we see is a natural state, and that event semantics, both individual and social, is by its very nature imprecise and varied. We propose to harness this by incorporating disagreement as a parameter of the annotated meaning of events using a crowdsourcing approach, which allows us to capture the wide range of interpretations of events with a minimal requirement for agreement (needed only for, e.g., spam detection). We can then use a form of semantic clustering by defining a similarity space built not from lexical features of language, but from dimensions that come from a classification of human disagreement on event interpretation.

In this preliminary work we present the classification framework and annotation task, and describe how they will be used for event detection. This work is performed in the context of DARPA's Machine Reading Program (MRP, http://www.darpa.mil/Our Work/I2O/Programs/Machine Reading.aspx).

2 Classification Framework

Our classification of the multitude of event perspectives derives from, and forms the basis for understanding, the disagreement in the crowdsourced event annotation task, and we use it further to define similarity between events identified by the annotators. Methodologically, the initial set of classifications in the framework was produced by observing disagreement in previous annotation tasks, and we expect to further extend and refine the set as we conduct new annotation tasks.

We identify three high-level classes of disagreement on the annotation of events:

– ontology: disagreements on the basic status of events themselves as referents of linguistic utterances, for example whether people are events, or whether events exist at all.
– granularity: disagreements that result from issues of granularity, such as the location being a country, region, or city, or the time being a day, week, month, etc.
– interpretation: disagreements that result from (non-granular) ambiguity, differences in perspective, or error in interpreting an expression, for example classifying a person as a terrorist or a hero, or the "October Revolution" having taken place in September.

2.1 Ontology Disagreements

We do not address ontological disagreements on events in this paper, and we assume annotation tasks to be defined by a particular ontology. The literature and history of event ontology is vast; see [1] for a good start. We assume for the purposes of this framework that events do exist (it is a particular ontological position that they don't), that they are located in space, occur over some time, have a prescribed type, have temporal parts, and have participants. This gives us five dimensions in which to classify possible annotator disagreements: space, time, classification, composition, and participation.
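One way to make the framework concrete is as a small data model. The following sketch is our own encoding, not an artifact of the annotation task; it simply names the three disagreement classes and the five dimensions used throughout the rest of the paper:

```python
from enum import Enum

class DisagreementClass(Enum):
    """The three high-level classes of disagreement (Section 2)."""
    ONTOLOGY = "ontology"
    GRANULARITY = "granularity"
    INTERPRETATION = "interpretation"

class Dimension(Enum):
    """The five dimensions along which annotators can disagree (Section 2.1)."""
    SPATIAL = "space"
    TEMPORAL = "time"
    CLASSIFICATIONAL = "classification"
    COMPOSITIONAL = "composition"
    PARTICIPATION = "participation"
```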
2.2 Granularity Disagreements

We consider disagreements on levels of granularity to be, for the most part, agreement about what the event refers to but disagreement about what level of detail is important to extract and identify the event.

– Spatial granularity disagreements occur when the location can be specified at sizes within some regional containment. If a sentence said, "...a bombing in a downtown Beirut market...", the event might have taken place in "downtown Beirut", "Beirut", even "Lebanon" or the "Middle East". Each is correct, but typical gold standards define only one to be.
– Temporal granularity disagreements occur when the time can be specified at different durations of temporal containment. If a sentence said, "...a bombing last Wednesday during the busy lunch hour...", the bombing might have taken place at "lunch hour", "last Wednesday", even "last week", "2001", etc.
– Compositional granularity disagreements occur when events are referred to by their parts at different levels of composition. Events are infinitely decomposable, and while this won't be reflected explicitly in a textual description, the compositionality does manifest as an abundance of ways of referring to what happened. If a sentence said, "...a bombing took place last week, the explosion rocked the central marketplace...", we might say the event "explosion" is part of the event "bombing" and that the "explosion" event is not the one of interest. There are many types of compositional disagreement (see section 2.3 below); here we refer only to disagreements in labeling the events in a way that affects counting, e.g. are there two events in the sentence or one? This category includes aggregate event mentions, such as "5 bombings in Beirut", for which annotators may disagree on whether the "5 bombings" is one event with 5 parts, or 5 events.
– Classificational granularity disagreements occur when events are classified at different places in a given taxonomy, such that one class subsumes the other. If the annotators were provided with a taxonomy of events in which bombing is subsumed by attack, which is in turn subsumed by event, they may disagree on whether a particular event is a "bombing" or an "attack".
– Participant granularity disagreements occur when event participants are part of some group that can be identified at different levels. If a sentence said, "...a shooting by Israeli soldiers...", we might say the participants are "soldiers", "Israeli soldiers", the "Israeli Army", or "Israel".

Thus, the identification of an event by human annotators can disagree in any of these granular dimensions with respect to the words used in the annotated text, while still representing a general agreement about the event itself. It is a peculiarity of NLP annotation tasks that this would be considered disagreement at all. Often we observe disagreement in granularity when different levels of detail are needed to distinguish different events that share some property at some level. For example, if there were two bombings in Beirut on September 5th, some annotators would consider it more important to fix the time of day of each bombing, or to identify the participants by their role and name.
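As a sketch of what "agreement up to granularity" could mean operationally, the following toy check treats two spatial annotations as compatible when one contains the other. The containment table and function names are our own illustration, not part of the annotation task definition:

```python
# Illustrative only: treat two location annotations as compatible if one
# contains the other in a (hand-specified) spatial containment chain.
SPATIAL_CONTAINMENT = {
    "downtown Beirut": "Beirut",
    "Beirut": "Lebanon",
    "Lebanon": "Middle East",
}

def ancestors(place: str) -> set:
    """All coarser regions containing `place`, including itself."""
    chain = {place}
    while place in SPATIAL_CONTAINMENT:
        place = SPATIAL_CONTAINMENT[place]
        chain.add(place)
    return chain

def compatible(a: str, b: str) -> bool:
    """True if the two annotations differ only in spatial granularity."""
    return a in ancestors(b) or b in ancestors(a)

print(compatible("downtown Beirut", "Lebanon"))  # True: granularity difference
print(compatible("Beirut", "Damascus"))          # False: genuine disagreement
```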
In previous attempts to define event annotation tasks, researchers have typically "perfumed" annotator disagreement on granularity by forcing one choice in particular contexts. Examples include fixing the granularity for all events to a day; if a day is unavailable, to the week, then the month, then the year, then the decade. This is regardless of whether that choice is believed by the annotator to be the most relevant level of detail, or even correct. These choices may reduce disagreement according to some measure, but we argue that they do not fix the problem, they simply cover it up: they are brittle in that they cannot be reused for applications requiring a different granularity; they make the task harder to learn (for machines), as they force an interpretation that people may not consistently have; and they occasionally force annotators to make the wrong choice in certain situations, even when they know it is wrong.

2.3 Interpretation Disagreements

Disagreements on interpretation reflect genuine disagreement about what the event refers to. As with granularity, the disagreement can come from an event's relation to other entities, and we break interpretation disagreements into the same five dimensions. Interpretation disagreements also include errors and misunderstandings by the annotators.

– Spatial interpretation disagreements occur when the location is vague, controversial, has some context that may change the coordinates, or involves perspectives that change some element of the spatial containment across annotators. For example, the location of a bombing could be "the front lines", which may be shifting and difficult to pin down in latitude and longitude, or "Prussia", which is still the name of a region but was once also the name of a much larger country. A location such as Taiwan may be considered by one annotator to be part of the People's Republic of China, and by another not to be.
– Temporal interpretation disagreements, similar to spatial ones, may occur when the time is vague or has some context that changes the actual time points. For example, the time of a bombing may be reported in a country whose time zone makes the time or even the day of the event different, or an expression like "the past couple of days" may be taken by one annotator to be a duration of two days and by another to be a different duration. Relative dates like "the end of World War II" or "the October Revolution" (which took place in September) can also cause genuine disagreement among annotators if they are required to normalize the date to a specific year, month, and day.
– Compositional interpretation disagreements occur when events are referred to by their parts and the annotators disagree on what the parts are. This includes the direction of the composition, e.g. whether the "bombing" is part of the "explosion" or the "explosion" is part of the "bombing" in the previous example. It also includes the placement by annotators of implied events that contain, or are contained by, the mentioned ones.
– Classificational interpretation disagreements occur when events are classified under different classes and one class does not imply the other (as opposed to granularity). This includes cases where the two classes are logically disjoint, and cases where they are not disjoint but lie in different branches of the taxonomy.
– Participant interpretation disagreements occur when the participants are vague (e.g. "Western Authorities") or controversial (e.g. "Pakistan denied responsibility for the bombing"), or have some context that causes an annotator to differ from others. For example, in "Saddam Hussein's top advisor called the bombing an outrage", an annotator might assume that the advisor would not have spoken unless it was what he was told to say, and attribute "Saddam Hussein" as the participant in the "called" event, whereas a stricter reading would have the advisor as the participant.

The most common form of interpretation disagreements are ones that stem from misreadings of the text. It is important to note that, most of the time, human readers are very tolerant of these kinds of errors in forming their understanding of what happened. It may seem reasonable to try to "correct" these errors in order to reduce disagreement, but we claim that if annotation is to scale, we need to be tolerant of them.

Interpretation disagreements are more difficult to account for than granularity disagreements. Thus, in the first version of this crowdsourced annotation experiment we focus on granularity disagreements only.

3 Annotation Task

NLP systems typically use the ground truth of an annotated corpus in order to learn and to evaluate their output. Traditionally, the ground truth is determined by humans annotating a sample of the text corpus with the target events and entities, with the aim of optimizing inter-annotator agreement by restricting the definition of events and providing annotators with very precise guidelines. In this paper, we propose an alternative approach to event annotation, which introduces a novel setting and a different perspective on the overall goal.

Table 1. Annotation Matrix for Putative Event_i

          Temporal       Spatial        Participants   Compositional  Classificational
Event_i   1 2 3 4 5 ø    1 2 3 4 5 ø    1 2 3 4 5 ø    1 2 3 4 5 ø    1 2 3 4 5 ø
ann1      1 0 1 1 0 1    0 1 1 0 1 0    1 0 0 0 0 1    1 0 0 0 0 0    1 0 1 1 1 0
...       ...            ...            ...            ...            ...
annN      1 1 1 1 0 1    0 1 1 0 1 0    1 1 0 0 1 1    1 0 0 0 0 0    1 0 1 1 1 0

By analogy to image and video tagging crowdsourcing games, e.g. the Your Paintings Tagger (http://tagger.thepcf.org.uk/) and the Yahoo! Video Tag Game [3], we envision that a crowdsourcing setting could be a good candidate for addressing the problem of insufficient annotation data. However, we do not exploit the typical crowdsourcing agreement between two or more independent taggers; on the contrary, we harness their disagreement. Our goal is to allow for maximum disagreement between the annotators in order to capture maximum diversity in the event expressions.

Annotation Matrix: In section 2 we introduced a classification framework for understanding the disagreement between annotators. In our annotation task we only consider the granularity-based disagreements, with five axes and five levels of granularity for each axis. Following this, for each putative event (a marked verb or nominalized verb), we build an Annotation Matrix (Table 1) from the input of all annotators. We can subsequently use these annotation matrices for an analysis over the whole collection of events, e.g. for determining similarity between different events and thus recognizing missed coreferences. We can also use the matrices for an analysis of the annotation space of each individual event. For example, the highest agreement at each axis level could indicate the most likely granularity for the event, while still giving a sense of the range of acceptable granularities in each dimension. Such in-depth analysis of the annotations can allow us to identify a new set of features that can help to improve event extraction. For example, we could expect to find dependencies between the type of an event and the level of granularity of its spatial or temporal entities.
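A minimal sketch of how such a matrix might be represented and analyzed follows. The NumPy encoding, the function names, and the choice of cosine similarity are ours; the paper does not fix a particular similarity measure:

```python
import numpy as np

# One putative event's annotation matrix, shaped (annotators, 5 dimensions, 6
# columns): granularity levels 1-5 plus a final "no answer" column (the ø
# column in Table 1). Cells are 0/1; an annotator may tick several levels.
DIMENSIONS = ["temporal", "spatial", "participant", "compositional", "classificational"]

def column_agreement(matrix: np.ndarray) -> np.ndarray:
    """Fraction of annotators selecting each (dimension, level) cell."""
    return matrix.mean(axis=0)  # shape (5, 6)

def most_likely_granularity(matrix: np.ndarray) -> dict:
    """Per dimension, the level (1-5, or 'ø') chosen by the most annotators."""
    agr = column_agreement(matrix)
    labels = ["1", "2", "3", "4", "5", "ø"]
    return {dim: labels[int(agr[i].argmax())] for i, dim in enumerate(DIMENSIONS)}

def event_similarity(m1: np.ndarray, m2: np.ndarray) -> float:
    """Cosine similarity between two events' aggregated annotation profiles --
    one plausible way to surface missed-coreference candidates."""
    p1, p2 = column_agreement(m1).ravel(), column_agreement(m2).ravel()
    denom = np.linalg.norm(p1) * np.linalg.norm(p2)
    return float(p1 @ p2 / denom) if denom else 0.0

# Example with two made-up annotators:
m = np.zeros((2, 5, 6), dtype=int)
m[0, 1, 1] = 1                # annotator 1: spatial level 2 (e.g. a city)
m[1, 1, 1] = m[1, 1, 2] = 1   # annotator 2: spatial levels 2 and 3
print(most_likely_granularity(m)["spatial"])  # "2"
```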
Annotation Setting: For the proposed annotation task we plan to use a sample of the 10,000 documents taken from the Gigaword corpus (used in the context of DARPA's Machine Reading Program (MRP)), together with several sources of background knowledge. The background knowledge includes, for example, the IC++ Domain Ontology for Violent Events (identifying event types and binary relations), geographical and temporal resources, as well as general lexical resources such as WordNet and DBpedia.

A pre-annotation is performed by automatically marking all the verbs and nominalized verbs as putative events (Fig. 2); this includes both events from the IC++ ontology, as well as reporting and other communication events. The IBM Human Annotation Tool (HAT) was used as an initial annotation interface. Our background knowledge base allows us to pre-label temporal, spatial, and participant entities with granularities (e.g. city, region, country), and we provide an a-priori mapping from these to the numbers in the annotation matrix (a sketch of such a mapping closes this section). The annotators do not need to know the granularity level: they are presented with all the possible choices and select one (or more), and their choices are automatically mapped into the matrix. For example, for the sentence "A bomb exploded in Beirut, Lebanon last Friday," the annotator would be presented with "exploded" as the putative event, and could select Beirut, Lebanon, or both as the location. Since our background knowledge includes the facts that Beirut is a city and Lebanon a country, these selections are mapped to granularity levels 2 and 3, respectively.

Fig. 1. Annotation interface

We ran exploratory annotation experiments with the IBM Human Annotation Tool (Fig. 1), and proceeded with a larger annotator pool on Amazon Mechanical Turk and CrowdFlower. Annotation data was collected according to the stages sketched in Fig. 2. As shown in the figure, the process comprises four phases (I-IV). Each phase is split into two main steps: (A) collecting an initial set of annotations (a different type of annotation in each phase) and (B) performing a spam-filtering step. In each phase we select from the step A results items that can be used as gold standard items in step B.

Fig. 2. Crowdsourcing Annotation Process
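Returning to the pre-annotation mapping described above, the following is a minimal sketch of that step. The level numbering beyond the Beirut/Lebanon example (neighbourhood = 1, region = 4, continent = 5) and the hard-coded lookup tables are our own assumptions; a real system would query a gazetteer or the background knowledge base rather than a dictionary:

```python
# Illustrative mapping from background-knowledge entity types to the spatial
# granularity levels used in the matrix, following the Beirut/Lebanon example.
SPATIAL_LEVEL = {"neighbourhood": 1, "city": 2, "country": 3, "region": 4, "continent": 5}

# Hypothetical background-knowledge lookup (e.g. backed by GeoNames or DBpedia).
ENTITY_TYPE = {"Beirut": "city", "Lebanon": "country", "Middle East": "region"}

def record_spatial_choices(selected: list) -> list:
    """Turn an annotator's selected locations into a 6-slot 0/1 row:
    levels 1-5 plus a trailing ø column for 'no selection'."""
    row = [0] * 6
    if not selected:
        row[5] = 1  # ø: the annotator gave no location
    for place in selected:
        row[SPATIAL_LEVEL[ENTITY_TYPE[place]] - 1] = 1
    return row

print(record_spatial_choices(["Beirut", "Lebanon"]))  # [0, 1, 1, 0, 0, 0]
print(record_spatial_choices([]))                     # [0, 0, 0, 0, 0, 1]
```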
4 Related Work

This work derives directly from our efforts in the Machine Reading Program (MRP) to define an annotation task for event coreference. The process of developing guidelines is highly iterative: starting with an initial set of requirements drawn from simple examples, the guidelines are applied by a small group, the disagreements in particular are studied, and the guidelines are modified to address them. The process is repeated until the agreement (typically a κ score) reaches an acceptable threshold, and the guidelines are then distributed to the actual annotators. Developing the guidelines usually takes several months and requires language experts.

The idea of analyzing and classifying annotator disagreement on a task is therefore not new, but part of the standard practice in developing guidelines, which are widely viewed as necessary for human annotation tasks. However, the goal of classifying disagreement in most previous efforts has been to eliminate it, not to exploit it. This can be seen in most annotation guidelines for NLP tasks. For example, in [4], the instructions state that all modality annotations should "ignore temporal components of meaning. For example, a belief stated in the future tense (Mary will meet the president tomorrow) should be annotated with the modality 'firmly believes' not 'intends' or 'is trying'." [4]. Here the guideline authors repeat that these choices should be made "even though other interpretations can be argued."

Similarly, the annotator guidelines for the MRP Event Extraction Experiment (aiming to determine a baseline measure for how well machine reading systems extract attacking, injuring, killing, and bombing events) [5] show examples of restricting humans to follow one interpretation, for example for location, in order to ensure a higher chance of inter-annotator agreement. In this case, the spatial information is restricted to "country", even when more specific location indicators are present in the text, e.g. the Pentagon.

There are many annotation guidelines available on the web, and they all have examples of "perfuming" the annotation process by forcing constraints to reduce disagreement (with a few exceptions). In [6] and subsequent work on emotion [7], disagreement is used as a trigger for consensus-based annotation, in which all disagreeing annotators are forced to discuss and arrive at a consensus. This approach achieves very high κ scores (above 0.9), but it is not clear whether the forced consensus really achieves anything meaningful. It is also not clear whether this is practical in a crowdsourcing environment.

A good survey and set of experiments using disagreement-based semi-supervised learning can be found in [8]. However, they use "disagreement" to describe a set of techniques based on bootstrapping, not on collecting and exploiting the disagreement between human annotators. The bootstrapping idea is that small amounts of labelled data can be exploited together with unlabelled data in an iterative process [9], with some user-relevance feedback (also known as active learning).

Disagreement harnessing and crowdsourcing have previously been used by [10] for the purpose of word sense disambiguation, and we will explore similar strategies in our experiments for event modeling. As in our approach, they form a confusion matrix from the disagreement between annotators, and then use it to form a similarity cluster. In addition to applying this technique to events, our work adds a novel classification scheme for annotator disagreement that provides a more meaningful feature space for the confusion matrix; it remains to be demonstrated whether this will have impact.

The key idea behind our work is that harnessing disagreement brings in multiple perspectives on data, beyond what experts may believe is salient or correct. This concept has been demonstrated previously in the Waisda? video tagging game [11], in which lay (non-expert) users provided tags for videos in a crowdsourcing game. The Waisda?
study showed that only 14% of the tags provided by lay users could be found in the professional video annotation vocabulary (GTAA), which indicates a huge gap between the professional and lay users' views on what is important in a video. The study also showed that the lay user tags were meaningful (as opposed to useless or erroneous), and that the sheer quantity of tags was a success factor in retrieval systems for these multimedia objects. Similarly, the steve.museum project [12] studied the link between a crowdsourced folksonomy of user tags and the professionally created museum documentation. The results showed that users tag artworks from a different perspective than that of museum professionals: again, in this separate study only 14% of lay user tags were found in the expert-curated collection documentation.

5 Conclusions

When considering approaches for detecting and extracting events in natural language text and representing the extracted events for use on the Semantic Web, we see the implications of what differentiates events from objects. When it comes to annotation tasks, the compositional nature of events plays an important role in the way annotators perceive events, annotate them, and agree on their existence.

For the goal of improving event detection, we have chosen to leverage annotator disagreement in order to obtain an event description that allows machine readers to better identify and detect events. Thus, we do not aim for annotator agreement (as in many tagging scenarios, where similarity is an indicator of success); on the contrary, we hypothesize that annotator disagreement on event annotation can actually provide a better event description from the perspective of automatic event detection. By factoring in the different viewpoints that annotators can have, the likelihood of identifying events that have been represented with such viewpoints is higher.

In this paper we have contributed a classification framework of the variety of ways in which people can perceive events, a matrix for the identification of patterns of agreement and disagreement (with the aim of later exploiting them in the machine reading of events), and a description of the design of the experiment to verify the effect of using the matrix in the annotation task.

6 Acknowledgments

The authors gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09-C-0172. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of DARPA, AFRL, or the US government. We would like to thank Sid Patwardhan from IBM Research for his contribution to the implementation.

References

1. Higginbotham, J., Pianesi, F., Varzi, A.: Speaking of Events. Oxford University Press, USA (2000)
2. Lewis, D.K.: On the Plurality of Worlds. Blackwell Publishers (1986)
3. van Zwol, R., Garcia, L., Ramirez, G., Sigurbjornsson, B., Labad, M.: Video tag game. In: 17th International World Wide Web Conference (WWW developer track), ACM (April 2008)
4. Baker, K., Bloodgood, M., Diab, M., Dorr, B., Hovy, E., Levin, L., McShane, M., Mitamura, T., Nirenburg, S., Piatko, C., Rambow, O., Richardson, G.: SIMT SCALE 2009 modality annotation guidelines. Technical Report 4, Human Language Technology Center of Excellence (2010)
5. Hovy, E., Mitamura, T., Verdejo, F.: Event coreference annotation manual. Technical report, Information Sciences Institute (ISI) (2012)
6. Ang, J., Dhillon, R., Krupski, A., Shriberg, E., Stolcke, A.: Prosody-based automatic detection of annoyance and frustration in human-computer dialog. In: Proc. ICSLP 2002. (2002) 2037–2040
7. Litman, D.J.: Annotating student emotional states in spoken tutoring dialogues. In: Proc. 5th SIGdial Workshop on Discourse and Dialogue. (2004) 144–153
8. Zhou, Z.H., Li, M.: Semi-supervised learning by disagreement. Knowl. Inf. Syst. 24(3) (2010) 415–439
9. Riloff, E., Jones, R.: Learning dictionaries for information extraction by multi-level bootstrapping. In: AAAI/IAAI. (1999) 474–479
10. Chklovski, T., Mihalcea, R.: Exploiting agreement and disagreement of human annotators for word sense disambiguation. In: UNT Scholarly Works. UNT Digital Library (2003)
11. Gligorov, R., Hildebrand, M., van Ossenbruggen, J., Schreiber, G., Aroyo, L.: On the role of user-generated metadata in audio visual collections. In: K-CAP. (2011) 145–152
12. Leason, T.: Steve: The art museum social tagging project: A report on the tag contributor experience. In: Museums and the Web 2009: Proceedings. (2009)