<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Are Annotators' Word-Sense-Disambiguation Decisions Affected by Textual Entailment between Lexicon Glosses?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Silvie Cinková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Vernerová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Malostranské náměstí 25, 118 00 Praha 1, Czech Republic, ufal.mff.cuni.cz</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1885</volume>
      <fpage>5</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>We describe an annotation experiment combining topics from lexicography and Word Sense Disambiguation. It involves a lexicon (Pattern Dictionary of English Verbs, PDEV), an existing data set (VPS-GradeUp), and an unpublished data set (RTE in PDEV Implicatures). The aim of the experiment was twofold: a pilot annotation of Recognizing Textual Entailment (RTE) on PDEV implicatures (lexicon glosses) on the one hand, and, on the other hand, an analysis of the effect of textual entailment between lexicon glosses on annotators' Word-Sense-Disambiguation decisions, compared to other predictors, such as the finiteness of the target verb, the explicit presence of its relevant arguments, and the semantic distance between corresponding syntactic arguments in two different patterns (dictionary senses).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>A substantial proportion of verbs are perceived as highly
polysemous. Their senses are difficult both to determine
when building a lexicon entry and to distinguish in context
when performing Word Sense Disambiguation (WSD). To
tackle the polysemy of verbs, diverse lexicon designs and
annotation procedures have been deployed. One
alternative to classic verb senses (e.g. to blush - to redden, as
from embarrassment or shame; cf. http://www.dictionary.com/browse/blush)
is the usage patterns coined in
the Pattern Dictionary of English Verbs (PDEV) [9], which
will be explained in Section 2.2. Previous studies [3], [4]
have shown that PDEV represents a valuable lexical
resource for WSD, in that annotators reach good
interannotator agreement despite the semantically fine-grained
microstructure of PDEV. This paper focuses on cases
challenging the interannotator agreement in WSD and
considers the contribution of textual entailment (Section 2.3) to
interannotator confusion.</p>
      <p>
        We draw on a data set based on PDEV and annotated
with graded decisions (cf. Section 2.4) to investigate
features suspected of blurring distinctions between the
patterns [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We have been preliminarily considering features
related to language usage independently of the lexicon
design, such as finiteness and argument opacity of the target
verb on the one hand, and those related to the
lexicographical design of PDEV, such as semantic relations between
implicatures within a lemma or denotative similarity of the
verb arguments, on the other hand (see Section 3 for
definitions and examples).</p>
        <p>This paper focuses on a feature related to PDEV’s
design (see Section 2.3), namely on textual entailment
between implicatures in pairs of patterns of the same lemma
entry (henceforth colempats, see Section 3.1 for definition
and more detail).</p>
        <p>We pairwise compare all colempats, examining their
scores in the graded decision annotation with respect to
how much they compete to become the most appropriate
pattern, as well as the scores of presence of textual
entailment between their implicatures. To quantify the
comparisons, we have introduced a measure of rivalry for each
pair. The more the rivalry increases, the more
appropriate both colempats are considered for a given KWIC
(key word in context: a corpus line containing a match to
a particular corpus query), and
the more similar their appropriateness scores are (see
Section 3.2).</p>
        <p>We confirm a significant positive association between
rivalry in paired colempats and textual entailment between
their implicatures.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <sec id="sec-2-1">
        <title>2.1 Word Sense Disambiguation</title>
        <p>
          Word Sense Disambiguation (WSD) [
          <xref ref-type="bibr" rid="ref8">18</xref>
          ] is a traditional
machine-learning task in NLP. It draws on the
assumption that each word can be described by a set of word
senses in a reference lexicon, and hence each occurrence
of a word in a given context can be assigned a word sense.
Large volumes of text have been manually annotated with word
senses to provide training data. Nevertheless, the
extensive experience from many such projects has revealed that
even humans do not do particularly well at
interpreting word meaning in terms of lexicon senses,
despite specialized lexicons designed entirely for this task:
the English WordNet [8], PropBank [
          <xref ref-type="bibr" rid="ref4">14</xref>
          ], and OntoNotes
Word Senses [
          <xref ref-type="bibr" rid="ref2">12</xref>
          ], to name but a few. Although the
annotators, usually language experts, neither have
comprehension problems nor are unfamiliar with using lexicons,
their interannotator agreement has been notoriously low.
This in turn makes the training data unreliable and
the evaluation of WSD systems harder.
        </p>
        <p>
          Attempts have been made to increase the interannotator
agreement, both by testing each entry on annotators while
designing the lexicon [
          <xref ref-type="bibr" rid="ref2">12</xref>
          ] and by clustering word senses post hoc (e.g. [
          <xref ref-type="bibr" rid="ref7">17</xref>
          ]), but even
lexicographers have been skeptical about lexicons with
hardwired word senses for NLP ([
          <xref ref-type="bibr" rid="ref3 ref5">13, 15</xref>
          ]).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Pattern Dictionary of English Verbs (PDEV)</title>
        <p>The reasoning behind PDEV is that a verb has no meaning
in isolation; instead of word senses, it has a meaning
potential, whose diverse components and their combinations
are activated by different contexts. To capture the
meaning potential of a verb, the PDEV lexicographer manually
clusters random KWICs into a set of prototypical usage
patterns, considering both their semantic and
morphosyntactic similarity. Each PDEV pattern contains a pattern
definition (a finite clause template where important
syntactic slots are labeled with semantic types) and an
implicature to explain or paraphrase its meaning, which also
is a finite clause (Fig. 1). The PDEV implicature
corresponds to the gloss or definition in traditional dictionaries.</p>
        <p>
          The semantic types (e.g. Human, Institution, Rule,
Process, State_of_Affairs) are the most typical syntactic slot
fillers, although the slots can also contain a set of
collocates (a lexical set) and semantic roles complementary to
semantic types. The semantic types come from an
approximately 250-item shallow ontology associated with
PDEV and drawing on the Brandeis Semantic Ontology
(BSO) [
          <xref ref-type="bibr" rid="ref9">19</xref>
          ]. The notion of semantic types, lexical sets,
and semantic roles (altogether dubbed semlabels) is, in
this paper, particularly relevant for Section 3.5.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Recognizing Textual Entailment (RTE)</title>
        <p>Recognizing Textual Entailment (RTE) is
a computational-linguistic task introduced by
Dagan et al. [5]. The task of RTE is to determine, “given
two text fragments, whether the meaning of one text can
be inferred (entailed) from another text. More concretely,
the applied notion of textual entailment is defined as a
directional relationship between pairs of text expressions,
denoted by T the entailing ‘text’ and by H the entailed
‘hypothesis’. We say that T entails H if, typically, a
human reading T would infer that H is most probably
true”. So, for instance, the text Norway’s most famous
painting, ‘The Scream’ by Edvard Munch, was recovered
yesterday, almost three months after it was stolen from
an Oslo museum entails the hypothesis Edvard Munch
painted ‘The Scream’ [5].</p>
      </sec>
      <sec id="sec-2-4">
        <title>Graded Decisions on Verb Usage Patterns:</title>
      </sec>
      <sec id="sec-2-5">
        <title>VPS-GradeUp</title>
        <p>The VPS-GradeUp data set draws on Erk’s experiments
with paraphrases (USim) [7]. VPS-GradeUp consists
of both graded-decision and classic-WSD annotation of
29 randomly selected PDEV lemmas: seal, sail,
distinguish, adjust, cancel, need, approve, conceive, act, pack,
embrace, see, abolish, advance, cure, plan, manage,
execute, answer, bid, point, cultivate, praise, talk, urge, last,
hire, prescribe, and murder. Each lemma comes with
50 KWICs processed in parallel by three annotators
(linguists, professional but non-native speakers of English).</p>
        <p>
          In the graded-decision part, the annotators judged each
pattern for how well it described a given KWIC, on a
Likert scale (a psychometric scale used in opinion surveys
that lets respondents grade their agreement or disagreement
with a given statement). In the WSD part, each KWIC was
assigned one best-matching pattern. The entire data set
contains WSD judgments on 1,450 KWICs, corresponding
to 11,400 graded decisions (50 sentences × the summed
number of patterns over the 29 lemmas). A more detailed
description of VPS-GradeUp is given by Baisa et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>Fig. 2 shows a VPS-GradeUp sample of three KWICs
of the verb abolish (see Fig. 1 for the lexicon entry).
Columns 1, 2, and 3 identify the pattern ID, lemma, and
sentence ID, respectively. Columns 4-6 and 7-9 contain
the graded and WSD decisions by the three annotators,
respectively. Column 10 contains the annotated KWIC,
which for Sentence 1 reads: President Mitterrand said
yesterday that the existence of two sovereign German states
could not be ‘ABOLISHED at a stroke’. On the third
table row, Pattern 3 was judged as maximally appropriate by
Annotators 1 and 2; Annotator 3 gave one point less. In the
WSD part, Annotator 1 voted for Pattern 2, while
Annotators 2 and 3 preferred Pattern 3.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Important Concepts</title>
      <sec id="sec-3-1">
        <title>3.1 Lempats and Colempats</title>
        <p>To begin with, we introduce the concept of lempats and
colempats. The lemma-pattern combination, as
represented by Columns 1 and 2 in Fig. 2, is called lempat. All
lempats sharing a common lemma are called colempats.
That is, the table presents the colempats abolish_1,
abolish_2, and abolish_3 and their three annotator judgments
on the sentences 1.1, 2.1, and 3.1. A pair of patterns, such
as abolish_3 and cancel_1, are also two lempats which we
could compare, but they are not colempats, because each
belongs to a different lemma (abolish vs. cancel).</p>
        <p>Fig. 2, Columns 4-6, shows that, on Sentence 1.1, the
annotators disagree in their WSD judgments (Annotators 2
and 3 voted for Pattern 3, but Annotator 1 preferred
Pattern 2). This is probably caused by the fact that
Annotators 1 and 2 had also regarded Pattern 2 as somewhat
appropriate (Row 2). Interestingly, Annotator 1 even
considered Pattern 1 maximally appropriate for the given KWIC,
unlike the others, but eventually voted neither for this
pattern nor for Pattern 3. As with all manual annotations,
human error cannot be a priori dismissed, but even the
oddest judgments mostly turn out to come with a plausible
explanation.</p>
        <p>How, then, do the graded decisions map onto the WSD
judgments, if they map at all? To perform quantitative
observations of how much two patterns compete in the WSD
annotation, we needed a measure of appropriateness of a
given pattern for a given KWIC across all annotators (see
Section 3.2), along with yet another measure to tell which
two patterns were the most serious competitors (rivalry,
see Section 3.3).</p>
        <p>Given a lempat, we need to measure its appropriateness
for a given KWIC. To be able to examine the mapping
between the graded decisions and the WSD annotation,
we observe rivalry within each possible pair of colempats
for a given KWIC. (Fig. 3: appropriateness values of all
possible judgment triples, sorted in ascending order of
appropriateness.)</p>
        <p>The appropriateness function returns values ranging from 3 to 21. These
are all possible sums of judgments by three annotators on
7-point Likert scales: as a minimum and maximum, a
pattern can obtain 1 and 7 from each annotator, respectively.
The 3.5 coefficient is roughly the maximum standard
deviation (sd) possible with three judgments ranging from 1
to 7. Compared to mean or median, appropriateness
discounts triples with higher dispersion. We made no effort
to generalize this measure beyond the specific setup of this
particular experiment with 7-point Likert scales and three
annotators, and therefore the x value must be a natural
number ranging from 1 to 7 and the sum must be the sum
of exactly 3 such x.</p>
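        <p>The formula itself did not survive typesetting here; below is a
minimal R sketch of one function consistent with the description
above. Using the unscaled standard deviation as the dispersion
discount is our assumption; the text only states that the 3.5
coefficient is roughly the maximum possible sd.
appropriateness &lt;- function(x) {
  # x: judgments of one pattern on one KWIC by exactly three
  # annotators, each an integer on the 7-point Likert scale
  stopifnot(length(x) == 3, all(x %in% 1:7))
  # the sum ranges over 3..21; subtracting the standard deviation
  # (at most roughly 3.5 for three values in 1..7) discounts
  # triples with higher dispersion
  sum(x) - sd(x)
}
appropriateness(c(7, 7, 7))  # 21: unanimous maximum
appropriateness(c(1, 4, 7))  # 9: same sum as c(4, 4, 4), but discounted</p>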
        <p>Fig. 3 shows the shape of the curve. The x-axis contains
all possible triples of judgments 1-7, drawn with replacement,
sorted in ascending order according to their corresponding
appropriateness value. The curve is designed to reflect
opinion strength by steepness: the extreme scale positions
indicate stronger opinions than the central ones.
Therefore the dispersion of the judgments affects
appropriateness more strongly at both ends of the scale than around
its center.</p>
        <p>To compare the competition between pairs of PDEV
patterns, we have introduced rivalry. Rivalry always concerns
the appropriateness rates for a pair of patterns of one
lemma (colempats), being computed for all pairs. Rivalry
increases with the appropriateness of each colempat and
with decreasing difference between the appropriateness
values in the given colempat pair: the higher the rivalry,
the more the two patterns compete for being selected
as the best match in the WSD annotation. The rivalry
function is simple:
Rivalry = max(appr_pair) − (max(appr_pair) − min(appr_pair)) = min(appr_pair).</p>
        <p>By appr_pair we understand the two computed
appropriateness values of the patterns in a colempat pair:
max(appr_pair) and min(appr_pair). They represent the higher and
the lower appropriateness, respectively. Hence, rivalry is
defined as the difference between the higher
appropriateness value and the difference between that and the lower
appropriateness value, which boils down to the lower
appropriateness. The idea behind rivalry is that, given the
nature of the WSD annotation task, we are interested in
colempats competing at the positive rather than at the
negative end of the scale.</p>
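        <p>Restated as a small R function (the variable layout is ours;
the formula is the one given above):
rivalry &lt;- function(appr_pair) {
  # appr_pair: the two appropriateness values of a colempat pair
  # on one KWIC; the expression simplifies to min(appr_pair)
  max(appr_pair) - (max(appr_pair) - min(appr_pair))
}
rivalry(c(18, 17))  # 17: both colempats highly appropriate
rivalry(c(18, 4))   # 4: one clear winner, little competition</p>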
        <p>It is to be emphasized that rivalry is always computed
on a given KWIC. Hence we cannot immediately tell, e.g.,
the rivalry between abolish_1 and abolish_3 in general,
but we get one rivalry value of this colempat pair for each
of the 50 KWICs.</p>
        <p>Measuring rivalry is interesting, even though we have
not yet abstracted from individual KWICs; it enables us
to identify cases of pattern overlap for further analysis of
both the design of the patterns and of contextual features
in the KWICs affected.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.4 Corresponding Synslots</title>
        <p>As Fig. 1 shows, the syntactic slot fillers of the target verb
in the pattern definition are described by semantic labels
(henceforth semlabels). Each syntactic slot (henceforth
synslot) also has a syntactic function in the clause:
subject, object, adverbial, or complement. When observing
synslots across a pair of colempats, we check whether a
synslot with a particular syntactic function (e.g. object)
is present in both colempats in the pair. When this is the
case, these two synslots are called corresponding synslots.</p>
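        <p>As a trivial R sketch (the slot inventories are hypothetical):
corresponding_synslots &lt;- function(slots_a, slots_b) {
  # syntactic functions present in both colempats of a pair
  intersect(slots_a, slots_b)
}
corresponding_synslots(c("subject", "object"),
                       c("subject", "adverbial"))  # "subject"</p>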
      </sec>
      <sec id="sec-3-3">
        <title>Semantic Distance between Corresponding</title>
      </sec>
      <sec id="sec-3-4">
        <title>Synslots</title>
        <p>
          In a past experiment, we measured how rivalry is
affected by the extent to which the sets of synslot fillers
in a colempat pair are cognitively similar. We observed
a statistically significant (yet weak) positive association.
The synslot fillers were represented by the semlabels. To
obtain their semantic similarity, we first built a corpus of
pattern definitions and implicatures from the entire PDEV.
Then we fed this corpus to a neural network, which
created a vector representation for each word: text2vec [22],
an implementation of the word2vec [
          <xref ref-type="bibr" rid="ref6">16</xref>
          ] neural
network for R. The original task on which the neural network
was trained was guessing the context around each word. Its
practical use draws on the so-called Distributional
Hypothesis [10], according to which words with similar context
distributions are more semantically related than those with
dissimilar context distributions. The network creates a vector
representation of each word, with the dimensions of each word
vector being the other words; the similarity of two vectors
reflects the distributional (and hence semantic) similarity of
the two words. We defined the mutual similarity of each two
words by the cosine similarity of their vectors. For more
details see [2].
        </p>
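        <p>A minimal R sketch of the similarity computation (the toy
vectors merely stand in for the learned embeddings):
cosine_sim &lt;- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))
# hypothetical 3-dimensional vectors for two semlabels
human &lt;- c(0.8, 0.1, 0.3)
institution &lt;- c(0.7, 0.2, 0.4)
cosine_sim(human, institution)  # ~0.98: distributionally close</p>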
      </sec>
      <sec id="sec-3-5">
        <title>3.6 Verb Finiteness</title>
        <p>Finiteness is a morphosyntactic category associated with
verbs. Virtually all verbs appear in both finite and
non-finite forms when used in context. A finite verb form is
one that expresses person and number.
Languages differ in whether these categories are expressed
morphologically (e.g. by affixes or stem vowel changes) or
syntactically (the verb being obligatorily complemented with a
noun/pronoun expressing these categories explicitly). Finite forms
are typically all indicative and conditional forms, as well
as some imperative forms, e.g. reads, are reading, (they)
read, čtu, čtěte, chtěl by, gehst, allons!. Non-finite forms are
infinitives (to read, to have read, to be heard, to have been
heard) and participles along with gerunds and supines
(reading, known, deleted, försvunnit). The grammars of
many languages have diverse other finite as well as
non-finite verb forms. Non-finite forms typically allow more
argument omissions than finite forms: to go to town vs.
*went to town (incorrect). This suggests that descriptions
of events rendered by non-finite verb forms may be more
vague and, in terms of annotation, more prone to match
several different patterns/senses at the same time. Verb
finiteness is easy to determine, and it was therefore
annotated by only one annotator in our data set.</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.7 Argument Opacity</title>
        <p>Argument opacity typically, but not necessarily, relates to
verb finiteness. By argument opacity we mean how many
arguments relevant for disambiguating the target verb
are either omitted in the context (e.g. the subject of an
infinitive) or ambiguous or vague. Ambiguous and vague arguments
are often arguments expressed by personal pronouns that
refer to entities mentioned distantly from the target verb,
sometimes even not directly, but by longer chains of
pronouns (so-called coreference or anaphora chains), or
arguments expressed by indefinite or negative pronouns. Some
examples of opaque verb contexts follow:</p>
        <p>The Greater London Council was ABOLISHED in 1986.
(Who abolished it?)</p>
        <p>The company’s ability to adapt to new opportunities
and capitalize on them depends on its capacity to share
information and involve everyone in the organization in a
systemwide search for ways to improve, ADJUST, adapt,
and upgrade. (Who exactly adjusts what?)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Textual Entailment Annotation</title>
      <sec id="sec-4-1">
        <title>4.1 Annotation Procedure</title>
        <p>Three annotators (linguists familiar with PDEV as well
as with RTE, professional but non-native speakers of English)
obtained paired implicatures of colempats of each target verb
and judged whether one entailed
the other (specifying the direction), or whether the
entailment was bidirectional or absent (cf. Section 2.3). The
definition of entailment used here is based on the conception
of textual entailment coined by Dagan et al. (cf. RTE, [6]).
For the purposes of this paper, we collapsed the annotation
into entailment presence-absence judgments.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Annotation Results</title>
        <p>The three annotators processed 1,091 implicature pairs
(both implicatures always belonged to the same lemma).
The annotators were allowed to see the entire entry,
including example sentences, but they were told to focus
on the implicatures. Their pairwise percentage agreement
scores were 73.8, 74.6, and 83.3. Fleiss’ kappa was
moderate: 0.41. While RTE annotations usually reach the 0.6
desired for semantic annotations, our lower score is
understandable: the PDEV implicatures are much more abstract and
hence more vague than regular text, since the arguments
of the target verb are described by ontology labels. See an
example of two pattern implicatures of the verb seal:
Human covers the surface of Artifact with Stuff.
Human encloses Physical Object in an airtight
Container.</p>
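        <p>A hedged R sketch of the agreement computation; the paper
does not name its tool, the irr package is our choice, and the
judgment matrix here is randomly generated rather than real:
library(irr)
# one row per implicature pair, one column per annotator
judgments &lt;- matrix(sample(c("yes", "no"), 3 * 1091, replace = TRUE),
                    ncol = 3)
kappam.fleiss(judgments)  # the real annotation yielded kappa = 0.41</p>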
        <p>We merged the three annotations by replacing the “yes”
and “no” judgments with 1 and 0, respectively, and taking
their means. With this setup, the judgments could take
only four values: 0, 0.33, 0.66, and 1. We treated them as
values of a categorical ordinal variable.</p>
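        <p>In R, this merging step amounts to taking the mean of the
binarized votes:
merge_votes &lt;- function(votes) mean(votes == "yes")
merge_votes(c("yes", "no", "no"))    # 0.33: one vote for entailment
merge_votes(c("yes", "yes", "yes"))  # 1: full agreement on entailment</p>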
        <p>Fig. 4 shows the annotation results for each lemma. To
facilitate the reading, we display the judgments as the
number of annotator votes in favor of entailment. The
proportions are compared to the verb see, with its 192
colempat pairs. Annotator disagreement is represented by
1 and 2 votes. In terms of proportions within the given
lemma, the most problematic verbs were the small verbs
(i.e. those with a small number of colempat pairs)
abolish, cancel, hire, and praise, along with the large verbs
act, point, and talk.</p>
        <p>A typical colempat pair with full agreement on no
implicature entailment is e.g. act_10-12. The example also
includes the pattern definition for better understanding:
Pattern: Phrasal verb. Human acts Event or
Human Role or Emotion out.</p>
        <p>Implicature: Human performs Role, not necessarily
sincerely, or behaves as if feeling Emotion.</p>
        <p>Pattern: Idiom. Human acts POSDET age.</p>
        <p>Implicature: Human behaves in a manner appropriate to
their age.</p>
        <p>Although both these events have something to do with
behavior, we can neither normally assume that someone who
acts their emotions out is necessarily behaving according
to their age, nor the other way round. Thus we observe no
implicature entailment relation between these two
colempats.</p>
        <p>A typical colempat pair with full agreement on
implicature entailment is e.g. act_1-9. This example also
illustrates that entailment does not require synonymy.
The second implicature entails the first; that is, when
an actor performs a character in the theater, they are,
normally, pursuing a motivated action by pretending to
be a particular character for their audience.</p>
        <p>Pattern: Human or Institution or Animal or Machine acts</p>
        <p>Implicature: Human or Institution or Animal or Machine
= Agent performs a motivated Action</p>
        <p>Pattern: Human acts (Role) (in Performance)</p>
        <p>Implicature: Human plays Role = Theatrical (in Performance)</p>
        <p>However, the general nature of the implicatures makes the
entailment annotation difficult. Below follows an example
where one annotator voted against the entailment, the
act_1-11 pair. The act_1 colempat is listed in the previous
example. Here follows the act_11 colempat:</p>
        <sec id="sec-4-2-1">
          <title>Pattern: Phrasal verb. Human acts up.</title>
          <p>Implicature: Human behaves badly. Human is typically
a naughty child..</p>
          <p>The annotators clearly disagree on whether bad
behavior is normally perceived as a motivated action.
They were instructed to focus only on the implicature.
At the same time, they were allowed to see the entire
entry. Most likely with this entry, two annotators were
influenced by the very verb act up. The verb act up
suggests a motivated action (e.g. start screaming to attract
attention, this being perceived as bad manners in the
given situation). The plain implicature leaves leeway
for considering non-motivated actions (can very young
infants act consciously?) or non-actions perceived as bad
behavior (even a child can behave badly by not acting, e.g.
by failing to come to someone’s help).</p>
          <p>The reasons for annotator disagreements are very
diverse, including obvious annotation errors, and their ex
post analysis is often subjective. We show a case from still
the same verb, act_1-12 (see act_1 above).</p>
          <p>
Here, the pro-entailment decision by two annotators was
most likely motivated by the fact that act_1
specifies Machine as Agent and lets it perform a motivated
Then, naturally, even malfunction can be a motivated
action. The remaining annotator, on the other hand, did not
accept malfunction as a motivated action.</p>
          <p>Often the uncertainty lies in the interpretation of the
semantic types. For instance, hire_1-2 differ in the object
of hiring. In the first colempat it is Human or Institution,
whose services are obtained for payment. In the second
colempat it is a Physical Object, which is used for an
agreed period of time against payment. In real life, this
corresponds to e.g. hiring a gardener to take care of a
garden vs. hiring an apartment. Such two events naturally
do not entail each other in any way. However, the general
wording of implicatures allows one annotator to regard the
use of a Physical Object against payment as a service
provided by a Human or Institution. Consider e.g. Mary hires
John to let her live in an apartment that belongs to him.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Association between Implicature</title>
    </sec>
    <sec id="sec-6">
      <title>Entailment and Rivalry</title>
      <sec id="sec-6-1">
        <title>Linear Model with Rivalry Abstracted from</title>
      </sec>
      <sec id="sec-6-2">
        <title>Individual KWICs</title>
        <p>While the textual entailment is observed between two
colempat implicatures independently of their instances in
corpus evidence, rivalry is always associated with both the
given pair of colempats and the KWIC, with respect to
which their appropriateness was judged (cf. Section 3.3).
We had 50 rivalry scores for each colempat pair, since the
VPS-GradeUp annotators were judging the
appropriateness of each pattern for each of the 50 KWICs per lemma.
For each colempat pair we selected the KWIC on which
their rivalry was highest.</p>
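        <p>In R, this selection can be sketched as follows (the data
frame and column names are our assumptions):
# one row per (colempat pair, KWIC) combination with its rivalry;
# keep the maximum rivalry per pair across the 50 KWICs
abstr_rivalry_per_pair &lt;- aggregate(rivalry ~ pair,
                                    data = rivalry_per_kwic, FUN = max)</p>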
        <p>To examine the association between the textual
entailment between implicatures and rivalry between colempats,
we built this linear regression model using the lm()
function in base R [20].</p>
        <p>Call:
lm(formula = abstr_rivalry ~ factor(numMeans), data = all_entailment)</p>
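        <p>A sketch of fitting and inspecting this model, assuming
all_entailment holds one row per colempat pair with its merged
entailment vote (numMeans) and maximum rivalry (abstr_rivalry):
fit &lt;- lm(abstr_rivalry ~ factor(numMeans), data = all_entailment)
summary(fit)  # adjusted R-squared of about 0.35, per the text below</p>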
        <p>According to the Adjusted R-squared, it explains
approximately 35% of the variance of rivalry. This means
that entailment is quite a strong predictor. Apart from
that, the individual coefficient values in the model nicely
confirm our assumption that entailment increases
rivalry: one vote for entailment (i.e. value 0.33)
increases the rivalry coefficient by 0.04, two votes increase
it by 0.08, and three votes increase it by 0.15. Their
individual standard errors are an order of magnitude smaller
than the coefficients themselves, which means that the
levels would not overlap; that is, every single entailment vote
matters. The model is highly significant, and so are all
levels of the entailment values (the p-value is always much smaller
than 0.05). This, along with the randomness of the lemma
selection, means that we can expect similar results
for other verbs annotated the same way.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Linear Model with Rivalry on All KWICs</title>
        <p>We ran the same experiment also without abstracting from
the KWICs. The model is still highly significant, but
extremely weak (explaining about 20% of the rivalry
variation). This makes sense, since this time we also
included observations with the same entailment conditions
but lower rivalry. This way we introduced KWICs where
the positive effect of entailment may have been overcome
by the negative effect of other predictor values, which we
have not included in the model.</p>
        <p>Call:
lm(formula = rivalry ~ factor(entail_numMeans), data = vyplyv)</p>
        <p>Residuals:
    Min      1Q  Median      3Q     Max
-3.8233 -0.8442 -0.5540  0.2811 15.1558</p>
        <p>
We have observed a statistically significant positive effect
of textual entailment of colempat implicatures on the
rivalry between colempats in PDEV. It is evidently not the
only cause of increasing rivalry, as shown by the
weakness of the model, but has the strongest effect. The
implicature is the part of patterns that corresponds to classic
word senses in traditional lexicons. This suggests that the
traditional conception of word senses as semantic
definitions rather than usage definitions is very useful in sense
distinction, whenever annotators agree. On the other hand,
as with traditional word senses, the interannotator
agreement is low. Like traditional word senses shaped as
lexicon glosses/definitions, the implicatures are too abstract to
bode well for interannotator agreement. The issue persists
even when the annotation task is set up as an RTE task
rather than as recognizing synonymy and mutual exclusivity
(according to which traditional WSD annotation decisions
are taken). (The RTE annotation task would possibly benefit
from graded annotation by many annotators, as in
word-similarity/relatedness experiments, e.g. [11].)</p>
        <p>Apart from the textual entailment, we have been
preliminarily examining other features suspected of increasing
rivalry, such as the explicit presence/absence of relevant
arguments (argument opacity, Section 3.7), semantic
distance between labels used in corresponding syntactic
positions within a colempat pair (based on text2vec [22]), and
finiteness of the target verb in the KWICs (Section 3.6).
A statistically significant linear model predicting rivalry
finds all these predictors significant (Fig. 5).</p>
        <p>However, the textual entailment turns out to be the most
effective rivalry increaser, raising each rivalry unit by 2.55
(to the extent we can believe averaged human judgments
on implication). Interestingly, verb finiteness (promising
more explicit contexts) does not help distinguish between
patterns but in fact increases rivalry (i.e. blurs distinctions
between colempats). Within the opacity family, an opaque
object is the most rivalry-increasing predictor (coeff. 1.42).
We have also been considering the factuality [21] of the
events described by the target predicates (for which we have
used verb finiteness here as a primitive proxy), but a pilot
annotation has yielded poor interannotator agreement, making
results based on such data even more speculative than those
of textual entailment between colempat implicatures, so we
have not included it in the model.</p>
        <p>Fig. 5: lm(formula = rivalry ~ w2vec_hsdrff_Sum + z_finite
+ z_args.opaque + entail_mean, data = rival)</p>
        <p>All the aforementioned predictors are apparently not
general enough to beat the effects of individual lemmas:
most lemmas are significant, have high coefficients, and
increase the predictive power of the model in Fig. 6; cf.
R-squared in both models: despite efforts to find
universal linguistic features, each verb appears to remain a little
universe in its own right.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusion</title>
      <p>We have confirmed that textual entailment between two
colempat implicatures increases rivalry between these
colempats. We also see that the more the annotators agree
on the presence of entailment, the stronger its effect is: it
grows with each annotator vote, even doubling when all
three annotators agree, compared to two annotators.</p>
    </sec>
    <sec id="sec-8">
      <title>8 Acknowledgements</title>
      <p>This work was supported by the Czech Science Foundation,
grant No. GA 15-20031S, and by the LINDAT/CLARIN project
No. LM2015071 of the MEYS CR.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[2] Silvie Cinková and Zdeněk Hlávka. Modeling Semantic
Distance in the Pattern Dictionary of English Verbs.
Jazykovedný časopis, to appear.</p>
      <p>[3] Silvie Cinková, Martin Holub, Adam Rambousek, and
Lenka Smejkalová. A database of semantic clusters of verb
usages. In Proceedings of the 8th International Conference
on Language Resources and Evaluation (LREC 2012), pages
3176-3183, Istanbul, Turkey, 2012. European Language
Resources Association.</p>
      <p>[4] Silvie Cinková, Ema Krejčová, Anna Vernerová, and Vít
Baisa. Graded and Word-Sense-Disambiguation Decisions in
Corpus Pattern Analysis: a Pilot Study. In Nicoletta Calzolari
et al., editors, Proceedings of the Tenth International
Conference on Language Resources and Evaluation (LREC 2016),
Paris, France, 2016. European Language Resources Association
(ELRA).</p>
      <p>[5] Ido Dagan, Bill Dolan, Bernardo Magnini, and Dan Roth.
Recognizing textual entailment: Rational, evaluation and
approaches. Natural Language Engineering, 15(4):i-xvii, 2009.</p>
      <p>[6] Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo
Zanzotto. Recognizing Textual Entailment: Models and
Applications. Synthesis Lectures on Human Language
Technologies. Morgan &amp; Claypool Publishers, 2013.</p>
      <p>[7] Katrin Erk, Diana McCarthy, and Nicholas Gaylord.
Investigations on Word Senses and Word Usages. In Proceedings
of the Joint Conference of the 47th Annual Meeting of the ACL
and the 4th International Joint Conference on Natural Language
Processing of the AFNLP, pages 10-18, Suntec, Singapore,
August 2009. Association for Computational Linguistics.</p>
      <p>[8] C. Fellbaum, J. Grabowski, and S. Landes. Performance
and confidence in a semantic annotation task. In WordNet: An
Electronic Lexical Database, pages 217-238. The MIT Press,
Cambridge (Mass.), 1998.</p>
      <p>[9] Patrick Hanks. Pattern Dictionary of English Verbs.
http://pdev.org.uk/, UK, 2000.</p>
      <p>[10] Zellig Harris. Distributional structure. Word,
23(10):146-162, 1954.</p>
      <p>[11] Samer Hassan and Rada Mihalcea. Cross-lingual semantic
relatedness using encyclopedic knowledge. In EMNLP 2009.
Association for Computational Linguistics, 2009.</p>
      <p>[20] R Core Team. R: A Language and Environment for
Statistical Computing. R Foundation for Statistical Computing,
2014.</p>
      <p>[21] Roser Saurí and James Pustejovsky. Are You Sure That
This Happened? Assessing the Factuality Degree of Events in
Text. Computational Linguistics, 38(2):261-299, June 2012.</p>
      <p>[22] Dmitriy Selivanov. text2vec: Modern Text Mining
Framework for R. 2016.</p>
      <p>Fig. 6: Call:
lm(formula = rivalry ~ w2vec_hsdrff_Sum + z_finite + z_args.opaque
+ entail_mean + lemmas, data = rival)</p>
      <p>Residuals:
    Min      1Q  Median      3Q     Max
-5.9380 -0.7319 -0.1572  0.1800 15.9160</p>
      <p>Coefficients:
                    Estimate   Std. Error  t value  Pr(&gt;|t|)
(Intercept)         7.3522017  0.1532486   47.976   &lt; 2e-16  ***
w2vec_hsdrff_Sum   -0.0003296  0.0011026   -0.299    0.7650
z_finitey           0.1455412  0.0161223    9.027   &lt; 2e-16  ***
z_args.opaquey      0.5009538  0.2156543    2.323    0.0202  *
z_args.opaqueobj    0.2547546  0.3372427    0.755    0.4500
z_args.opaquesubj   0.0532390  0.0217931    2.443    0.0146  *
entail_mean         1.8818343  0.0217515   86.515   &lt; 2e-16  ***
lemmasact          -3.7824581  0.1512543  -25.007   &lt; 2e-16  ***
lemmasadjust       -2.7990281  0.1619012  -17.288   &lt; 2e-16  ***
lemmasadvance      -4.2765385  0.1515831  -28.213   &lt; 2e-16  ***
lemmasanswer       -4.2405515  0.1508514  -28.111   &lt; 2e-16  ***
lemmasapprove      -3.1989511  0.1621494  -19.728   &lt; 2e-16  ***
lemmasbid          -3.9934306  0.1548404  -25.791   &lt; 2e-16  ***
lemmascancel       -2.8219473  0.1621358  -17.405   &lt; 2e-16  ***
lemmasconceive     -2.2675897  0.1583548  -14.320   &lt; 2e-16  ***
lemmascultivate    -2.7869641  0.1816034  -15.346   &lt; 2e-16  ***
lemmascure         -3.8352304  0.1688616  -22.712   &lt; 2e-16  ***
lemmasdistinguish  -2.9855282  0.1580461  -18.890   &lt; 2e-16  ***
lemmasembrace      -3.3944366  0.1624320  -20.898   &lt; 2e-16  ***
lemmasexecute      -2.2898572  0.1686455  -13.578   &lt; 2e-16  ***
lemmashire         -3.4752011  0.2089821  -16.629   &lt; 2e-16  ***
lemmaslast         -1.2512805  0.2101987   -5.953   2.65e-09  ***
lemmasmanage       -2.9204488  0.1531206  -19.073   &lt; 2e-16  ***
lemmasmurder       -3.5778433  0.2101611  -17.024   &lt; 2e-16  ***
lemmasneed          0.2515703  0.1692646    1.486    0.1372
lemmaspack         -4.3029164  0.1501447  -28.658   &lt; 2e-16  ***
lemmasplan         -1.7058389  0.1817877   -9.384   &lt; 2e-16  ***
lemmaspoint        -3.2865632  0.1512721  -21.726   &lt; 2e-16  ***
lemmaspraise       -0.2847921  0.2091486   -1.362    0.1733
lemmasprescribe    -0.2380621  0.2091980   -1.138    0.2551
lemmassail         -1.8942963  0.1567161  -12.087   &lt; 2e-16  ***
lemmasseal         -3.8569221  0.1581722  -24.384   &lt; 2e-16  ***
lemmassee          -4.3824168  0.1498710  -29.241   &lt; 2e-16  ***
lemmastalk         -3.7339660  0.1502380  -24.854   &lt; 2e-16  ***
lemmasurge         -0.8541827  0.1623112   -5.263   1.43e-07  ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1</p>
      <p>Residual standard error: 1.809 on 54503 degrees of freedom
Multiple R-squared: 0.3497, Adjusted R-squared: 0.3493
F-statistic: 862.2 on 34 and 54503 DF, p-value: &lt; 2.2e-16</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Vít</given-names>
            <surname>Baisa</surname>
          </string-name>
          , Silvie Cinková, Ema Krejčová, and Anna Vernerová.
          <article-title>VPS-GradeUp: Graded Decisions on Usage Patterns</article-title>
          .
          <source>In LREC 2016 Proceedings</source>
          , Portorož, Slovenia, May
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Hovy</surname>
          </string-name>
          , Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel.
          <article-title>OntoNotes: the 90% solution</article-title>
          .
          <source>In Proceedings of the Human Language Technology Conference of the NAACL</source>
          , Companion Volume:
          <article-title>Short Papers</article-title>
          , NAACL-Short '
          <volume>06</volume>
          , pages
          <fpage>57</fpage>
          -
          <lpage>60</lpage>
          , Stroudsburg, PA, USA,
          <year>2006</year>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Kilgarriff</surname>
          </string-name>
          .
          <article-title>"I Don't Believe in Word Senses"</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>31</volume>
          (
          <issue>2</issue>
          ):
          <fpage>91</fpage>
          -
          <lpage>113</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Karin</given-names>
            <surname>Kipper</surname>
          </string-name>
          , Anna Korhonen, Neville Ryant, and
          <string-name>
            <given-names>Martha</given-names>
            <surname>Palmer</surname>
          </string-name>
          .
          <article-title>A large-scale classification of English verbs</article-title>
          .
          <source>Language Resources and Evaluation</source>
          ,
          <volume>42</volume>
          (
          <issue>1</issue>
          ):
          <fpage>21</fpage>
          -
          <lpage>40</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ramesh</given-names>
            <surname>Krishnamurthy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Diane</given-names>
            <surname>Nicholls</surname>
          </string-name>
          .
          <article-title>Peeling an Onion: The Lexicographer's Experience of Manual Sense Tagging</article-title>
          .
          <source>Computers and the Humanities</source>
          ,
          <volume>34</volume>
          :
          <fpage>85</fpage>
          -
          <lpage>97</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Wen-tau Yih, and Geoffrey Zweig.
          <article-title>Linguistic Regularities in Continuous Space Word Representations</article-title>
          .
          <source>In HLT-NAACL</source>
          , pages
          <fpage>746</fpage>
          -
          <lpage>751</lpage>
          . The Association for Computational Linguistics,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <article-title>Meaningful Clustering of Senses Helps Boost Word Sense Disambiguation Performance</article-title>
          .
          <source>In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL</source>
          , pages
          <fpage>105</fpage>
          -
          <lpage>112</lpage>
          , Sydney, Australia,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <article-title>Word sense disambiguation: A survey</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>41</volume>
          (
          <issue>2</issue>
          ):
          <fpage>10:1</fpage>
          -
          <lpage>10:69</lpage>
          ,
          <year>February 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>James</given-names>
            <surname>Pustejovsky</surname>
          </string-name>
          , Catherine Havasi,
          <string-name>
            <given-names>Jessica</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anna</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Verhagen</surname>
          </string-name>
          .
          <article-title>Towards a generative lexical resource: The Brandeis Semantic Ontology</article-title>
          .
          <source>In Proceedings of the Fifth Language Resource and Evaluation Conference</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>