<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enriching a Lexical Semantic Net with Selectional Preferences by Means of Statistical Corpus Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Wagner</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Broad-coverage ontologies which represent lexical semantic knowledge are being built for more and more natural languages. Such resources provide very useful information for word sense disambiguation, which is crucial for a variety of NLP tasks (e.g. semantic annotation of corpora, information retrieval, or semantic inferencing). Since the manual encoding of such ontologies is very labour-intensive, the development of (semi-)automatic methods for acquiring lexical semantic information is an important task. This paper addresses the automatic acquisition of selectional preferences of verbs by means of statistical corpus analysis. Knowledge about such preferences is essential for inducing thematic relations, which link verbal concepts to nominal concepts that are selectionally preferred as their complements. Several approaches for learning selectional preferences from corpora have been proposed in recent years. However, their usefulness for ontology building is limited. This paper introduces a modification of one of these methods (i.e. the approach of Li &amp; Abe [1]) and evaluates it by employing a gold standard. The results show that the modified approach is much more appropriate for the given task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Recently, broad-coverage, general-purpose lexical semantic
ontologies have become available and/or are being developed for a variety
of natural languages, e.g. WordNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and EuroWordNet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
WordNet was developed at Princeton University for English and is widely
used in the NLP community. EuroWordNet is a multilingual
lexical semantic database which comprises WordNet-like ontologies for
eight European languages. These wordnets are connected to an
interlingual index so that a node in one language-specific wordnet can
be mapped to the corresponding node in another language-specific
wordnet. These resources capture the semantic properties of the most
common words in a language. In particular, they encode the different
senses of words (represented by the concept nodes of the ontology)
and the basic semantic relations between word senses, like
hyperonymy, antonymy, etc. (represented by the edges of the ontology).
Such resources contain useful information for word sense
disambiguation, which is a prerequisite for several NLP tasks like semantic
annotation of corpora, text analysis, information retrieval, or
semantic inferencing. Thus, the resources provide necessary information
for various kinds of NLP tools. They are intended to capture general,
domain-independent knowledge which complements the
domain-specific knowledge needed for a particular NLP system.
      </p>
      <p>
        The most important semantic relation in these ontologies is
hyponymy/hyperonymy. This relation constitutes a hierarchical
structuring of the different semantic concepts. In WordNet, for
example, the concept &lt;life form&gt; is a hyperonym of concepts like
&lt;animal&gt;, &lt;human&gt;, and &lt;plant&gt;. Other semantic relations are in
general not encoded exhaustively (or even not at all). However, they
also provide useful information for NLP tasks. One group of such
relations in EuroWordNet are thematic role relations. These relations
connect verbal concepts with nominal concepts which typically
occur as their complements. For example, the verbal concept &lt;eat&gt;
should have AGENT pointers to the nominal concepts &lt;human&gt;
and &lt;animal&gt;, and a PATIENT pointer to the concept &lt;food&gt;.
Thematic relations provide information about the selectional preferences
which verbs impose on their complements. This kind of information
is useful for lexical and syntactic disambiguation (cf. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>As manual encoding of ontologies is very labour-intensive, (semi-)
automatic methods have been explored, particularly the extraction
of information from other existing lexical resources. However, such
resources are often incomplete or not available at all. For example,
thematic relations are encoded in several language-specific wordnets
in EuroWordNet, but only for minor portions of the verb concepts,
so that a mapping of these relations to another language does not
yield exhaustive coverage.</p>
      <p>If appropriate lexical resources are missing, other means of
automatically acquiring lexical information have to be considered. One
possibility is the statistical analysis of corpora. This paper addresses
the usefulness of employing statistical methods for learning thematic
relations. In particular, I will investigate the acquisition of
selectional preferences that verbs impose on their complements.
Knowledge about selectional preferences is a prerequisite for encoding
thematic relations.</p>
      <p>
        Several approaches for the statistical acquisition of selectional
preferences (represented as WordNet noun classes) have been
proposed ([
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). As these approaches investigate corpora, i.e.
huge collections of sentences, they reveal preferences for
syntactic arguments. As thematic roles can have different syntactic
realizations, the preferences for syntactic complements have to be
mapped to the corresponding roles. This can be done manually or
(semi-)automatically; cf. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for some basic approaches to
solving this problem.
      </p>
      <p>If selectional preferences are gathered to supplement an ontology,
it is desirable (if not necessary) to find a representation for them
which is both empirically adequate (i.e. captures all and only the
preferred concepts) and as compact as possible. For example, it is not
desirable to introduce PATIENT relations from &lt;eat&gt; to all the food
concepts in the wordnet (&lt;meat&gt;, &lt;strawberry cake&gt;, ...) because
this would be highly redundant and would not express any
generalization. One would rather want to find a class which subsumes all
the preferred classes (such as &lt;food&gt;). Of course, a class which is
so general that it also subsumes dispreferred classes (e.g. &lt;entity&gt;)
is unacceptable as well. Thus, the problem is to find the appropriate
level of generalization. The “compactness desideratum” (find classes
which are not too specific) is particularly important for our task, the
extension of a semantic net. It is motivated from a practical point of
view (storage economy) as well as by conceptual considerations
(appropriate generalizations should be expressed; this is important for
applications like semantic inferencing).</p>
      <p>
        This paper is organized as follows: Section 2 examines the
suitability of the above-mentioned statistical methods for finding the
appropriate generalization level. I will describe in detail the Li &amp; Abe
approach [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is explicitly intended for that task. I will report
an experiment which reveals an inherent problem of this approach
with respect to generalization. In section 3, I will introduce a
modification of the method to overcome this problem, whose effect
is demonstrated by an analogous experiment. Section 4 describes a more
systematic evaluation of the alternative approaches against a gold standard
which I extracted from the EuroWordNet database. Section 5 gives a
conclusion and sketches future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Information theoretic foundations</title>
      <p>
        Among the approaches for the acquisition of selectional preferences
mentioned above, only the work of Li &amp; Abe systematically
addresses the problem of appropriate generalization. Resnik [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] does
not determine a set of classes that represents selectional preferences
at all. Ribas [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] determines such a set by a simple greedy algorithm.
The impact of this algorithm on the generalization level of the
selected classes is undetermined. Li &amp; Abe obtain a set of classes that
forms a partition of the corpus instances. They employ a theoretically
well-founded principle (Minimum Description Length) to find the
appropriate generalization level.
      </p>
      <p>In this section, I describe this method and the experiment I carried
out to test its behaviour.</p>
      <p>Information theory deals with coding information as efficiently as
possible. In the framework of this discipline, information is usually
coded in bits. If one has to code a sequence of signs (in our case,
nouns which occur as the complement of a certain verb in a corpus),
the simplest way to do this would be to represent each sign by a bit
sequence of uniform length. However, if the probabilities of the
individual signs differ significantly, it is more efficient (with respect to
data compression) to assign shorter bit sequences to more probable
(and thus more frequent) signs and longer bit sequences to less
probable (and less frequent) signs. It can be shown that one can achieve
the shortest average code length by assigning ⌈log2(1/p(x))⌉ bits to a
sign x with probability p(x) (cf. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Thus, if one has a good
estimation of the probability distribution which underlies the occurrence
of the signs, one can develop an efficient coding scheme (a mapping
between signs and bit sequences) based on this estimation.</p>
      <p>
        The approach of Li &amp; Abe [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is based on the Minimum Description Length (MDL)
Principle invented by J. Rissanen (cf. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
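      <p>
        As a minimal numeric sketch of this coding idea (the four-sign
distribution below is invented, not from the paper):

```python
# Uniform-length coding vs. assigning ceil(log2(1/p)) bits to each sign
# of probability p; the four-sign distribution is invented.
from math import ceil, log2

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

uniform_bits = ceil(log2(len(probs)))                       # 2 bits each
avg_uniform = sum(p * uniform_bits for p in probs.values())
avg_optimal = sum(p * ceil(log2(1.0 / p)) for p in probs.values())
print(avg_uniform, avg_optimal)
```

        With this skewed distribution, the variable-length code needs only
1.75 bits per sign on average instead of 2.
      </p>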
    </sec>
    <sec id="sec-3">
      <title>ACQUIRING SELECTIONAL PREFERENCES FROM CORPORA</title>
    </sec>
    <sec id="sec-4">
      <title>The basic method</title>
      <p>This principle depends on the assumption that learning can be seen
as data compression. The better one knows which general principles
underlie a given data sample, the better one can make use of them to
encode this sample efficiently. If one wants to encode a sample, one
has to encode (a) the probability model that determines a coding
scheme, and (b) the data themselves (by employing that coding
scheme). The MDL principle states that the best probability model is
the one which achieves the highest data compression, i.e. which
minimizes the sum of the lengths of (a) and (b). (The length of (a) is
called model description length, the length of (b) data description
length.) In our case, a sample consists of the noun tokens that appear
at a certain syntactic argument slot (e.g. the direct object of a certain
verb in the examined corpus).</p>
      <p>Li &amp; Abe represent the selectional behaviour of a verb (with
respect to a certain argument) as a so-called tree cut model. Such a
model provides a horizontal cut through the noun hierarchy tree, so
that the classes that are located along the cut form a partition of the
noun senses covered by the hierarchy. Each class is assigned a
preference value. The preference value for a class in the cut is inherited by
its subclasses. A tree cut model (cut plus preference values) determines
a probability distribution over the sample (see below), and hence a
coding scheme. (Examples of tree cut models can be found in Tables
1–6.)</p>
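      <p>
        The space of tree cut models can be illustrated with a toy
hierarchy (invented here, not WordNet): every cut either keeps a node
whole or combines cuts of its children.

```python
# Enumerate every tree cut of a toy noun hierarchy (invented, not
# WordNet): a cut keeps a node whole or combines cuts of its children.
from itertools import product

TREE = {"entity": ["food", "artifact"], "food": ["fruit", "meat"],
        "artifact": [], "fruit": [], "meat": []}

def cuts(node):
    result = [[node]]                       # keep the node itself
    kids = TREE[node]
    if kids:
        for combo in product(*(cuts(k) for k in kids)):
            result.append([c for part in combo for c in part])
    return result

all_cuts = cuts("entity")
print(all_cuts)
```

        This toy hierarchy admits three cuts, from the single class
&lt;entity&gt; down to the leaf classes.
      </p>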
      <p>As preference value, Li &amp; Abe estimate the so-called association
norm, the ratio between the conditional probability of the class given
the verb and its marginal probability.</p>
      <p>For simplicity, it is assumed that all possible cuts have uniform
probability. Thus Lcut, the length needed to encode the choice of the
cut, is constant for all cuts. As we aim at minimizing the description
length, we can neglect this term.</p>
      <p>So all word senses are represented by leaves. (These additional nodes
will be indicated by ‘REST::’ as they represent the “rest” of class
instances which is not captured by the subclasses.) To handle the DAG
issue, I “broke the DAG into a tree”: if a node has more
than one parent, I virtually duplicated that node (and its descendants)
to maintain a tree structure. This solution has the disadvantage that
parts of the sample are artificially duplicated. I will work on a more
principled solution in the future.</p>
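      <p>
        The DAG-breaking step can be sketched as follows (the miniature
hierarchy and the path-based renaming scheme are invented for
illustration, not the paper's implementation):

```python
# Minimal sketch of "breaking the DAG into a tree": when a node has more
# than one parent, recursion duplicates it (and its descendants) under
# each parent. The hierarchy is an invented example, not WordNet.
DAG = {
    "entity": ["object", "substance"],
    "object": ["food"],
    "substance": ["food"],   # "food" has two parents
    "food": [],
}

def dag_to_tree(node, path=""):
    """Return a nested (name, children) tree, renaming duplicates by path."""
    name = node if not path else node + "@" + path
    children = [dag_to_tree(c, path + "/" + node) for c in DAG[node]]
    return (name, children)

def count(tree, word):
    name, children = tree
    return int(name.startswith(word)) + sum(count(c, word) for c in children)

tree = dag_to_tree("entity")
print(count(tree, "food"))
```

        In this toy DAG, &lt;food&gt; ends up duplicated once per parent.
      </p>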
      <p>
        To eliminate noise, I introduced a threshold in the following way:
the algorithm compares possible cuts by traversing the hierarchy top
down. If a class with a probability smaller than 0.05 is encountered,
the traversal stops, i.e. the descendants of that class are not
examined. This also has the advantage of limiting the search space.
      </p>
      <p>
        The model description length is given by Lpar(M) = (K/2) · log2|S|,
where K is the number of parameters in M, i.e. the number of classes
on the cut, and |S| is the sample size. For every class on the cut, the
association norm is represented by (log2|S|)/2 bits. This precision
minimizes L(M) for a given M (cf. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).
      </p>
      <p>The data description length is given by the number of bits needed
to encode the sample under the model, Ldat(M) = Σx∈S log2(1/pM(x)),
where pM is the probability distribution determined by M (cf. section
2.1).</p>
      <p>If the tree cut is located near the root, then the model
description length will be low because the model contains only few classes.</p>
      <p>However, the data description length will be high because the code
for the data is based on the probability distribution of the classes in
the model, not on the real probability distribution of the individual
nouns. The greater the difference between the supposed distribution
and the real one, the longer the code. And the coarser the
classification is, the more the corresponding distribution pM deviates from the
real distribution. On the other hand, if the tree cut is located near the
leaves, the reverse is true: the fine-grained classification fits the data
well, resulting in a low data description length, but the great amount
of classes increases the model description length. Minimizing the
sum of these two description lengths yields a balance between
compactness (expressing generalizations) and accuracy (fitting the data)
of the model.</p>
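      <p>
        This trade-off can be sketched end to end (a minimal illustration
with an invented two-leaf hierarchy and sample, not the paper's
implementation; following the description above, pM spreads a class's
probability uniformly over its leaves):

```python
# Sketch of MDL-based cut selection over a toy hierarchy (invented data;
# p_M spreads a class's probability uniformly over its leaves).
from math import log2

TREE = {"food": ["fruit", "meat"], "fruit": [], "meat": []}

def leaves(node):
    kids = TREE[node]
    return [node] if not kids else [x for k in kids for x in leaves(k)]

def description_length(cut, sample):
    n = len(sample)
    l_par = len(cut) / 2.0 * log2(n)        # model description length
    l_dat = 0.0                             # data description length
    for noun in sample:
        cls = next(c for c in cut if noun in leaves(c))
        freq = sum(1 for x in sample if x in leaves(cls))
        p = (freq / n) / len(leaves(cls))   # uniform within the class
        l_dat += log2(1.0 / p)
    return l_par + l_dat

# A skewed sample makes the fine-grained cut pay off despite its model cost.
sample = ["fruit"] * 9 + ["meat"]
best = min([["food"], ["fruit", "meat"]],
           key=lambda cut: description_length(cut, sample))
print(best)
```

        Here the skewed sample selects the fine-grained cut.
      </p>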
      <p>
        To test the behaviour of the Li &amp; Abe approach with respect to
generalization, I applied it to acquire selectional preferences for the direct
object slot.7 I extracted verb–object instances from a portion of the
British National Corpus (parts A–E; about 40 million words) with
Steven Abney’s CASS parser [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].8 This resulted in a sample of about
2 million verb–noun pairs. Then I applied the algorithm of Li &amp; Abe
to calculate the selectional preferences of 24 test verbs and manually
inspected the results.
      </p>
      <p>The experiment revealed a significant drawback of employing the
MDL principle for our task. It turned out that the frequency of the
examined verb in the sample has an undesirable impact on the
generalization level of the tree cut model: The algorithm tends to
overgeneralize (acquire a tree cut with few general classes) for infrequent
verbs and to under-generalize (acquire a tree cut with many specific
classes) for frequent verbs. This behaviour is an immediate
consequence of the MDL principle: If a large amount of data has to be
described, then the model cost Lmod does not contribute much to the
whole description length L. The gain of a complex model for
encoding the data outweighs the model cost. If, however, only few data
have to be described, then Lmod is much more significant for L: the
cost of encoding a complex model outweighs the gain for encoding
the data.</p>
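      <p>
        This asymmetry can be seen in a back-of-the-envelope computation
(numbers invented): the model cost grows logarithmically in the sample
size while the data cost grows linearly.

```python
# Why sample size drives generalization under MDL (invented numbers):
# the model cost grows like log2(n), the data cost linearly in n, so for
# large samples the model cost is negligible and complex models become
# "affordable".
from math import log2

def costs(n, k, bits_per_token):
    l_par = k / 2.0 * log2(n)        # model description length
    l_dat = n * bits_per_token       # data description length
    return l_par, l_dat

for n in (100, 100000):
    l_par, l_dat = costs(n, k=20, bits_per_token=2.0)
    print(n, round(l_par, 1), round(l_dat, 1))
```
      </p>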
      <p>However, this is not the desired behaviour. Generalization should
not be triggered by the sample size, but by the “semantic variety” of
the instances in the sample: Nouns like “apple”, “pear”, “strawberry”
should generalize to &lt;fruit&gt;. Further instances like “pork” or “cake”
should trigger generalization to &lt;food&gt;, and yet further instances
like “house” or “vessel” to &lt;physical object&gt;.</p>
      <p>To illustrate these considerations, let us look at the verbs “kill”,
“murder”, and “assassinate”. Tables 1–3 show (parts of) the tree cut
models obtained for these verbs. For the rather frequent verb “kill”
(3352 occurrences), hyponyms of &lt;animal&gt; are acquired. These
classes are too specific; one would expect the class &lt;life form&gt;.</p>
      <p>In contrast, the tree cut model for the less frequent verb “murder”
(477 occurrences) is an over-generalization. This verb prefers the
more specific concept &lt;person&gt;.</p>
    </sec>
    <sec id="sec-5">
      <title>Implementational details</title>
      <p>
        In essence, I used the algorithm described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to obtain the tree cut
model. However, some modifications were necessary or useful for
practical reasons.
      </p>
      <p>Firstly, some WordNet-specific problems had to be solved. The
algorithm requires that the class hierarchy is a tree where the leaves
represent the word senses and the inner nodes represent semantic
classes. However, WordNet is not a pure tree but a DAG, and all
nodes represent both word senses and semantic classes (e.g. the node
&lt;person#individual#someone&gt; represents at the same time a
semantic class and a particular word sense for the nouns “person”,
“individual”, and “someone”). No hyponym of the class represents
this sense. To handle this problem, I introduced for every inner node
an additional node that captures the noun sense that the node
represents and made this additional node a hyponym of the inner node.</p>
    </sec>
    <sec id="sec-6">
      <title>Experiment</title>
      <sec id="sec-6-1">
        <title>Setting</title>
      </sec>
      <sec id="sec-6-2">
        <title>Results</title>
        <p>The over-generalization is even
worse for the infrequent verb “assassinate” (79 occurrences). The
selectional preference of this verb is even more specific; it prefers a
concept like “important person” (which does not exist in WordNet).
However, one of the most general concepts, &lt;entity&gt;, is retrieved.</p>
        <p>This behaviour arises because the data description length “grows
faster” than Lpar: for frequent verbs, the model description length
can be neglected, so that a model with many specific classes becomes
“affordable”.</p>
        <p>To overcome this drawback, I extended the expression to be
minimized by a weighting factor: instead of minimizing Lpar + Ldat,
the modified algorithm minimizes a weighted variant of this sum.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>THE WEIGHTING ALGORITHM</title>
    </sec>
    <sec id="sec-8">
      <title>Introducing a weighting factor</title>
      <p>Now both addends have the same complexity, and the sample size
does not directly affect generalization any more. The value of the
constant C influences the degree of generalization: the smaller C is,
the more general the acquired classes are. The possibility of
manipulating the overall generalization level by the choice of C
introduces some flexibility which might prove useful when the
algorithm is applied in different situations (tasks, domains,
languages, etc.).</p>
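      <p>
        Since the weighted objective is described here only qualitatively,
the following is a hypothetical instantiation with the stated
properties (smaller C yields more general cuts); it normalizes the
model cost by log2(n) and the data cost by n so that neither addend
grows with the sample size, and weights the data term by C. Hierarchy
and sample are invented.

```python
# Hypothetical weighted objective (illustration only, not the paper's
# exact formula): normalize the model cost by log2(n) and the data cost
# by n, then weight the data term by a constant C. A smaller C favours
# more general cuts; hierarchy and sample are invented.
from math import log2

LEAVES = {"food": ["fruit", "meat"], "fruit": ["fruit"], "meat": ["meat"]}
CUTS = [["food"], ["fruit", "meat"]]

def weighted_length(cut, sample, c):
    n = len(sample)
    l_par_norm = len(cut) / 2.0                 # (K/2)*log2(n) / log2(n)
    l_dat = 0.0
    for noun in sample:
        cls = next(k for k in cut if noun in LEAVES[k])
        freq = sum(1 for x in sample if x in LEAVES[cls])
        l_dat += log2(len(LEAVES[cls]) * n / freq)
    return l_par_norm + c * l_dat / n           # both addends O(1) in n

sample = ["fruit"] * 9 + ["meat"]
best_small = min(CUTS, key=lambda cut: weighted_length(cut, sample, 0.1))
best_large = min(CUTS, key=lambda cut: weighted_length(cut, sample, 50))
print(best_small, best_large)
```

        With C = 0.1 the coarse cut &lt;food&gt; wins; with C = 50 the
fine-grained cut wins.
      </p>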
      <p>Note that the introduction of weighting is a deviation from the
“pure” MDL principle that is based on the view that learning can
be regarded as data compression. However, it can be shown that the
modified algorithm is a kind of Bayesian learning.</p>
      <p>To test the impact of this modification on the generalization level of
the acquired tree cuts, I examined verbs with diverse numbers of
different noun complements (types) in the training sample. In particular,
I selected all verbs with a high number (≥ 1000), a medium number
(400–600), a low number (70–100), and a very low number (10–40)
of different complements and compared the generalization levels
retrieved by the “standard MDL” algorithm and the weighting
algorithm. (I arbitrarily chose C = 50.) For all verbs with a high number
and 89% of the verbs with a medium number of complements, the
weighting algorithm obtained more general classes than the standard
MDL algorithm. In contrast, more specific classes were computed for
almost all verbs with a low and a very low number of different
complements (95.9% and 99.5%, respectively). Hence, the modification
changes the behaviour of the algorithm in the desired direction
(variety of complements, rather than sample size, triggers
generalization).</p>
      <p>Tables 4–6 show the tree cut models for “kill”, “murder”, and
“assassinate” which are yielded by the weighting algorithm (C = 50).
Now these models are at the appropriate level of generalization.</p>
      <p>As mentioned in section 1, some of the wordnets in EuroWordNet
contain thematic relations: the wordnets for Dutch, English,
Estonian, Italian, and Spanish. These relations have been manually
encoded or extracted from other lexical resources, respectively. I
employed them for the gold standard by mapping them to WordNet
(which does not contain thematic relations itself).</p>
      <p>I started from the simplifying heuristic that the patient of a verb is
usually syntactically realized as its direct object. In EuroWordNet, a
verb sense is connected to a noun sense that it prefers as its patient
by the INVOLVED PATIENT relation. Thus, I mapped the relations
of this type to WordNet.</p>
      <p>I extracted those INVOLVED PATIENT relations where both the
source node and the target node were linked to a node in the
interlingual index (ILI) by a synonymy or a near-synonymy relation.9
The inter-lingual index essentially consists of all the concept nodes
of WordNet 1.5. Thus, extracting the ILI concepts equivalent to the
source and the target concept of an INVOLVED PATIENT relation,
respectively, immediately yields a mapping of this relation to
WordNet 1.5. With this procedure, I retrieved 605 relations.</p>
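      <p>
        This extraction step can be sketched with invented miniature data
(the identifiers below are hypothetical, not actual EuroWordNet
entries): keep only relations whose endpoints have (near-)synonymy
links to the ILI, and read the WordNet 1.5 concepts off those links.

```python
# Sketch of the gold-standard extraction (invented miniature data):
# keep INVOLVED_PATIENT relations whose source and target are linked to
# an inter-lingual-index (ILI) concept by (near-)synonymy, and map them
# to the equivalent WordNet 1.5 concepts via the ILI.
INVOLVED_PATIENT = [("eten:v1", "voedsel:n1"), ("lezen:v2", "krant:n3")]
ILI_LINK = {               # language-specific concept -> (relation, ILI id)
    "eten:v1": ("synonym", "ili-eat"),
    "voedsel:n1": ("near_synonym", "ili-food"),
    "lezen:v2": ("hyponym", "ili-read"),   # not a synonymy link: dropped
    "krant:n3": ("synonym", "ili-newspaper"),
}

def map_to_wordnet(pairs):
    ok = ("synonym", "near_synonym")
    mapped = []
    for src, tgt in pairs:
        s_rel, s_ili = ILI_LINK[src]
        t_rel, t_ili = ILI_LINK[tgt]
        if s_rel in ok and t_rel in ok:
            mapped.append((s_ili, t_ili))   # ILI ids double as WN 1.5 nodes
    return mapped

print(map_to_wordnet(INVOLVED_PATIENT))
```
      </p>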
      <p>However, a certain amount of these relations were inappropriate
for our task.</p>
      <p>9 Most concepts in the language-specific wordnets are linked to a
corresponding concept in the ILI by a synonymy link. However, it is
often the case that there is no ILI concept that exactly matches a
language concept. Such a language concept has to be linked to a
semantically related ILI concept, e.g. by a hyponymy or a hyperonymy
link.</p>
      <p>Up to now, it was not possible to automatically evaluate the
“intuitiveness” of the selectional preferences acquired by a certain
approach because there was no way to tell the computer which
preferences correspond to human intuition. One could only manually
inspect a few illustrative examples and concentrate on evaluating the
performance of the approach in NLP tasks, e.g. word sense
disambiguation (which is, of course, a crucial issue). The EuroWordNet
database provides information suitable for compiling a gold standard.</p>
      <p>This gold standard allows one to evaluate the lexicographic
appropriateness (the appropriateness with respect to building wordnets) of an
acquisition approach automatically and on a broader empirical basis.</p>
      <p>This section describes the evaluation of the standard MDL and the
weighting algorithm.</p>
      <p>The assumption that a patient is syntactically realized as an
object is a good starting point, but does not apply in all cases.
Unaccusative verbs (e.g. &lt;silt&gt;) realize their patient (e.g. &lt;sediment&gt;)
as their subject. Other verbs do not realize their patient as a syntactic
argument at all (e.g. &lt;delouse&gt; – &lt;louse&gt;). Patients of such verbs
cannot be found by examining verb objects.</p>
      <p>Furthermore, some relations like &lt;address&gt; INVOLVED
PATIENT &lt;addressee&gt; indicate a noun concept that itself is perfectly
adequate, but does not capture the majority of the noun instances
in the corpus. Any noun referring to a human could occur as the
patient of &lt;address&gt;. Thus, the learning algorithm should
generalize to the &lt;human&gt; level. However, &lt;addressee&gt; is a subclass of
&lt;human&gt; (which has no hyponyms itself). It makes sense to encode
thematic relations where the noun concept does not subsume all
preferred concepts extensionally, but characterizes them intensionally.</p>
      <p>However, such relations cannot be derived by generalizing from
corpus instances. They could rather be acquired by examining
derivational patterns.</p>
      <p>To obtain a gold standard that is appropriate for the evaluation of
the two algorithms, I excluded these problematic cases. 390 relations
remained.</p>
      <p>For every WordNet verb concept which was retrieved in this way,
I collected all the verbs which the concept represents and assigned
each of them the noun concepts linked to it. This means that the
information to which sense of the verb a noun concept is related is
lost. However, this is necessary to perform the comparison with the
results of the two algorithms because they compute preferences for
verbs, not verb senses. I obtained 599 verbs altogether (excluding
multiword lexemes).</p>
      <p>The intersection of the verbs in the gold standard and the verbs in
the training sample contained 522 verbs which were connected to
1082 noun concepts in the gold standard. For both algorithms (and
different values of C), I evaluated the match between the classes
acquired for a verb and the gold standard classes for that verb.
Table 7 shows the number and the percentage of the noun classes in
the gold standard which were exactly matched, not matched at all,10
or matched by more general or more specific classes in the tree cut
model. This table contains recall values (number of correct classes
that are found). Note that it does not make sense to calculate
precision (number of preferred classes in the tree cut model that are
correct) because the gold standard does not capture every sense of a
verb, i.e. it is not “complete” with respect to a particular verb.
</p>
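      <p>
        The level-of-match bookkeeping behind Table 7 can be sketched like
this (toy hierarchy and cuts are invented; the actual procedure may
differ in detail):

```python
# Classify how a gold standard class is matched by a tree cut: exactly,
# by an n-level hyperonym/hyponym in the cut, or not at all.
PARENT = {"food": "entity", "fruit": "food", "apple": "fruit",
          "entity": None}

def match_level(gold, cut):
    # Walk upwards from the gold class: a cut class k levels up is a
    # k-level hyperonym match.
    node, up = gold, 0
    while node is not None:
        if node in cut:
            return ("exact", 0) if up == 0 else ("hyperonym", up)
        node, up = PARENT[node], up + 1
    # Walk downwards: a cut class k levels below is a k-level hyponym match.
    down = {gold: 0}
    frontier = [gold]
    while frontier:
        node = frontier.pop()
        for child, par in PARENT.items():
            if par == node:
                if child in cut:
                    return ("hyponym", down[node] + 1)
                down[child] = down[node] + 1
                frontier.append(child)
    return ("not matched", None)

print(match_level("food", ["fruit"]))
print(match_level("food", ["entity"]))
```
      </p>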
    </sec>
    <sec id="sec-9">
      <title>Evaluation results</title>
    </sec>
    <sec id="sec-10">
      <title>Discussion</title>
      <p>Table 7 lists, for the weighting algorithm with C = 1,000 and
C = 10,000, the number (percentage) of gold standard classes matched
at each level (C = 1,000 / C = 10,000):
exactly matched: 14.8% / 162 (15.0%);
matched by 1-level hyperonym: 11.1% / 126 (11.6%);
matched by 1-level hyponym: 5.8% / 69 (6.4%);
matched by 2-level hyperonym: 11.3% / 125 (11.6%);
matched by 2-level hyponym: 0.6% / 6 (0.6%);
matched by ≥ 3-level hyperonym: 17.8% / 194 (17.9%);
matched by ≥ 3-level hyponym: 1.0% / 11 (1.0%);
not matched: 37.6% / 389 (36.0%).</p>
      <p>Increasing C improves the results to a certain extent. Above
C = 10,000, no improvement can be observed: the tree cuts have then
reached their “lower limit”. This limit is determined by the threshold
introduced to eliminate noise (cf. section 2.3).</p>
      <p>The percentage of classes which were not matched at all is higher
for the weighting algorithm. The reason for this is that the majority of
verbs occur rather infrequently (cf. Zipf’s law) so that standard MDL
tends to acquire over-general classes for them. Thus, the chance that
the class in the gold standard is subsumed by such a general class is
higher. (Note that most of the classes matched by standard MDL are
at least 3 levels too general.)</p>
      <p>However, even with the weighting algorithm the overall results
are not satisfying. 15% of the classes in the gold standard are
exactly matched; 33% are approximated with 0 or 1 level deviation.
41.1% are matched by too general classes, but only 8% by too
specific classes. More than one third of the classes is not found at all.</p>
      <p>The main reason for this behaviour is that the selectional
preferences are acquired for verb forms, not for verb senses. Calculating
a tree cut model for a highly polysemous verb may trigger
inappropriate generalizations, since the different senses of the verb could
introduce a high variety of complement nouns, which yields
generalization, even if each sense alone prefers rather specific noun
concepts.</p>
      <p>On the other hand, it would be useful to pool verb instances which
represent the same concept when calculating tree cut models. For
example, the verbs “arrest”, “nail”, “nab”, and “cop” are represented
by the same concept in the gold standard. More appropriate
selectional preferences could be acquired if the algorithm did not
compute one tree cut for all instances of “arrest”, one for all instances
of “nail”, etc., but one tree cut for all instances of “arrest”, “nail”,
“nab”, and “cop” which have the same sense. This would also reduce
the percentage of unmatched classes, since verb instances which have
a sense that does not occur in the gold standard would not be taken
into account.</p>
      <p>In this paper, I addressed the automatic acquisition of selectional
preferences by the statistical analysis of corpora, with the goal of
encoding them in lexical semantic ontologies. I argued that methods
which have been proposed for the acquisition of selectional preferences
do not satisfyingly cope with the task of finding the appropriate
generalization level. I modified one of these approaches and showed that
the modified approach is much better suited for computing
generalization levels which are appropriate for ontology building. The
EuroWordNet database provides information that can be combined to
obtain a gold standard for selectional preferences. With this gold
standard, lexicographic appropriateness can be evaluated
automatically and on a broader empirical basis. This evaluation shows
that the algorithm proposed in this paper is promising. However, the
results are not satisfying yet. One shortcoming of the experiments
described here (as well as in the mentioned work of Resnik, Ribas and
Li &amp; Abe) is that the learning algorithms are fed with word forms
rather than word senses, which would be adequate. Employing
corpora which are at least partially semantically disambiguated should
improve the performance significantly.</p>
      <p>In the near future, I will employ approaches for lexical
disambiguation and test their impact on the performance of the weighting
algorithm. Furthermore, I will test the methods described in this
paper for different argument slots. As large syntactically annotated
corpora are becoming more and more available, verb–argument
relations other than direct objects can be reliably extracted and fed
into the learning algorithm.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Naoki</given-names>
            <surname>Abe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>'Learning Word Association Norms Using Tree Cut Pair Models'</article-title>
          ,
          <source>in Proc. of 13th Int. Conf. on Machine Learning</source>
          , (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Steven</given-names>
            <surname>Abney</surname>
          </string-name>
,
          <article-title>'Partial parsing via finite-state cascades'</article-title>
          ,
          <source>in Workshop on Robust Parsing (ESSLLI '96)</source>
          , ed., John Carroll
          , pp.
          <fpage>8</fpage>
          -
          <lpage>15</lpage>
          , (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
<string-name>
            <given-names>Thomas M.</given-names>
            <surname>Cover</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joy A.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <source>Elements of Information Theory</source>
          , John Wiley &amp; Sons, New York,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[4]
          <source>WordNet: An Electronic Lexical Database</source>
          , ed.,
          <string-name>
            <given-names>Christiane</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          , MIT Press, Cambridge, Mass.,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Hang</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Naoki</given-names>
            <surname>Abe</surname>
          </string-name>
          , '
          <article-title>Generalizing Case Frames Using a Thesaurus and the MDL Principle'</article-title>
          ,
          <source>in Proc. of Int. Conf. on Recent Advances in NLP</source>
          , (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Diana</given-names>
            <surname>McCarthy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Korhonen</surname>
          </string-name>
, '
          <article-title>Detecting verbal participation in diathesis alternations'</article-title>
          ,
          <source>in Proc. of 36th Annual Meeting of the Association for Computational Linguistics</source>
          , (
          <year>1998</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Wim</given-names>
            <surname>Peters</surname>
          </string-name>
          , '
          <article-title>Corpus-based conceptual characterisation of verbal predicate structures'</article-title>
          ,
          <source>in Proc. of Computational Linguistics in the Netherlands, Antwerpen</source>
          , (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Philip</given-names>
            <surname>Resnik</surname>
          </string-name>
          , '
          <article-title>Selectional preference and sense disambiguation'</article-title>
          ,
<source>in ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?</source>
          , Washington, D.C.
          , (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
<string-name>
            <given-names>Philip Stuart</given-names>
            <surname>Resnik</surname>
          </string-name>
          ,
          <source>Selection and Information: A Class-Based Approach to Lexical Relationships</source>
          , Dissertation, University of Pennsylvania,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
<string-name>
            <given-names>Francesc</given-names>
            <surname>Ribas</surname>
          </string-name>
          ,
          '
          <article-title>An experiment on learning appropriate selectional restrictions from a parsed corpus'</article-title>
          ,
          <source>in Proc. of COLING</source>
          , Kyoto, (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Jorma</given-names>
            <surname>Rissanen</surname>
          </string-name>
and
          <string-name>
            <given-names>Eric Sven</given-names>
            <surname>Ristad</surname>
          </string-name>
          , '
          <article-title>Language acquisition in the MDL framework'</article-title>
          , in
          <source>Language Computations</source>
          , ed., Eric Sven Ristad, volume
          <volume>17</volume>
          of
          <source>DIMACS Series in Discrete Mathematics and Theoretical Computer Science</source>
          , pp.
          <fpage>149</fpage>
          -
          <lpage>166</lpage>
          , (
          <year>1992</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Piek</given-names>
            <surname>Vossen</surname>
          </string-name>
, ed.,
          <source>EuroWordNet Final Document</source>
          . EuroWordNet (LE2-4003, LE4-8328), (
          <year>1999</year>
          ). Deliverable D032D033/2D014.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>