Learning Relations using Collocations
                           Gerhard Heyer, Martin Läuter, Uwe Quasthoff, Thomas Wittig, Christian Wolff
                                                         Leipzig University
                               Computer Science Institute, Natural Language Processing Department
                                                       Augustusplatz 10 / 11
                                                          D-04109 Leipzig
                                {heyer, laeuter, quasthoff, wittig, wolff}@informatik.uni-leipzig.de

                           Abstract                                    based word collocations stands out as an especially valu-
This paper describes the application of statistical analysis of        able tool for corpus-based language technology applica-
large corpora to the problem of extracting semantic relations          tions (see Quasthoff 1998A, Quasthoff & Wolff 2000).
from unstructured text. We regard this approach as a viable            Additional, application oriented tools exist for search
method for generating input for the construction of ontologies as      engine optimization as well as automatic document classi-
ontologies use well-defined semantic relations as building
blocks (cf. van der Vet & Mars 1998). Starting from a short            fication (see Heyer, Quasthoff & Wolff 2000). The cor-
description of our corpora as well as our language analysis tools,     pora are available on the WWW
we discuss in depth the automatic generation of collocation sets.      (http://www.wortschatz.uni-leipzig.de) and may be used as
We further give examples of different types of relations that may      a large online dictionary.
be found in collocation sets for arbitrary terms. The central
question we deal with here is how to postprocess statistically         2     Collocations
generated collocation sets in order to extract named relations.
We show that for different types of relations like cohyponyms or       The occurrence of two or more words within a well-
instance-of-relations, different extraction methods as well as         defined unit of information (sentence, document) is called
additional sources of information can be applied to the basic          a collocation. For the selection of meaningful and signifi-
collocation sets in order to verify the existence of a specific type   cant collocations, an adequate collocation measure has to
of semantic relation for a given set of terms.
                                                                       be defined. In the literature, quite a number of different
                                                                       collocation measures can be found; for an in-depth dis-
1      Analysis of Large Text Corpora                                  cussion of various collocation measures and their applica-
Corpus Linguistics is generally understood as a branch of              tion cf. Smadja 1993, Lemnitzer 1998, Krenn 2000.
computational linguistics dealing with large text corpora
for the purpose of statistical processing of language data             2.1    The Collocation Measure
(cf. Armstrong 1993, Manning & Schütze 1999). With the                 In the following, our approach towards measuring the
availability of large text corpora and the success of robust           significance of the joint occurrence of two words A and B
corpus processing in the nineties, this approach has re-               in a sentence is discussed. Let
cently become increasingly popular among computational                 a, b     be the number of sentences containing A and B, resp.,
linguists (cf. Sinclair 1991, Svartvik 1992).                          k        be the number of sentences containing both A
Since 1995 a German text corpus of more than 300 mil-                           and B,
lion words has been collected (cf. Quasthoff 1998B,                    n        be the total number of sentences.
Quasthoff & Wolff 2000), containing approx. 6 million
different word forms in approx. 13 million sentences,                  Our significance measure calculates the probability of
which serves as input for the analysis methods described               joint occurrence of rare events. The results of this meas-
below. Similarly structured corpora have recently been set             ure are quite similar to the well-known log-likelihood-
                                                                       measure (cf. Krenn 2000):
up for other European languages as well (English, French,
Dutch), with more languages to follow in the near future               Let x = ab/n and define:
(see table 1).
                                                                       $$\operatorname{sig}(A, B) = \frac{-\log\left(1 - e^{-x}\sum_{i=0}^{k-1}\frac{x^{i}}{i!}\right)}{\log n}.$$

                  German   English      Dutch      French
    word tokens   300 M    250 M        22 M       15 M
    sentences     13.4 M   13 M         1.5 M      860,000
    word types    6 M      1.2 M        600,000    230,000
                                                                       For 2x < k, we get the following approximation which is
  Table 1: Basic Characteristics of the Corpora                        much easier to calculate:
The basic goal of this corpus-based approach is to collect                       sig(A,B) = (x – k log x + log k!) / log n
large amounts of textual data as input for semantic proc-              In the case of next neighbor collocations we replace the
essing. Starting off from a rather simple data model tai-              definition of the above variables by the following. Instead
lored for large amounts of data and efficient processing               of a sentence we consider pairs (A, B) of words which are
using a relational data base system at storage level we                next neighbors in this sentence. Hence, instead of one
employ a simple yet powerful technical infrastructure for              sentence of n words we have n - 1 pairs. For right neigh-
processing texts to be included in the corpus. Besides basic           bor collocations (A, B) let
procedures for text integration into the corpus various                a, b     be the number of pairs of type (A, ?) and (?, B)
tools have been developed for post-processing linguistic                        resp.,
data. Among them the automatic calculation of sentence-                k        be the number of pairs (A, B),
n         be the total number of pairs. This equals the total   word, the rest of the picture is automatically computed,
          number of running words minus the number of           but represents semantic connectedness surprisingly well.
          sentences.                                            Unfortunately the relations between the words are just
Given these variables, the significance measure is calcu-       presented, but not yet named. Fig. 1 shows the collocation
lated as shown above. In general, this measure yields           graph for space. Three different meaning contexts can be
semantically acceptable collocation sets for values above       recognized in the graph:
an empirically determined positive threshold (see exam-         • real estate,
ples in section 3 below).                                       • computer hardware, and
                                                                • astronautics.
2.2    Properties of the Collocation Measure                    The connection between address and memory results from
In order to describe basic properties of this measure, we       the fact that address is another polysemous concept.
write sig(n, k, a, b) instead of sig(A, B) where n, k, a, and
b are defined as above.
Simple co-occurrence: A and B occur only once, and they
occur together:
         sig(n,1,1,1) → 1               (for n → ∞).
Independence: A and B occur statistically independently
with probabilities p and q:
         sig(n,npq,np,nq) → 0           (for n → ∞).
Additivity: The unification of the words B and B' just adds
the corresponding significances. For k/b ≈ k'/b' we have
         sig(n,k,a,b) + sig(n,k',a,b') ≈ sig(n,k+k',a,b+b')
Enlarging the corpus by a factor m:
         sig(mn, mk, ma, mb) = m sig(n, k, a, b)
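
The measure of section 2.1 and the first two properties listed here can be checked numerically with a short Python sketch. The function names `sig` and `sig_approx` are hypothetical, natural logarithms are assumed, and the Poisson tail is summed from i = k upward (rather than subtracting the head of the sum from 1) only to avoid numerical cancellation. The sketch assumes a, b > 0 and a moderate x = ab/n; it is not tuned for very large x.

```python
import math


def sig(n, k, a, b):
    """Exact measure: -log P(X >= k) / log n for X ~ Poisson(x), x = ab/n."""
    x = a * b / n
    # Logarithm of the first tail term e^{-x} x^k / k!.
    log_first = -x + k * math.log(x) - math.lgamma(k + 1)
    # Remaining tail terms relative to the first one:
    # 1 + x/(k+1) + x^2/((k+1)(k+2)) + ...
    series, term, i = 1.0, 1.0, k
    while term > 1e-16 * series:
        i += 1
        term *= x / i
        series += term
    return -(log_first + math.log(series)) / math.log(n)


def sig_approx(n, k, a, b):
    """Approximation (x - k log x + log k!) / log n, intended for 2x < k."""
    x = a * b / n
    return (x - k * math.log(x) + math.lgamma(k + 1)) / math.log(n)
```

Calling `sig(10**6, 1, 1, 1)` gives a value close to 1 (simple co-occurrence), while `sig(10**6, 100, 10**4, 10**4)` stays close to 0 (independent words with p = q = 0.01).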

2.3    Finding Collocations                                     Fig. 1: Collocation Graph for space
For calculating the collocation measure for any reasonable
pairs we first count the joint occurrences of each pair.
                                                                3     Relations Represented by Collocations
This problem is complex both in time and storage. Nev-          If we fix one word and look at its set of collocates, then
ertheless, we managed to calculate the collocation meas-        some semantic relations appear more often than others.
ure for any pair with total frequency of at least 3 for each    The following example shows the most significant collo-
component. Our approach is based on extensible ternary          cations for king ordered by significance:
search trees (cf. Bentley & Sedgewick 1998) where a             queen (90), mackerel (83), hill (49), Milken (47), royal (44),
count can be associated to a pair of word numbers. The          monarch (33), King (30), crowned (30), migratory (30), rook
memory overhead from the original implementation could          (29), throne (29), Jordanian (26), junk-bond (26), Hussein (25),
be reduced by allocating the space for chunks of 100,000        Saudi (25), monarchy (25), crab (23), Jordan (22), Lekhanya
nodes at once. Even when using this technique on a large        (21), Prince (21), Michael (20), Jordan's (19), palace (19),
memory computer more than one run through the corpus            undisputed (18), Elvis (17), Shah (17), deposed (17), Panchayat
may be necessary, taking care that every pair is only           (16), Zahir (16), fishery (16), former (16), junk (16), constitution
                                                                (15), exiled (15), Bhattarai (14), Presley (14), Queen (14),
counted once. The resulting word pairs above a threshold
                                                                crown (14), dethroned (14), him (14), Arab (13), Moshoeshoe
significance are put into a database where they can be
                                                                (13), himself (13), pawns (13), reigning (13), Fahd (12), Nepali
accessed and grouped in many different ways. As collo-
                                                                (12), Rome (12), Saddam (12), once (12), pawn (12), prince
cations are calculated for different language corpora, our      (12), reign (12), [...] government (10) [...]
examples will be taken from the English as well as the          The following types of relations can be identified:
German database.                                                • Cohyponymy (e. g. Shah, queen, rook, pawn),
2.4    Visualization of Collocations                            • top-level syntactic relations, which translate to se-
                                                                     mantic ‘actor-verb’ and often used properties of a
Besides textual output of collocation sets, visualizing them         noun (reign; royal, crowned, dethroned),
as graphs is an additional type of representation: We
                                                                • instance-of (Fahd, Hussein, Moshoeshoe),
choose a word and arrange its collocates in the plane so
                                                                • special relations given by multiwords (A prep/det/
that collocations between collocates are taken into ac-
                                                                     conj B, e. g. king of Jordan), and
count. This results in graphs that show homogeneity
where words are interconnected and they show separation         • unstructured set of words describing some subject
where collocates have little in common. Linguistically               area, e. g. constitution, government.
speaking, polysemy is made visible (see fig. 1 below).          Note that synonymy rarely occurs in the lists. The rela-
Technically speaking, we use simulated annealing to             tions may be classified according to the properties sym-
position the words (see Davidson & Harel 1996). Line            metry, anti-symmetry, and transitivity.
thickness represents the significance of the collocation. Of
course, all words in the graph are linked to the central
3.1    Symmetric Relations                                            high probability) that the unknown category ? is in
Let us call a relation r symmetric if r(A, B) always implies          fact an instance name. Examples are:
r(B, A). Examples of symmetric relations are                                      metals like nickel, arsenic and lead
• synonymy,                                                                       rivers like the Ganges
                                                                                  newspapers like Pravda
• cohyponymy (or similarity),
                                                               The applicability of patterns like these may heavily de-
• elements of a certain subject area, and
                                                               pend on language characteristics like preposition usage.
• relations of unknown type.                                   This type of extraction method is simple and well known;
Usually, sentence collocations express symmetric rela-         in our approach it is combined with collocation analysis,
tions.                                                         thus yielding better results both in quality and in quantity
                                                               (see section 5).
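
The pattern "(class name) like ?" from section 4.1 can be sketched with a regular expression. This is a minimal illustration, not the extraction rules actually used for the corpora: the function name `instances_after_like` is hypothetical, the list of class names is assumed to be given, and the pattern only handles enumerations joined by commas and "and", with an optional leading "the".

```python
import re


def instances_after_like(text, class_names):
    """Collect (class name, instance) pairs from '<class name> like X, Y and Z'."""
    pairs = []
    item = r"(?:the )?\w+"  # one enumerated item, optionally preceded by "the"
    for cls in class_names:
        pattern = (r"\b" + re.escape(cls) + r" like ("
                   + item + r"(?:(?:, | and )" + item + r")*)")
        for match in re.finditer(pattern, text):
            for name in re.split(r", | and ", match.group(1)):
                if name.startswith("the "):
                    name = name[4:]  # drop the article
                pairs.append((cls, name))
    return pairs
```

Applied to the example "metals like nickel, arsenic and lead" with the class name "metals", the sketch proposes nickel, arsenic and lead as instances. Such candidates hold only with high probability and still need the verification by collocation analysis described in section 5.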
3.2    Anti-symmetric Relations
Let us call a relation r anti-symmetric if r(A, B) never       4.2    Compounds
implies r(B, A). Examples of anti-symmetric relations are      German compounds consist of two (or more) words glued
• hyponymy and                                                 together by varying mechanisms. The head word (coming
• relations between properties and their owners like ac-       second) is further determined by the first part of the com-
     tion and actor or class and instance.                     pound (modifier), which may originally be an adjective,
Usually, next neighbor collocations of two words express       another noun or a verb stem. In almost all cases a seman-
anti-symmetric relations. In the case of next neighbor         tic relation between both parts and the compound can be
collocations consisting of more than two words (like A         found. In section 5.3 we show how the combination of
prep/det/conj B e. g. Samson and Delilah), the relation        compound segmentation with collocation analysis can be
might be symmetric, for instance in the case of conjunc-       used for identifying named relations in compounds.
tions like and or or (cf. Läuter & Quasthoff 1999).
                                                               4.3    Feature Vectors Given by Collocations and
3.3    Transitivity                                                   Clustering
Transitivity of a relation means that r(A, B) and r(B, C)      To investigate the meaning of a word A, its contexts in the
always imply r(A, C). In general, a relation found ex-        texts have to be examined because they reflect the use of
perimentally will not be transitive, of course. But there      A. If two words A and B have similar contexts, that is,
may be a part where transitivity holds.                        they are alike in their use, this indicates that there is a
Some of the most prominent transitive relations are the        semantic relation between A and B of some kind.
cohyponymy, hyponymy, synonymy, and is-a relations.            A kind of average context for every word A is formed by
Note that our graphical representation mainly shows tran-      all collocations for A with a significance above a certain
sitive relations by construction. This kind of relation is     threshold.
also able to give further results in the combination proce-    This average context of A is transferred into a feature
dures described below.                                         vector of A using all words as features as usual. This re-
                                                               sults in sparse vectors used for description. The feature
4     Other Sources for Relations                              vector of word A is indeed a description of the meaning of
While we may intellectually identify types of semantic         A, because the most important words of the contexts of A
relations in collocations sets, additional information and /   are included.
or analysis is needed for automatically naming these rela-     Clustering of feature vectors can be used to investigate the
tions. In the following, we give different examples for        relations between a group of similar words and to figure
such complementary information.                                out whether or not all the relations are of the same kind.
                                                               The following HACM algorithm has an additional natural
4.1    Pattern Based Relations                                 reason to stop. It works bottom up like this:
Simple pattern-based relations can be extracted from text      • All words are treated as (basic) items. Each item has
if knowledge about information categories like proper               a description (feature vector).
names is used as input. As our corpora include several         • In each step of the clustering process the two items A
large lists of classified terms like names of professions           and B with the most similar description vectors are
and last names, extraction rules may be defined:                    searched and fitted together to create a new complex
    i. Extraction of first names:                                   item C combining the words in A and B. The scalar
       A pattern like (profession) ? (last name) implies            product is used for determining similarity between
       (with high probability) that the unknown category            vectors.
       ? is in fact a first name. Examples are                      Each step of the clustering algorithm reduces the
                   actress Julia Roberts                            number of items by one.
                   hockey hero Wayne Gretzky                   • The feature vector for C is constructed from the fea-
                   Senator Jesse Helms                              ture vectors of A and B. Therefore we calculate a
   ii. Extraction of instance-of-relations given the class          combined significance for C with respect to all words
       name: The pattern (class name) like ? implies (with          Xi as follows:
                                                                          subjected to the collocation analysis again and again. We
     $\operatorname{sig}(C, X_i) = \frac{n_a}{n_a + n_b}\,\operatorname{sig}(A, X_i) + \frac{n_b}{n_a + n_b}\,\operatorname{sig}(B, X_i)$
                                                                          might expect that some of the collocational relations are
                                                                          strengthened while others will vanish from the iterated
     for all i, 1 ≤ i ≤ n with
                                                                          sets of collocations which we will call higher order collo-
     n total number of words in the corpus,
                                                                          cations. We describe two experiments for the iteration
     na number of words combined in item A, and
                                                                          process: Instead of plain text we start with collocation
     nb number of words combined in item B.
                                                                          sets, using sentence collocations for experiment 1 and
• The algorithm stops if only one item is left or if all
                                                                          next neighbor collocations for experiment 2. In the case of
     remaining feature vectors are orthogonal. This results
                                                                          a symmetric relation we observe a strengthening while
     usually in a very natural clustering if the threshold for
                                                                          iterating sentence collocations. In the case of an anti-
     constructing the feature vectors is suitably chosen.
                                                                          symmetric relation we observe the same when iterating
A cluster of words with probably the same semantic
                                                                          next neighbor collocations.
relation between each of them can be found in the analy-
sis tree by comparing the similarity between items inside                 Experiment 1: Iterating Sentence Collocations
the items A and B (if these items are complex) with the                   The production of collocations is applied to sets of sen-
calculated similarity between A and B, when fitting them                  tence collocations instead of sentences. E.g., the collec-
together to C. If there is a large difference between them,               tion of 500,000 sentence collocations has the following
this is an indication for a different relation between words              ‘sentence‘ (collocation set) for Hemd (shirt): Hemd
combined in item A and words combined in item B. In the                   Krawatte Hose weißes Anzug weißem Jeans trägt trug
appendix, some examples for this type of semantic clus-                   bekleidet weißen Jacke schwarze Jackett schwarzen Weste
tering are given.                                                         kariertes Schlips Mann
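
The bottom-up clustering procedure of section 4.3 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the corpora: the names `dot` and `cluster` are hypothetical, feature vectors are stored as sparse dictionaries with non-negative significances, and the combined vector of C is read as the average of the vectors of A and B weighted by n_a and n_b.

```python
def dot(u, v):
    """Scalar product of two sparse vectors stored as {feature: significance}."""
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def cluster(vectors):
    """Merge the two most similar items until one item is left
    or all remaining feature vectors are orthogonal."""
    items = [((word,), dict(vec)) for word, vec in vectors.items()]
    merges = []
    while len(items) > 1:
        best, pair = 0.0, None
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                s = dot(items[i][1], items[j][1])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:  # every pair has similarity 0: natural stop
            break
        i, j = pair
        (words_a, vec_a), (words_b, vec_b) = items[i], items[j]
        na, nb = len(words_a), len(words_b)
        # Weighted combination of the two feature vectors.
        vec_c = {f: (na * vec_a.get(f, 0.0) + nb * vec_b.get(f, 0.0)) / (na + nb)
                 for f in set(vec_a) | set(vec_b)}
        merges.append((words_a, words_b, best))
        items = [it for idx, it in enumerate(items) if idx not in pair]
        items.append((words_a + words_b, vec_c))
    return [words for words, _ in items], merges
```

The returned list of merges records the similarity at which two items were fitted together. Comparing these values along the analysis tree is what indicates whether the words inside an item share the same relation.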
Symmetric clustering                                                      Example for iterated sentence collocations of Eisen (iron):
If we assume that a cluster represents a semantic relation,               Original collocations: Stahl, heißes, heiße, Kupfer, Man-
the cluster should represent the possible symmetry and                    gan, alten, Feuer, Zink, Holz, Marmor
transitivity of the underlying semantic relation.                         Iterated collocations: Kupfer, Stahl, Zink, Aluminium,
Symmetry and transitivity ensure that the terms to be                     Magnesium, Mangan, Nickel, Blei, Zinn, Gold
clustered will themselves be responsible for the cluster-                 As expected, the iterated collocation set only contains
ing. This in turn implies that the terms found in the cluster             cohyponyms.
will also be found in the feature vector in prominent posi-
tions.                                                                    Experiment 2: Iterating Next Neighbor Collocations
In example 1 (Appendix) the clustering result for January                 In this experiment, the production of collocations is ap-
is shown. In the first column we find the terms to be                     plied to sets of next neighbor collocations instead of sen-
clustered, on the right hand side there are the components                tences. The collection of 250,000 next neighbor colloca-
of the feature vectors ordered by significance.                           tions has the following two ‘sentences‘ for Hemd (shirt):
The clustered items both appear together and share a cer-                 weißes weißem weißen blaues kariertes kariertem offenem
tain aspect. The names of the months or weekdays as                       aufs karierten gestreiftes letztes [...] (left neighbors)
names for periods of time cluster together, just because                  näher bekleidet ausgezogen spannt trägt aufknöpft ausge-
they are collocates with one another. The same can be                     plündert auszieht wechseln aufgeknöpft ausziehen [...]
shown to be true for teammates, metals, colors or fruit.                  (right neighbors)
                                                                          Example for iterated neighbor collocations of Auto (car):
Anti-symmetric clustering
For anti-symmetric relations the situation is different.                  Original collocations: fahren, Wagen, prallte, Fahrer,
Again the elements of the original set to be clustered                    seinem, fuhr, fährt, Polizei, erfaßt, gefahren
share a certain aspect, but this aspect is described by a                 Iterated collocations: Wagen, Lastwagen, Fahrzeug,
distinct set of words. Presumably this second set of words                Autos, Personenwagen, Bus, Zug, Haus, Lkw, Pkw
will also cluster. Moreover, it will use the original set as              Example for iterated neighbor collocations of erklärte
clustering terms.                                                         (explained):
This is shown in example 2 (Appendix). Here we show                       Original collocations: Sprecher, werde, gestern, seien,
that the set given by Präsident, Vorsitzender, Vorsitzende,               Wir, bereit, wolle, Vorsitzende, Anfrage, Präsident
Sprecher, Sprecherin properly clusters using words like                   Iterated collocations: sagte, betonte, sprach, kündigte,
sagte, erklärte, teilte (German verbs of utterance).                      wies, nannte, warnte, bekräftigte, meinte, kritisierte
Conversely, in example 3 (Appendix) we find that the set
verwies, mitteilte, meinte, bestätigte, betonte properly                  Both experiment 1 and experiment 2 result in collocation
clusters using terms from the above cluster.                              sets carrying a homogeneous semantic relation.
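
The iteration step of section 4.4 treats every collocation set as a "sentence" and computes collocations again. The following sketch shows the idea on toy data. It is a simplification: plain co-occurrence counts with a minimum count replace the significance measure of section 2.1, and the name `iterate_collocations` and the parameter `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def iterate_collocations(colloc_sets, min_count=2):
    """colloc_sets maps each word to its collocates. Every set (word plus
    collocates) is treated as one 'sentence'; words that share at least
    min_count such sentences become higher order collocates."""
    pair_counts = Counter()
    for word, collocates in colloc_sets.items():
        sentence = sorted({word, *collocates})
        for a, b in combinations(sentence, 2):
            pair_counts[(a, b)] += 1
    ranked = {}
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            ranked.setdefault(a, []).append((count, b))
            ranked.setdefault(b, []).append((count, a))
    # Order by descending count, then alphabetically.
    return {w: [x for _, x in sorted(lst, key=lambda t: (-t[0], t[1]))]
            for w, lst in ranked.items()}
```

In a toy input where Eisen, Stahl, Kupfer and Zink list each other as collocates and heißes occurs only in the set of Eisen, the adjective drops out after one iteration and only the cohyponyms remain.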

4.4    Homogeneous Relations: Iterating the                               5    Combining Non-contradictory Partial
       Collocation Process                                                     Results
The extraction of collocation sets from plain text can be                 In section 3 we have given evidence that collocation sets
viewed as some kind of information condensation. This                     contain various types of semantic relations without ex-
process can be iterated if collocation sets themselves are                plicitly naming them while section 4 has introduced a
number of methods for relation extraction. This section        Example: As result 1 we might know that Schwanz (tail)
shows different ways of combining results of these ex-         is part of Pferd (horse). Similar terms to Pferd are both
traction approaches. Combining them gives                      Kuh (cow) and Hund (dog) (result 2). Both of them have
more and/or better results.                                    the term Schwanz in their set of significant collocations
                                                               (result 3). Hence we might correctly conjecture that both
5.1    Identical Results                                       Kuh and Hund have a tail (Schwanz) as part of their body.
Two or more of the above algorithms may suggest a cer-         In contrast, Reiter (rider) is a strong collocation to Pferd
tain relation between two words, for instance, cohypo-         and might (incorrectly) be conjectured to be another
nymy.                                                          similar concept, but Reiter is no collocation with respect
Example: If both the second order collocations introduced      to Schwanz. Hence, the absence of result 3 prevents us
in section 4.4, and clustering by feature vectors (sec-        from making an incorrect conclusion.
tion 4.3) independently yield similar sets of words as a
result, this may be taken as an indication of cohyponymy       5.4    Similarity Used to Infer a Strong Property
between the words, e. g. sagte, betonte, kündigte, wies,       Let us call a property p important, if it is preserved under
nannte, warnte, bekräftigte, meinte […] (German verbs of       similarity. This strong feature can be used as follows:
utterance).                                                    Result 1: A has a certain important property p
                                                               Result 2: B is similar to A (i. e., B is a cohyponym of A)
5.2    Supporting Second Results                               Conclusion: B has the same property p
In the second combination type a known relation given by       Example: We consider A and B as similar if they are in the
one method of extraction is verified by an identical but       set of right neighbor collocations of Hafenstadt (port
unnamed second result as follows:                              town) (result 2). If we know that Hafenstadt is a property
Result 1: There is a certain relation r between A and B        of its typical right neighbors (result 1) we may infer this
Result 2: There is some strong (but unknown) relation          property for more than 200 cities like Split, Sidon,
between A and B (e. g. given by a collocation set)             Durban, Kismayo, Tyrus, Vlora, Karachi, Durres, […].
Conclusion: Result 1 holds with more evidence.
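
The scheme of section 5.2 (a named relation supported by an unnamed second result) amounts to a simple filter. In the sketch below, the name `supported`, the nested-dictionary layout of the collocation data and the threshold are assumptions made for illustration only.

```python
def supported(candidates, collocations, threshold=0.0):
    """Keep candidate pairs (A, B) of a named relation for which A and B
    are also significant collocates of each other (in either direction)."""
    kept = []
    for a, b in candidates:
        s = max(collocations.get(a, {}).get(b, 0.0),
                collocations.get(b, {}).get(a, 0.0))
        if s > threshold:
            kept.append((a, b))
    return kept
```

With the significances for king listed in section 3 (queen 90, throne 29) and a threshold of 10, the candidate pairs (king, queen) and (throne, king) are kept, whereas a pair without a collocation entry is dropped.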
One can use this support of orthogonal tests in many           5.5    Subject Area Inferred from Collocation Sets
ways: Without knowing anything about deeper language           Result 1: A, B, C, ... are collocates of a certain term.
structure or parsing we can filter out verbs just by testing   Result 2: Some of them belong to a certain subject area.
if a string accepts at least two of the endings –(e)s, -ing    Conclusion: All of them belong to this subject area.
and –ed/t. The recall is remarkably high. In German we         Example: Consider the following top entries in the collo-
tested only one mechanism of noun formation from a verb        cation set of carcinoma: patients, cell, squamous,
and got 70% of all verbs with a precision of 83%.              radiotherapy, lung, thyroid, treated, hepatocellular,
Word formation mechanisms can be explored further. In          metastases, adenocarcinoma, cervix, irradiation, breast,
German, compound nouns are joined together to form one         treatment, CT, therapy, renal, cases, bladder, cervical,
word. There are several (highly irregular) patterns of         tumor, cancer, metastatic, radiation, uterine, ovarian,
gluing letters between the words. Testing all available        chemotherapy, […]
word tokens whether they could be the compound of two          If we know that some of them belong to the subject area
stemmed words from word lists of 93,000 current nouns          Medicine, we can add this subject area to the other mem-
reveals just under a million compounds in their stemmed        bers of the collocation set as well.
form. Here stemming accuracy is supported by the exis-
tence of both compounds in the basic list. When elimi-         6     Conclusion
nating a hundred words which are prone to generate             In this paper, we described different approaches for the
wrong separations this algorithm achieves an accuracy of       extraction of named semantic relations from large text
90%.                                                           corpora. The types of relations are compatible with rela-
Example:                                                       tions typically used for constructing ontologies (cf.
Result 2: The German compound Entschädigungsgesetz             Chandrasekaran 1999:22). The combination of different
can be divided into Gesetz and Entschädigung with an           types of input information as well as the application of
unknown relation.                                              robust statistical analysis methods guarantees that this
Result 1 is given by the four word next neighbor colloca-      approach may be applied to texts from arbitrary domains
tion Gesetz über die Entschädigung. Similarly                  and different languages. Especially, our results may be
Stundenkilometer is analyzed as Kilometer pro Stunde.          used for the automatic generation of semantic relations in
In these examples, result 1 is not enough because there are    order to fill and expand ontology hierarchies.
collocations like Woche auf dem Tisch which do not de-
scribe a meaningful semantic relation.
                                                               7     References
5.3    Combining Three Results                                 Armstrong, S. (ed.) (1993). Using Large Corpora. Computa-
Result 1: There is relation r between A and B                    tional Linguistics 19(1/2) (1993) [Special Issue on Corpus
Result 2: B is similar to B’ (cohyponymy)                        Processing, repr. MIT Press 1994].
Result 3: There is some strong but unknown relation be-        Bentley, J.; Sedgewick, R. (1998). “Ternary Search Trees.”
tween A and B’                                                   In: Dr. Dobbs Journal, April 1998.
Conclusion: There is a relation r between A and B’
Chandrasekaran, B. et al. (1999). “What are Ontologies, and Why Do We Need Them?” In: Intelligent Systems 14(1) (1999), 20-26.
Davidson, R.; Harel, D. (1996). “Drawing Graphs Nicely Using Simulated Annealing.” In: ACM Transactions on Graphics 15(4), 301-331.
Francis, W.; Kucera, H. (1982). Frequency Analysis of English Language. Boston: Houghton Mifflin.
Heyer, G.; Quasthoff, U.; Wolff, Ch. (2000). “Aiding Web Searches by Statistical Classification Tools.” In: Knorz, G.; Kuhlen, R. (eds.) (2000). Informationskompetenz - Basiskompetenz in der Informationsgesellschaft. Proc. 7. Intern. Symposium f. Informationswissenschaft, ISI 2000, Darmstadt. Konstanz: UVK, 163-177.
Krenn, B. (2000). “Distributional and Linguistic Implications of Collocation Identification.” In: Proc. Collocations Workshop, DGfS Conference, Marburg, March 2000.
Läuter, M.; Quasthoff, U. (1999). “Kollokationen und semantisches Clustering.” In: Gippert, J. (ed.) (1999). Multilinguale Corpora. Codierung, Strukturierung, Analyse. Proc. 11. GLDV-Jahrestagung. Prague: Enigma Corporation, 34-41.
Lemnitzer, L. (1998). “Komplexe lexikalische Einheiten in Text und Lexikon.” In: Heyer, G.; Wolff, Ch. (eds.). Linguistik und neue Medien. Wiesbaden: Dt. Universitätsverlag, 85-91.
Manning, Ch. D.; Schütze, H. (1999). Foundations of Statistical Language Processing. Cambridge/MA, London: The MIT Press.
Quasthoff, U. (1998A). “Tools for Automatic Lexicon Maintenance: Acquisition, Error Correction, and the Generation of Missing Values.” In: Proc. First International Conference on Language Resources & Evaluation [LREC], Granada, May 1998, Vol. II, 853-856.
Quasthoff, U. (1998B). “Projekt der deutsche Wortschatz.” In: Heyer, G.; Wolff, Ch. (eds.). Linguistik und neue Medien. Wiesbaden: Dt. Universitätsverlag, 93-99.
Quasthoff, U.; Wolff, Ch. (2000). “An Infrastructure for Corpus-Based Monolingual Dictionaries.” In: Proc. LREC-2000. Second International Conference On Language Resources and Evaluation. Athens, May/June 2000, Vol. I, 241-246.
Sinclair, J. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press.
Smadja, F. (1993). “Retrieving Collocations from Text: Xtract.” In: Computational Linguistics 19(1) (1993), 143-177.
Svartvik, J. (ed.) (1992). Directions in Corpus Linguistics: Proc. Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin: Mouton de Gruyter [=Trends in Linguistics 65].
van der Vet, P. E.; Mars, N. J. I. (1998). “Bottom-Up Construction of Ontologies.” In: IEEE Transactions on Knowledge and Data Engineering 10(4) (1998), 513-526.

8     Appendix: Clustering Examples
8.1    Example (1): Clustering Months and Days
Jahres     _____________________ Uhr, Ende, abend, vergangenen, Anfang, Jahres, Samstag, Freitag, Mitte, Sonntag
Donnerstag _                      | Uhr, abend, heutigen, Nacht, teilte, Mittwoch, Freitag, worden, mitteilte, sagte
Dienstag   _|_                   | Uhr, abend, heutigen, teilte, Freitag, worden, kommenden, sagte, mitteilte, Nacht
Montag     _ |                   | Uhr, abend, heutigen, Dienstag, kommenden, teilte, Freitag, worden, sagte, morgen
Mittwoch   _|_|_                 | Uhr, abend, heutigen, Nacht, Samstag, Freitag, Sonntag, kommenden, nachmittag
Samstag    ___ |                 | Uhr, abend, Samstag, Nacht, Sonntag, Freitag, Montag, nachmittag, heutigen
Sonntag    _ | |                 | Uhr, abend, Samstag, Nacht, Montag, kommenden, morgen, nachmittag, vergangenen
Freitag    _|_|_|_____________ | Uhr, abend, Ende, Jahres, Samstag, Anfang, Freitag, Sonntag, heutigen, worden
Januar     _________________ | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, März, Januar
August     _______________ | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, Januar, März
Juli       _____________ | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Samstag, August, Januar, März
März       ___________ | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, Januar, März, April
Mai        _________ | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, März, Januar, Mai, vergangenen
September _______ | | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
Februar    _       | | | | | | | | Uhr, Januar, Jahres, Anfang, Mitte, Ende, März, November, Samstag, vergangenen
Dezember   _|___ | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
November   _     | | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, September, vergangenen, Dezember, Samstag
Oktober    _|_ | | | | | | | | | Uhr, Ende, Jahres, Anfang, Mai, Mitte, Samstag, September, März, vergangenen
April      _ | | | | | | | | | | Uhr, Ende, Jahres, Mai, Anfang, März, Mitte, Prozent, Samstag, Hauptversammlung
Juni       _|_|_|_|_|_|_|_|_|_|_|_

8.2    Example (2): Clustering Leaders
Präsident    _________ sagte, Boris Jelzin, erklärte, stellvertretende, Bill Clinton, stellvertretender, Richter
Vorsitzender _______ | sagte, erklärte, stellvertretende, stellvertretender, Richter, Abteilung, bestätigte
Vorsitzende ___     | | sagte, erklärte, stellvertretende, Richter, bestätigte, Außenministeriums, teilte, gestern
Sprecher     _ |    | | sagte, erklärte, Außenministeriums, bestätigte, teilte, gestern, mitteilte, Anfrage
Sprecherin   _|_|_ | | sagte, erklärte, stellvertretende, Richter, Abteilung, bestätigte, Außenministeriums, sagt
Chef         _    | | | Abteilung, Instituts, sagte, sagt, stellvertretender, Professor, Staatskanzlei, Dr.
Leiter       _|___|_|_|_

8.3    Example (3): Clustering Verbs of Utterance
verwies   _____________ Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, gebe
mitteilte ___________ | Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, Montag
meinte    _______    | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
bestätigte _____ |    | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
betonte   ___ | |    | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden, Bonn
sagte     _ | | |    | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden
erklärte _|_|_|_|_ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, Anfrage, gebe, Interview
warnte    _        | | | Präsident, Vorsitzende, SPD, eindringlich, Ministerpräsident, CDU, Außenminister, Zugleich
sprach    _|_______|_|_|_
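
The dendrograms above group words that share many collocates. The following is a minimal sketch of how such a grouping could be computed, assuming Jaccard overlap between collocation sets and single-link agglomerative merging. Neither choice is stated in this appendix, so both are assumptions, as are the function names. The toy sets reuse a few words that occur in Example (1) but are shortened and hypothetical, not the actual collocation sets.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two collocation sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(collocations):
    """Single-link agglomerative clustering over collocation sets.

    Returns the merges in the order they happen, as
    (cluster_1, cluster_2, similarity) triples.
    """
    clusters = [frozenset([word]) for word in collocations]
    merges = []

    def link(c1, c2):
        # Single link: similarity of the two most similar members.
        return max(
            jaccard(collocations[x], collocations[y]) for x in c1 for y in c2
        )

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        merges.append((set(c1), set(c2), link(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Hypothetical, shortened collocation sets (assumption, for illustration only).
toy = {
    "Montag": {"Uhr", "abend", "heutigen", "kommenden", "teilte"},
    "Dienstag": {"Uhr", "abend", "heutigen", "kommenden", "Nacht"},
    "Januar": {"Uhr", "Ende", "Anfang", "Mitte"},
    "März": {"Uhr", "Ende", "Anfang", "Jahres"},
}

for left, right, similarity in cluster(toy):
    print(sorted(left), sorted(right), round(similarity, 3))
```

On these toy sets the two weekdays are merged first, then the two months, and the two resulting groups are joined last at a much lower similarity, which mirrors the separation of days and months in Example (1).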