Learning Relations using Collocations

Gerhard Heyer, Martin Läuter, Uwe Quasthoff, Thomas Wittig, Christian Wolff
Leipzig University
Computer Science Institute, Natural Language Processing Department
Augustusplatz 10 / 11
D-04109 Leipzig
{heyer, laeuter, quasthoff, wittig, wolff}@informatik.uni-leipzig.de

Abstract
This paper describes the application of statistical analysis of large corpora to the problem of extracting semantic relations from unstructured text. We regard this approach as a viable method for generating input for the construction of ontologies, as ontologies use well-defined semantic relations as building blocks (cf. van der Vet & Mars 1998). Starting from a short description of our corpora as well as our language analysis tools, we discuss in depth the automatic generation of collocation sets. We further give examples of different types of relations that may be found in collocation sets for arbitrary terms. The central question we deal with here is how to postprocess statistically generated collocation sets in order to extract named relations. We show that for different types of relations like cohyponyms or instance-of-relations, different extraction methods as well as additional sources of information can be applied to the basic collocation sets in order to verify the existence of a specific type of semantic relation for a given set of terms.

1 Analysis of Large Text Corpora
Corpus Linguistics is generally understood as a branch of computational linguistics dealing with large text corpora for the purpose of statistical processing of language data (cf. Armstrong 1993, Manning & Schütze 1999). With the availability of large text corpora and the success of robust corpus processing in the nineties, this approach has recently become increasingly popular among computational linguists (cf. Sinclair 1991, Svartvik 1992).

Since 1995 a German text corpus of more than 300 million words has been collected (cf. Quasthoff 1998B, Quasthoff & Wolff 2000), containing approx. 6 million different word forms in approx. 13 million sentences, which serves as input for the analysis methods described below. Similarly structured corpora have recently been set up for other European languages as well (English, French, Dutch), with more languages to follow in the near future (see table 1).

                German    English   Dutch     French
word tokens     300 M     250 M     22 M      15 M
sentences       13.4 M    13 M      1.5 M     860,000
word types      6 M       1.2 M     600,000   230,000
Table 1: Basic Characteristics of the Corpora

The basic goal of this corpus-based approach is to collect large amounts of textual data as input for semantic processing. Starting from a rather simple data model tailored to large amounts of data and efficient processing, with a relational database system at the storage level, we employ a simple yet powerful technical infrastructure for processing texts to be included in the corpus. Besides basic procedures for integrating texts into the corpus, various tools have been developed for post-processing linguistic data. Among them, the automatic calculation of sentence-based word collocations stands out as an especially valuable tool for corpus-based language technology applications (see Quasthoff 1998A, Quasthoff & Wolff 2000). Additional application-oriented tools exist for search engine optimization as well as automatic document classification (see Heyer, Quasthoff & Wolff 2000). The corpora are available on the WWW (http://www.wortschatz.uni-leipzig.de) and may be used as a large online dictionary.

2 Collocations
The occurrence of two or more words within a well-defined unit of information (sentence, document) is called a collocation. For the selection of meaningful and significant collocations, an adequate collocation measure has to be defined. In the literature, quite a number of different collocation measures can be found; for an in-depth discussion of various collocation measures and their application cf. Smadja 1993, Lemnitzer 1998, Krenn 2000.

2.1 The Collocation Measure
In the following, our approach towards measuring the significance of the joint occurrence of two words A and B in a sentence is discussed. Let
a, b be the number of sentences containing A and B, respectively,
k be the number of sentences containing both A and B,
n be the total number of sentences.

Our significance measure calculates the probability of joint occurrence of rare events. The results of this measure are quite similar to the well-known log-likelihood measure (cf. Krenn 2000). Let x = ab/n and define:

$$\mathrm{sig}(A, B) = \frac{-\log\left(1 - e^{-x} \sum_{i=0}^{k-1} \frac{x^i}{i!}\right)}{\log n}$$

For 2x < k, we get the following approximation, which is much easier to calculate:

$$\mathrm{sig}(A, B) = \frac{x - k \log x + \log k!}{\log n}$$

In the case of next neighbor collocations we replace the definition of the above variables by the following. Instead of a sentence we consider pairs (A, B) of words which are next neighbors in this sentence. Hence, instead of one sentence of n words we have n - 1 pairs. For right neighbor collocations (A, B) let
a, b be the number of pairs of type (A, ?) and (?, B), respectively,
k be the number of pairs (A, B),
n be the total number of pairs. This equals the total number of running words minus the number of sentences.

Given these variables, the significance measure is calculated as shown above. In general, this measure yields semantically acceptable collocation sets for values above an empirically determined positive threshold (see examples in section 3 below).

2.2 Properties of the Collocation Measure
In order to describe basic properties of this measure, we write sig(n, k, a, b) instead of sig(A, B), where n, k, a, and b are defined as above.

Simple co-occurrence: A and B occur only once, and they occur together:
$$\mathrm{sig}(n, 1, 1, 1) \to 1 \quad (\text{for } n \to \infty)$$

Independence: A and B occur statistically independently with probabilities p and q:
$$\mathrm{sig}(n, npq, np, nq) \to 0 \quad (\text{for } n \to \infty)$$

Additivity: The unification of the words B and B' just adds the corresponding significances. For $k/b \approx k'/b'$ we have
$$\mathrm{sig}(n, k, a, b) + \mathrm{sig}(n, k', a, b') \approx \mathrm{sig}(n, k + k', a, b + b')$$

Enlarging the corpus by a factor m:
$$\mathrm{sig}(mn, mk, ma, mb) = m \, \mathrm{sig}(n, k, a, b)$$

2.3 Finding Collocations
To calculate the collocation measure for all reasonable pairs, we first count the joint occurrences of each pair. This problem is complex both in time and storage. Nevertheless, we managed to calculate the collocation measure for any pair with a total frequency of at least 3 for each component. Our approach is based on extensible ternary search trees (cf. Bentley & Sedgewick 1998) where a count can be associated with a pair of word numbers. The memory overhead of the original implementation could be reduced by allocating the space for chunks of 100,000 nodes at once. Even when using this technique on a large-memory computer, more than one run through the corpus may be necessary, taking care that every pair is only counted once. The resulting word pairs above a threshold significance are put into a database where they can be accessed and grouped in many different ways. As collocations are calculated for different language corpora, our examples will be taken from the English as well as the German database.

2.4 Visualization of Collocations
Besides textual output of collocation sets, visualizing them as graphs is an additional type of representation: We choose a word and arrange its collocates in the plane so that collocations between collocates are taken into account. This results in graphs that show homogeneity where words are interconnected and separation where collocates have little in common. Linguistically speaking, polysemy is made visible (see fig. 1 below). Technically speaking, we use simulated annealing to position the words (see Davidson & Harel 1996). Line thickness represents the significance of the collocation. Of course, all words in the graph are linked to the central word; the rest of the picture is automatically computed, but represents semantic connectedness surprisingly well. Unfortunately, the relations between the words are just presented, but not yet named. Fig. 1 shows the collocation graph for space. Three different meaning contexts can be recognized in the graph:
• real estate,
• computer hardware, and
• astronautics.
The connection between address and memory results from the fact that address is another polysemous concept.

Fig. 1: Collocation Graph for space

3 Relations Represented by Collocations
If we fix one word and look at its set of collocates, then some semantic relations appear more often than others. The following example shows the most significant collocations for king ordered by significance:

queen (90), mackerel (83), hill (49), Milken (47), royal (44), monarch (33), King (30), crowned (30), migratory (30), rook (29), throne (29), Jordanian (26), junk-bond (26), Hussein (25), Saudi (25), monarchy (25), crab (23), Jordan (22), Lekhanya (21), Prince (21), Michael (20), Jordan's (19), palace (19), undisputed (18), Elvis (17), Shah (17), deposed (17), Panchayat (16), Zahir (16), fishery (16), former (16), junk (16), constitution (15), exiled (15), Bhattarai (14), Presley (14), Queen (14), crown (14), dethroned (14), him (14), Arab (13), Moshoeshoe (13), himself (13), pawns (13), reigning (13), Fahd (12), Nepali (12), Rome (12), Saddam (12), once (12), pawn (12), prince (12), reign (12), [...] government (10) [...]

The following types of relations can be identified:
• cohyponymy (e.g. Shah, queen, rook, pawn),
• top-level syntactic relations, which translate to semantic 'actor-verb' relations and frequently used properties of a noun (reign; royal, crowned, dethroned),
• instance-of (Fahd, Hussein, Moshoeshoe),
• special relations given by multiwords (A prep/det/conj B, e.g. king of Jordan), and
• unstructured sets of words describing some subject area, e.g. constitution, government.
Note that synonymy rarely occurs in the lists. The relations may be classified according to the properties symmetry, anti-symmetry, and transitivity.

3.1 Symmetric Relations
Let us call a relation r symmetric if r(A, B) always implies r(B, A). Examples of symmetric relations are
• synonymy,
• cohyponymy (or similarity),
• elements of a certain subject area, and
• relations of unknown type.
Usually, sentence collocations express symmetric relations.

3.2 Anti-symmetric Relations
Let us call a relation r anti-symmetric if r(A, B) never implies r(B, A). Examples of anti-symmetric relations are
• hyponymy and
• relations between properties and their owners like action and actor or class and instance.
Usually, next neighbor collocations of two words express anti-symmetric relations. In the case of next neighbor collocations consisting of more than two words (like A prep/det/conj B, e.g. Samson and Delilah), the relation might be symmetric, for instance in the case of conjunctions like and or or (cf. Läuter & Quasthoff 1999).

3.3 Transitivity
Transitivity of a relation means that r(A, B) and r(B, C) always imply r(A, C). In general, a relation found experimentally will not be transitive, of course. But there may be a part where transitivity holds.
Some of the most prominent transitive relations are the cohyponymy, hyponymy, synonymy, and is-a relations. Note that our graphical representation mainly shows transitive relations by construction. This kind of relation is also able to give further results in the combination procedures described below.

4 Other Sources for Relations
While we may identify types of semantic relations in collocation sets by inspection, additional information and / or analysis is needed for automatically naming these relations. In the following, we give different examples of such complementary information.

4.1 Pattern Based Relations
Simple pattern-based relations can be extracted from text if knowledge about information categories like proper names is used as input. As our corpora include several large lists of classified terms like names of professions and last names, extraction rules may be defined:
i. Extraction of first names: A pattern like (profession) ? (last name) implies (with high probability) that the unknown category ? is in fact a first name. Examples are
   actress Julia Roberts
   hockey hero Wayne Gretzky
   Senator Jesse Helms
ii. Extraction of instance-of-relations given the class name: The pattern (class name) like ? implies (with high probability) that the unknown category ? is in fact an instance name. Examples are:
   metals like nickel, arsenic and lead
   rivers like the Ganges
   newspapers like Pravda
The applicability of patterns like these may heavily depend on language characteristics like preposition usage. This type of extraction method is simple and well known; in our approach it is combined with collocation analysis, thus yielding better results both in quality and in quantity (see section 5).

4.2 Compounds
German compounds consist of two (or more) words glued together by varying mechanisms. The head word (coming second) is further determined by the first part of the compound (modifier), which may originally be an adjective, another noun or a verb stem. In almost all cases a semantic relation between both parts and the compound can be found. In section 5.2 we show how the combination of compound segmentation with collocation analysis can be used for identifying named relations in compounds.

4.3 Feature Vectors Given by Collocations and Clustering
To investigate the meaning of a word A, its contexts in the texts have to be examined because they reflect the use of A. If two words A and B have similar contexts, that is, they are alike in their use, this indicates that there is a semantic relation of some kind between A and B.
A kind of average context for every word A is formed by all collocations for A with a significance above a certain threshold.
This average context of A is turned into a feature vector of A, using all words as features as usual. This results in sparse vectors used for description. The feature vector of word A is indeed a description of the meaning of A, because the most important words of the contexts of A are included.
Clustering of feature vectors can be used to investigate the relations between a group of similar words and to figure out whether or not all the relations are of the same kind. The following HACM algorithm has an additional natural reason to stop. It works bottom up like this:
• All words are treated as (basic) items. Each item has a description (feature vector).
• In each step of the clustering process the two items A and B with the most similar description vectors are searched for and fitted together to create a new complex item C combining the words in A and B. The scalar product is used for determining similarity between vectors. Each step of the clustering algorithm reduces the number of items by one.
• The feature vector for C is constructed from the feature vectors of A and B. To this end we calculate a combined significance for C with respect to all words X_i as follows:

$$\mathrm{sig}(C, X_i) = \frac{n_a}{n_a + n_b}\,\mathrm{sig}(A, X_i) + \frac{n_b}{n_a + n_b}\,\mathrm{sig}(B, X_i)$$

  for all i, 1 ≤ i ≤ n, with
  n the total number of words in the corpus,
  n_a the number of words combined in item A, and
  n_b the number of words combined in item B.
• The algorithm stops if only one item is left or if all remaining feature vectors are orthogonal. This usually results in a very natural clustering if the threshold for constructing the feature vectors is suitably chosen.
A cluster of words with probably the same semantic relation between each of them can be found in the analysis tree by comparing the similarity between items inside the items A and B (if these items are complex) with the calculated similarity between A and B when fitting them together to C. If there is a large difference between them, this is an indication of a different relation between the words combined in item A and the words combined in item B. In the appendix, some examples of this type of semantic clustering are given.

Symmetric clustering
If we assume that a cluster represents a semantic relation, the cluster should represent the possible symmetry and transitivity of the underlying semantic relation.
Symmetry and transitivity ensure that the terms to be clustered will themselves be responsible for the clustering. This in turn implies that the terms found in the cluster will also be found in the feature vector in prominent positions.
In example 1 (Appendix) the clustering result for January is shown. In the first column we find the terms to be clustered; on the right hand side there are the components of the feature vectors ordered by significance.
The clustered items both appear together and share a certain aspect. The names of the months or weekdays as names for periods of time cluster together, just because they are collocates of one another. The same can be shown to be true for teammates, metals, colors or fruit.

Anti-symmetric clustering
For anti-symmetric relations the situation is different. Again the elements of the original set to be clustered share a certain aspect, but this aspect is described by a distinct set of words. Presumably this second set of words will also cluster. Moreover, it will use the original set as clustering terms.
This is shown in example 2 (Appendix). Here we show that the set given by Präsident, Vorsitzender, Vorsitzende, Sprecher, Sprecherin properly clusters using words like sagte, erklärte, teilte (German verbs of utterance).
Conversely, in example 3 (Appendix) we find that the set verwies, mitteilte, meinte, bestätigte, betonte properly clusters using terms from the above cluster.

4.4 Homogeneous Relations: Iterating the Collocation Process
The extraction of collocation sets from plain text can be viewed as some kind of information condensation. This process can be iterated if collocation sets themselves are subjected to the collocation analysis again and again. We might expect that some of the collocational relations are strengthened while others will vanish from the iterated sets of collocations, which we will call higher order collocations. We describe two experiments for the iteration process: Instead of plain text we start with collocation sets, using sentence collocations for experiment 1 and next neighbor collocations for experiment 2. In the case of a symmetric relation we observe a strengthening while iterating sentence collocations. In the case of an anti-symmetric relation we observe the same when iterating next neighbor collocations.

Experiment 1: Iterating Sentence Collocations
The production of collocations is applied to sets of sentence collocations instead of sentences. E.g., the collection of 500,000 sentence collocations has the following 'sentence' (collocation set) for Hemd (shirt):
Hemd Krawatte Hose weißes Anzug weißem Jeans trägt trug bekleidet weißen Jacke schwarze Jackett schwarzen Weste kariertes Schlips Mann
Example for iterated sentence collocations of Eisen (iron):
Original collocations: Stahl, heißes, heiße, Kupfer, Mangan, alten, Feuer, Zink, Holz, Marmor
Iterated collocations: Kupfer, Stahl, Zink, Aluminium, Magnesium, Mangan, Nickel, Blei, Zinn, Gold
As expected, the iterated collocation set only contains cohyponyms.

Experiment 2: Iterating Next Neighbor Collocations
In this experiment, the production of collocations is applied to sets of next neighbor collocations instead of sentences. The collection of 250,000 next neighbor collocations has the following two 'sentences' for Hemd (shirt):
weißes weißem weißen blaues kariertes kariertem offenem aufs karierten gestreiftes letztes [...] (left neighbors)
näher bekleidet ausgezogen spannt trägt aufknöpft ausgeplündert auszieht wechseln aufgeknöpft ausziehen [...] (right neighbors)
Example for iterated neighbor collocations of Auto (car):
Original collocations: fahren, Wagen, prallte, Fahrer, seinem, fuhr, fährt, Polizei, erfaßt, gefahren
Iterated collocations: Wagen, Lastwagen, Fahrzeug, Autos, Personenwagen, Bus, Zug, Haus, Lkw, Pkw
Example for iterated neighbor collocations of erklärte (explained):
Original collocations: Sprecher, werde, gestern, seien, Wir, bereit, wolle, Vorsitzende, Anfrage, Präsident
Iterated collocations: sagte, betonte, sprach, kündigte, wies, nannte, warnte, bekräftigte, meinte, kritisierte
Both experiment 1 and experiment 2 result in collocation sets carrying a homogeneous semantic relation.

5 Combining Non-contradictory Partial Results
In section 3 we have given evidence that collocation sets contain various types of semantic relations without explicitly naming them, while section 4 has introduced a number of methods for relation extraction.
This section shows different ways of combining the results of these extraction approaches. Combining them yields more and / or better results.

5.1 Identical Results
Two or more of the above algorithms may suggest a certain relation between two words, for instance, cohyponymy.
Example: If both the second order collocations introduced in section 4.4 and clustering by feature vectors (section 4.3) independently yield similar sets of words as a result, this may be taken as an indication of cohyponymy between the words, e.g. sagte, betonte, kündigte, wies, nannte, warnte, bekräftigte, meinte […] (German verbs of utterance).

5.2 Supporting Second Results
In the second combination type a known relation given by one method of extraction is verified by an identical but unnamed second result as follows:
Result 1: There is a certain relation r between A and B
Result 2: There is some strong (but unknown) relation between A and B (e.g. given by a collocation set)
Conclusion: Result 1 holds with more evidence.
One can use this support of orthogonal tests in many ways: Without knowing anything about deeper language structure or parsing we can filter out verbs just by testing if a string accepts at least two of the endings -(e)s, -ing and -ed/-t. The recall is remarkably high. In German we tested only one mechanism of noun formation from a verb and got 70% of all verbs with a precision of 83%.
Word formation mechanisms can be explored further. In German, compound nouns are joined together to form one word. There are several (highly irregular) patterns of gluing letters between the words. Testing all available word tokens for whether they could be the compound of two stemmed words from word lists of 93,000 current nouns reveals just under a million compounds in their stemmed form. Here stemming accuracy is supported by the existence of both components in the basic list. When eliminating a hundred words which are prone to generate wrong separations, this algorithm achieves an accuracy of 90%.
Example:
Result 2: The German compound Entschädigungsgesetz can be divided into Gesetz and Entschädigung with an unknown relation.
Result 1 is given by the four-word next neighbor collocation Gesetz über die Entschädigung. Similarly, Stundenkilometer is analyzed as Kilometer pro Stunde.
In these examples, result 1 alone is not enough because there are collocations like Woche auf dem Tisch which do not describe a meaningful semantic relation.

5.3 Combining Three Results
Result 1: There is a relation r between A and B
Result 2: B is similar to B' (cohyponymy)
Result 3: There is some strong but unknown relation between A and B'
Conclusion: There is a relation r between A and B'
Example: As result 1 we might know that Schwanz (tail) is part of Pferd (horse). Similar terms to Pferd are both Kuh (cow) and Hund (dog) (result 2). Both of them have the term Schwanz in their set of significant collocations (result 3). Hence we might correctly conjecture that both Kuh and Hund have a tail (Schwanz) as part of their body. In contrast, Reiter (rider) is a strong collocation of Pferd and might (incorrectly) be conjectured to be another similar concept, but Reiter is no collocation with respect to Schwanz. Hence, the absence of result 3 prevents us from drawing an incorrect conclusion.

5.4 Similarity Used to Infer a Strong Property
Let us call a property p important if it is preserved under similarity. This strong feature can be used as follows:
Result 1: A has a certain important property p
Result 2: B is similar to A (i.e., B is a cohyponym of A)
Conclusion: B has the same property p
Example: We consider A and B as similar if they are in the set of right neighbor collocations of Hafenstadt (port town) (result 2). If we know that Hafenstadt is a property of its typical right neighbors (result 1), we may infer this property for more than 200 cities like Split, Sidon, Durban, Kismayo, Tyrus, Vlora, Karachi, Durres, […].

5.5 Subject Area Inferred from Collocation Sets
Result 1: A, B, C, ... are collocates of a certain term.
Result 2: Some of them belong to a certain subject area.
Conclusion: All of them belong to this subject area.
Example: Consider the following top entries in the collocation set of carcinoma: patients, cell, squamous, radiotherapy, lung, thyroid, treated, hepatocellular, metastases, adenocarcinoma, cervix, irradiation, breast, treatment, CT, therapy, renal, cases, bladder, cervical, tumor, cancer, metastatic, radiation, uterine, ovarian, chemotherapy, […]
If we know that some of them belong to the subject area Medicine, we can add this subject area to the other members of the collocation set as well.

6 Conclusion
In this paper, we described different approaches for the extraction of named semantic relations from large text corpora. The types of relations are compatible with relations typically used for constructing ontologies (cf. Chandrasekaran 1999:22). The combination of different types of input information as well as the application of robust statistical analysis methods guarantees that this approach may be applied to texts from arbitrary domains and different languages. In particular, our results may be used for the automatic generation of semantic relations in order to fill and expand ontology hierarchies.

7 References
Armstrong, S. (ed.) (1993). Using Large Corpora. Computational Linguistics 19(1/2) (1993) [Special Issue on Corpus Processing, repr. MIT Press 1994].
Bentley, J.; Sedgewick, R. (1998). "Ternary Search Trees." In: Dr. Dobb's Journal, April 1998.
Chandrasekaran, B. et al. (1999). "What are Ontologies, and Why Do We Need Them?" In: Intelligent Systems 14(1) (1999), 20-26.
Davidson, R.; Harel, D. (1996). "Drawing Graphs Nicely Using Simulated Annealing." In: ACM Transactions on Graphics 15(4), 301-331.
Francis, W.; Kucera, H. (1982). Frequency Analysis of English Language. Boston: Houghton Mifflin.
Heyer, G.; Quasthoff, U.; Wolff, Ch. (2000). "Aiding Web Searches by Statistical Classification Tools." In: Knorz, G.; Kuhlen, R. (eds.) (2000). Informationskompetenz - Basiskompetenz in der Informationsgesellschaft. Proc. 7. Intern. Symposium f. Informationswissenschaft, ISI 2000, Darmstadt. Konstanz: UVK, 163-177.
Krenn, B. (2000). "Distributional and Linguistic Implications of Collocation Identification." In: Proc. Collocations Workshop, DGfS Conference, Marburg, March 2000.
Läuter, M.; Quasthoff, U. (1999). "Kollokationen und semantisches Clustering." In: Gippert, J. (ed.) (1999). Multilinguale Corpora. Codierung, Strukturierung, Analyse. Proc. 11. GLDV-Jahrestagung. Prague: Enigma Corporation, 34-41.
Lemnitzer, L. (1998). "Komplexe lexikalische Einheiten in Text und Lexikon." In: Heyer, G.; Wolff, Ch. (eds.). Linguistik und neue Medien. Wiesbaden: Dt. Universitätsverlag, 85-91.
Manning, Ch. D.; Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge/MA, London: The MIT Press.
Quasthoff, U. (1998A). "Tools for Automatic Lexicon Maintenance: Acquisition, Error Correction, and the Generation of Missing Values." In: Proc. First International Conference on Language Resources & Evaluation [LREC], Granada, May 1998, Vol. II, 853-856.
Quasthoff, U. (1998B). "Projekt der deutsche Wortschatz." In: Heyer, G.; Wolff, Ch. (eds.). Linguistik und neue Medien. Wiesbaden: Dt. Universitätsverlag, 93-99.
Quasthoff, U.; Wolff, Ch. (2000). "An Infrastructure for Corpus-Based Monolingual Dictionaries." In: Proc. LREC-2000. Second International Conference On Language Resources and Evaluation. Athens, May/June 2000, Vol. I, 241-246.
Sinclair, J. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press.
Smadja, F. (1993). "Retrieving Collocations from Text: Xtract." In: Computational Linguistics 19(1) (1993), 143-177.
Svartvik, J. (ed.) (1992). Directions in Corpus Linguistics: Proc. Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin: Mouton de Gruyter [=Trends in Linguistics 65].
van der Vet, P. E.; Mars, N. J. I. (1998). "Bottom-Up Construction of Ontologies." In: IEEE Transactions on Knowledge and Data Engineering 10(4) (1998), 513-526.

8 Appendix: Clustering Examples
Each row gives a clustered term, its branch in the analysis tree, and the components of a feature vector ordered by significance.

8.1 Example (1): Clustering Months and Days
Jahres _____________________ Uhr, Ende, abend, vergangenen, Anfang, Jahres, Samstag, Freitag, Mitte, Sonntag
Donnerstag _ | Uhr, abend, heutigen, Nacht, teilte, Mittwoch, Freitag, worden, mitteilte, sagte
Dienstag _|_ | Uhr, abend, heutigen, teilte, Freitag, worden, kommenden, sagte, mitteilte, Nacht
Montag _ | | Uhr, abend, heutigen, Dienstag, kommenden, teilte, Freitag, worden, sagte, morgen
Mittwoch _|_|_ | Uhr, abend, heutigen, Nacht, Samstag, Freitag, Sonntag, kommenden, nachmittag
Samstag ___ | | Uhr, abend, Samstag, Nacht, Sonntag, Freitag, Montag, nachmittag, heutigen
Sonntag _ | | | Uhr, abend, Samstag, Nacht, Montag, kommenden, morgen, nachmittag, vergangenen
Freitag _|_|_|_____________ | Uhr, abend, Ende, Jahres, Samstag, Anfang, Freitag, Sonntag, heutigen, worden
Januar _________________ | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, März, Januar
August _______________ | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, August, Januar, März
Juli _____________ | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Samstag, August, Januar, März
März ___________ | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, Mai, Januar, März, April
Mai _________ | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Samstag, März, Januar, Mai, vergangenen
September _______ | | | | | | | Uhr, Ende, Jahres, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
Februar _ | | | | | | | | Uhr, Januar, Jahres, Anfang, Mitte, Ende, März, November, Samstag, vergangenen
Dezember _|___ | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, Mai, Januar, März, Samstag, vergangenen
November _ | | | | | | | | | Uhr, Jahres, Ende, Anfang, Mitte, September, vergangenen, Dezember, Samstag
Oktober _|_ | | | | | | | | | Uhr, Ende, Jahres, Anfang, Mai, Mitte, Samstag, September, März, vergangenen
April _ | | | | | | | | | | Uhr, Ende, Jahres, Mai, Anfang, März, Mitte, Prozent, Samstag, Hauptversammlung
Juni _|_|_|_|_|_|_|_|_|_|_|_

8.2 Example (2): Clustering Leaders
Präsident _________ sagte, Boris Jelzin, erklärte, stellvertretende, Bill Clinton, stellvertretender, Richter
Vorsitzender _______ | sagte, erklärte, stellvertretende, stellvertretender, Richter, Abteilung, bestätigte
Vorsitzende ___ | | sagte, erklärte, stellvertretende, Richter, bestätigte, Außenministeriums, teilte, gestern
Sprecher _ | | | sagte, erklärte, Außenministeriums, bestätigte, teilte, gestern, mitteilte, Anfrage
Sprecherin _|_|_ | | sagte, erklärte, stellvertretende, Richter, Abteilung, bestätigte, Außenministeriums, sagt
Chef _ | | | Abteilung, Instituts, sagte, sagt, stellvertretender, Professor, Staatskanzlei, Dr.
Leiter _|___|_|_|_

8.3 Example (3): Clustering Verbs of Utterance
verwies _____________ Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, gebe
mitteilte ___________ | Sprecher, werde, gestern, Vorsitzende, Polizei, Sprecherin, Anfrage, Präsident, Montag
meinte _______ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
bestätigte _____ | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Anfrage, Präsident, gebe, Interview
betonte ___ | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden, Bonn
sagte _ | | | | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, gebe, Interview, würden
erklärte _|_|_|_|_ | | Sprecher, werde, gestern, Vorsitzende, Sprecherin, Präsident, Anfrage, gebe, Interview
warnte _ | | | Präsident, Vorsitzende, SPD, eindringlich, Ministerpräsident, CDU, Außenminister, Zugleich
sprach _|_______|_|_|_
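
The trees in this appendix come from the bottom-up clustering procedure of section 4.3. As an illustration only, the following is a minimal Python sketch of that procedure under stated assumptions: feature vectors are stored as sparse dictionaries mapping a collocate to its significance, the combined significance of a merged item is read as the size-weighted average of the two merged vectors, and "orthogonal" is taken to mean a scalar product of zero. The function names (`dot`, `merge`, `cluster`) and the toy significance values are hypothetical and are not taken from the corpus or from the authors' implementation.

```python
from itertools import combinations


def dot(u, v):
    """Scalar product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[f] for f, w in u.items() if f in v)


def merge(vec_a, n_a, vec_b, n_b):
    """Size-weighted average of two feature vectors (assumed reading of 4.3)."""
    total = n_a + n_b
    merged = {}
    for f, w in vec_a.items():
        merged[f] = merged.get(f, 0.0) + w * n_a / total
    for f, w in vec_b.items():
        merged[f] = merged.get(f, 0.0) + w * n_b / total
    return merged


def cluster(vectors):
    """Merge the two most similar items until one item is left or all
    remaining vectors are orthogonal. Returns (clusters, merge history)."""
    items = [((word,), dict(vec)) for word, vec in vectors.items()]
    history = []
    while len(items) > 1:
        best, pair = 0.0, None
        for i, j in combinations(range(len(items)), 2):
            s = dot(items[i][1], items[j][1])
            if s > best:
                best, pair = s, (i, j)
        if pair is None:  # every remaining pair has scalar product 0
            break
        i, j = pair
        (words_a, vec_a), (words_b, vec_b) = items[i], items[j]
        history.append((words_a, words_b, best))
        new_item = (words_a + words_b,
                    merge(vec_a, len(words_a), vec_b, len(words_b)))
        items = [it for k, it in enumerate(items) if k not in (i, j)]
        items.append(new_item)
    return [words for words, _ in items], history


# Hypothetical significance values, invented for illustration only.
toy = {
    "Montag": {"Uhr": 5.0, "abend": 4.0},
    "Dienstag": {"Uhr": 4.0, "abend": 3.0},
    "Januar": {"Ende": 6.0, "Anfang": 5.0},
    "Februar": {"Ende": 5.0, "Anfang": 4.0},
}
clusters, history = cluster(toy)
```

On this toy input the two month names are merged first, then the two weekday names, and the loop stops because the two resulting items share no features. The merge history keeps the similarity at each step, which is the quantity section 4.3 compares when judging whether a merged item carries a single relation. The pairwise search is quadratic in the number of items per step, so the sketch is suited to small term sets rather than a full corpus vocabulary.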