<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Learning Relations using Collocations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Gerhard</forename><surname>Heyer</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Natural Language Processing Department</orgName>
								<orgName type="institution">Leipzig University Computer Science Institute</orgName>
								<address>
									<addrLine>Augustusplatz 10 / 11</addrLine>
									<postCode>D-04109</postCode>
									<settlement>Leipzig</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Martin</forename><surname>Läuter</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Natural Language Processing Department</orgName>
								<orgName type="institution">Leipzig University Computer Science Institute</orgName>
								<address>
									<addrLine>Augustusplatz 10 / 11</addrLine>
									<postCode>D-04109</postCode>
									<settlement>Leipzig</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Uwe</forename><surname>Quasthoff</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Natural Language Processing Department</orgName>
								<orgName type="institution">Leipzig University Computer Science Institute</orgName>
								<address>
									<addrLine>Augustusplatz 10 / 11</addrLine>
									<postCode>D-04109</postCode>
									<settlement>Leipzig</settlement>
								</address>
							</affiliation>
						</author>
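						<author>
							<persName><forename type="first">Thomas</forename><surname>Wittig</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Natural Language Processing Department</orgName>
								<orgName type="institution">Leipzig University Computer Science Institute</orgName>
								<address>
									<addrLine>Augustusplatz 10 / 11</addrLine>
									<postCode>D-04109</postCode>
									<settlement>Leipzig</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Christian</forename><surname>Wolff</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Natural Language Processing Department</orgName>
								<orgName type="institution">Leipzig University Computer Science Institute</orgName>
								<address>
									<addrLine>Augustusplatz 10 / 11</addrLine>
									<postCode>D-04109</postCode>
									<settlement>Leipzig</settlement>
								</address>
							</affiliation>
						</author>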
						<title level="a" type="main">Learning Relations using Collocations</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">08E0526D663762CE01647DCE88F6012B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the application of statistical analysis of large corpora to the problem of extracting semantic relations from unstructured text. We regard this approach as a viable method for generating input for the construction of ontologies as ontologies use well-defined semantic relations as building blocks (cf. van der Vet &amp; Mars 1998). Starting from a short description of our corpora as well as our language analysis tools, we discuss in depth the automatic generation of collocation sets. We further give examples of different types of relations that may be found in collocation sets for arbitrary terms. The central question we deal with here is how to postprocess statistically generated collocation sets in order to extract named relations. We show that for different types of relations like cohyponyms or instance-of-relations, different extraction methods as well as additional sources of information can be applied to the basic collocation sets in order to verify the existence of a specific type of semantic relation for a given set of terms.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Analysis of Large Text Corpora</head><p>Corpus Linguistics is generally understood as a branch of computational linguistics dealing with large text corpora for the purpose of statistical processing of language data (cf. <ref type="bibr">Armstrong 1993</ref><ref type="bibr" target="#b9">, Manning &amp; Schütze 1999)</ref>. With the availability of large text corpora and the success of robust corpus processing in the nineties, this approach has recently become increasingly popular among computational linguists (cf. <ref type="bibr" target="#b13">Sinclair 1991</ref><ref type="bibr">, Svartvik 1992)</ref>. Since 1995 a German text corpus of more than 300 million words has been collected (cf. <ref type="bibr">Quasthoff 1998B, Quasthoff &amp; Wolff 2000)</ref>, containing approx. 6 million different word forms in approx. 13 million sentences, which serves as input for the analysis methods described below. Similarly structured corpora have recently been set up for other European languages as well (English, French, Dutch), with more languages to follow in the near future (see table <ref type="table">1</ref>).</p><p>Table <ref type="table">1</ref>: Basic Characteristics of the Corpora</p><p>The basic goal of this corpus-based approach is to collect large amounts of textual data as input for semantic processing. Starting from a rather simple data model tailored for large amounts of data and efficient processing, and using a relational database system at the storage level, we employ a simple yet powerful technical infrastructure for processing texts to be included in the corpus. Besides basic procedures for integrating texts into the corpus, various tools have been developed for post-processing linguistic data. Among them, the automatic calculation of sentence-based word collocations stands out as an especially valuable tool for corpus-based language technology applications (see <ref type="bibr">Quasthoff 1998A, Quasthoff &amp; Wolff 2000)</ref>. Additional, application-oriented tools exist for search engine optimization as well as automatic document classification (see <ref type="bibr" target="#b5">Heyer, Quasthoff &amp; Wolff 2000)</ref>. The corpora are available on the WWW (http://www.wortschatz.uni-leipzig.de) and may be used as a large online dictionary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Collocations</head><p>The occurrence of two or more words within a well-defined unit of information (sentence, document) is called a collocation. For the selection of meaningful and significant collocations, an adequate collocation measure has to be defined. In the literature, quite a number of different collocation measures can be found; for an in-depth discussion of various collocation measures and their application cf. <ref type="bibr" target="#b14">Smadja 1993</ref><ref type="bibr" target="#b8">, Lemnitzer 1998</ref><ref type="bibr" target="#b6">, Krenn 2000</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">The Collocation Measure</head><p>In the following, our approach towards measuring the significance of the joint occurrence of two words A and B in a sentence is discussed. Let a and b be the number of sentences containing A and B, respectively, k be the number of sentences containing both A and B, and n be the total number of sentences.</p><p>Our significance measure calculates the probability of joint occurrence of rare events. The results of this measure are quite similar to the well-known log-likelihood measure (cf. Krenn 2000):</p><p>Let x = ab/n and define:</p><formula xml:id="formula_0">\mathrm{sig}(A, B) = \frac{-\log\left(1 - e^{-x} \sum_{i=0}^{k-1} \frac{x^{i}}{i!}\right)}{\log n}</formula><p>For 2x &lt; k, we get the following approximation, which is much easier to calculate:</p><p>sig(A, B) = (x − k log x + log k!) / log n</p><p>In the case of next neighbor collocations we replace the definition of the above variables by the following. Instead of a sentence we consider pairs (A, B) of words which are next neighbors in this sentence. Hence, instead of one sentence of n words we have n − 1 pairs. For right neighbor collocations (A, B) let a and b be the number of pairs of type (A, ?) and (?, B), respectively, k be the number of pairs (A, B), and n be the total number of pairs. This equals the total number of running words minus the number of sentences. Given these variables, the significance measure is calculated as shown above. In general, this measure yields semantically acceptable collocation sets for values above an empirically determined positive threshold (see examples in section 3 below).</p></div>
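<div xmlns="http://www.tei-c.org/ns/1.0"><p>The measure and its approximation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the corpora: the function names sig_exact and sig_approx are hypothetical, natural logarithms are used (the approximation requires them), and the exact variant sums the Poisson tail directly, which is algebraically equal to one minus the partial sum in the definition but avoids numerical cancellation.</p>
<p><![CDATA[
```python
import math


def sig_exact(a, b, k, n):
    """Significance of k joint occurrences given marginal counts a, b and n units.

    Evaluates -log(1 - e^-x * sum_{i<k} x^i / i!) / log n with x = a*b/n.
    The term inside the log equals the Poisson tail e^-x * sum_{i>=k} x^i / i!,
    which is summed directly here (assumes x > 0).
    """
    x = a * b / n
    # log of the first tail term e^-x * x^k / k!
    log_first = -x + k * math.log(x) - math.lgamma(k + 1)
    # remaining factor 1 + x/(k+1) + x^2/((k+1)(k+2)) + ...
    series, term, i = 1.0, 1.0, k
    while term > 1e-16 * series:
        i += 1
        term *= x / i
        series += term
    return -(log_first + math.log(series)) / math.log(n)


def sig_approx(a, b, k, n):
    """Approximation (x - k log x + log k!) / log n, intended for 2x < k."""
    x = a * b / n
    return (x - k * math.log(x) + math.lgamma(k + 1)) / math.log(n)
```
]]></p>
<p>For counts with 2x well below k the two functions agree closely, since the approximation keeps only the first term of the tail sum.</p></div>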
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Properties of the Collocation Measure</head><p>In order to describe basic properties of this measure, we write sig(n, k, a, b) for sig(A, B), with n, k, a, and b defined as above.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Finding Collocations</head><p>To calculate the collocation measure for all reasonable pairs, we first count the joint occurrences of each pair. This problem is complex both in time and storage. Nevertheless, we managed to calculate the collocation measure for any pair with a total frequency of at least 3 for each component. Our approach is based on extensible ternary search trees (cf. <ref type="bibr" target="#b1">Bentley &amp; Sedgewick 1998)</ref> where a count can be associated with a pair of word numbers. The memory overhead of the original implementation could be reduced by allocating the space for chunks of 100,000 nodes at once. Even when using this technique on a large-memory computer, more than one run through the corpus may be necessary, taking care that every pair is counted only once. The resulting word pairs above a threshold significance are put into a database where they can be accessed and grouped in many different ways. As collocations are calculated for different language corpora, our examples will be taken from the English as well as the German database.</p></div>
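<div xmlns="http://www.tei-c.org/ns/1.0"><p>The counting step can be illustrated with a small Python sketch. It deliberately swaps in an in-memory hash map (collections.Counter) for the ternary search trees described above, so it only suits small corpora. The function name count_cooccurrences and the parameter min_freq are assumptions, and the frequency threshold of 3 is interpreted here as the number of sentences a word occurs in.</p>
<p><![CDATA[
```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(sentences, min_freq=3):
    """Count sentence-level co-occurrences of word pairs.

    sentences: iterable of token lists.
    Returns (word_counts, pair_counts, n) where n is the number of sentences.
    Each unordered pair is counted at most once per sentence and is stored
    as a tuple in alphabetical order.
    """
    word_counts = Counter()
    per_sentence_types = []
    for tokens in sentences:
        types = sorted(set(tokens))      # every word counts once per sentence
        word_counts.update(types)
        per_sentence_types.append(types)

    pair_counts = Counter()
    for types in per_sentence_types:
        # keep only words that are frequent enough in the whole corpus
        frequent = [w for w in types if word_counts[w] >= min_freq]
        pair_counts.update(combinations(frequent, 2))
    return word_counts, pair_counts, len(per_sentence_types)
```
]]></p>
<p>The resulting counts correspond to a, b, k and n of section 2.1 and can be passed to a significance function before thresholding.</p></div>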
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Visualization of Collocations</head><p>Besides textual output of collocation sets, visualizing them as graphs is an additional type of representation: We choose a word and arrange its collocates in the plane so that collocations between collocates are taken into account. This results in graphs that show homogeneity where words are interconnected and separation where collocates have little in common. Linguistically speaking, polysemy is made visible (see fig. <ref type="figure">1</ref> below).</p><p>Technically speaking, we use simulated annealing to position the words (see <ref type="bibr" target="#b3">Davidson &amp; Harel 1996)</ref>. Line thickness represents the significance of the collocation. All words in the graph are, of course, linked to the central word; the rest of the picture is computed automatically, yet it represents semantic connectedness surprisingly well.</p><p>Unfortunately the relations between the words are just presented, but not yet named. Fig. <ref type="figure">1</ref> shows the collocation graph for space. Three different meaning contexts can be recognized in the graph:</p><p>• real estate,</p><p>• computer hardware, and</p><p>• astronautics.</p><p>The connection between address and memory results from the fact that address is another polysemous concept.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 1: Collocation Graph for space</head></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Relations Represented by Collocations</head><p>If we fix one word and look at its set of collocates, then some semantic relations appear more often than others.</p><p>The following example shows the most significant collocations for king, ordered by significance:</p><p>queen (90), mackerel (83), hill (49), Milken (47), royal (44), monarch (33), King (30), crowned (30), migratory (30), rook (29), throne (29), Jordanian (26), Hussein (25), Saudi (25), monarchy (25), crab (23), Jordan (22), Lekhanya (21), Prince (21), Michael (20), Jordan's (19), palace (19), undisputed (18), Elvis (17), Shah (17), deposed (17), Panchayat (16), Zahir (16), fishery (16), former (16), junk (16), constitution (15), exiled (15), Bhattarai (14), Presley (14), Queen (14), crown (14), dethroned (14), him (14), Arab (13), Moshoeshoe (13), himself (13), pawns (13), reigning (13), Fahd (12), Nepali (12), Rome (12), Saddam (12), once (12), pawn (12), prince (12), reign (12), [...] government (10) [...]</p><p>The following types of relations can be identified:</p><p>• Cohyponymy (e. g. Shah, queen, rook, pawn),</p><p>• top-level syntactic relations, which translate to semantic 'actor-verb' relations and frequently used properties of a noun (reign; royal, crowned, dethroned),</p><p>• instance-of (Fahd, Hussein, Moshoeshoe),</p><p>• special relations given by multiwords (A prep/det/conj B, e. g. king of Jordan), and</p><p>• unstructured sets of words describing some subject area, e. g. constitution, government.</p><p>Note that synonymy rarely occurs in the lists. The relations may be classified according to the properties symmetry, anti-symmetry, and transitivity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Symmetric Relations</head><p>Let us call a relation r symmetric if r(A, B) always implies r(B, A). Examples of symmetric relations are</p><p>• synonymy,</p><p>• cohyponymy (or similarity),</p><p>• elements of a certain subject area, and</p><p>• relations of unknown type.</p><p>Usually, sentence collocations express symmetric relations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Anti-symmetric Relations</head><p>Let us call a relation r anti-symmetric if r(A, B) never implies r(B, A). Examples of anti-symmetric relations are</p><p>• hyponymy and</p><p>• relations between properties and their owners like action and actor or class and instance.</p><p>Usually, next neighbor collocations of two words express anti-symmetric relations. In the case of next neighbor collocations consisting of more than two words (like A prep/det/conj B, e. g. Samson and Delilah), the relation might be symmetric, for instance in the case of conjunctions like and or or (cf. <ref type="bibr" target="#b7">Läuter &amp; Quasthoff 1999)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Transitivity</head><p>Transitivity of a relation means that r(A, B) and r(B, C) always imply r(A, C). In general, a relation found experimentally will of course not be transitive, but there may be a part of it where transitivity holds. Some of the most prominent transitive relations are the cohyponymy, hyponymy, synonymy, and is-a relations. Note that our graphical representation mainly shows transitive relations by construction. This kind of relation is also able to give further results in the combination procedures described below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Other Sources for Relations</head><p>While we may intellectually identify types of semantic relations in collocation sets, additional information and/or analysis is needed for automatically naming these relations. In the following, we give different examples for such complementary information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Pattern Based Relations</head><p>The applicability of extraction patterns may heavily depend on language characteristics like preposition usage. This type of extraction method is simple and well known; in our approach it is combined with collocation analysis, thus yielding better results both in quality and in quantity (see section 5).</p></div>
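<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of pattern-based extraction, the instance-of pattern "(class name) like ?" given in this paper can be sketched with a regular expression. This is a rough sketch under stated assumptions: the names ITEM and extract_instances are hypothetical, each instance is assumed to be a single word with an optional article, and lists are assumed to be joined by commas and "and". Real text would need the classified term lists mentioned in the paper to filter false matches.</p>
<p><![CDATA[
```python
import re

# One list item: an optional article followed by a single word (a simplification).
ITEM = r"(?:the )?\w+"


def extract_instances(text, class_name):
    """Return candidate instance names found after '<class_name> like'."""
    pattern = (
        re.escape(class_name)
        + r" like ("
        + ITEM
        + r"(?:(?:, | and )"
        + ITEM
        + r")*)"
    )
    match = re.search(pattern, text)
    if match is None:
        return []
    parts = re.split(r", | and ", match.group(1))
    # drop a leading article so that 'the Ganges' yields 'Ganges'
    return [p[4:] if p.startswith("the ") else p for p in parts]
```
]]></p>
<p>Because the pattern is greedy, a phrase such as "metals like nickel and they" would wrongly yield "they"; combining the candidates with collocation evidence, as discussed in section 5, is one way to filter such cases.</p></div>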
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Compounds</head><p>German compounds consist of two (or more) words glued together by varying mechanisms. The head word (coming second) is further determined by the first part of the compound (modifier), which may originally be an adjective, another noun or a verb stem. In almost all cases a semantic relation between both parts and the compound can be found. In section 5.3 we show how the combination of compound segmentation with collocation analysis can be used for identifying named relations in compounds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Feature Vectors Given by Collocations and Clustering</head><p>To investigate the meaning of a word A, its contexts in the texts have to be examined because they reflect the use of A. If two words A and B have similar contexts, that is, they are alike in their use, this indicates that there is a semantic relation of some kind between A and B. A kind of average context for every word A is formed by all collocations for A with a significance above a certain threshold. This average context of A is transformed into a feature vector of A using, as usual, all words as features. This results in sparse description vectors. The feature vector of word A is indeed a description of the meaning of A, because the most important words of the contexts of A are included. Clustering of feature vectors can be used to investigate the relations between a group of similar words and to figure out whether or not all the relations are of the same kind.</p><p>The following HACM algorithm has an additional natural reason to stop. It works bottom up like this:</p><p>• All words are treated as (basic) items. Each item has a description (feature vector).</p><p>• In each step of the clustering process the two items A and B with the most similar description vectors are searched for and fitted together to create a new complex item C combining the words in A and B. The scalar product is used for determining similarity between vectors. Each step of the clustering algorithm reduces the number of items by one.</p><p>• The feature vector for C is constructed from the feature vectors of A and B. To this end we calculate a combined significance for C with respect to all words X_i as follows:</p><formula xml:id="formula_2">\mathrm{sig}(C, X_i) = \frac{n_a}{n_a + n_b}\,\mathrm{sig}(A, X_i) + \frac{n_b}{n_a + n_b}\,\mathrm{sig}(B, X_i)</formula><p>for all i, 1 ≤ i ≤ n, where n is the total number of words in the corpus, n_a the number of words combined in item A, and n_b the number of words combined in item B.</p><p>• The algorithm stops if only one item is left or if all remaining feature vectors are orthogonal.</p><p>This usually results in a very natural clustering if the threshold for constructing the feature vectors is suitably chosen. A cluster of words with probably the same semantic relation between each of them can be found in the analysis tree by comparing the similarity between items inside the items A and B (if these items are complex) with the calculated similarity between A and B when fitting them together to C. If there is a large difference between them, this is an indication of a different relation between the words combined in item A and the words combined in item B. In the appendix, some examples for this type of semantic clustering are given.</p></div>
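<div xmlns="http://www.tei-c.org/ns/1.0"><p>The bottom-up clustering steps above can be sketched in Python as follows. This is a simplified illustration, not the original implementation: sparse feature vectors are represented as dictionaries from feature word to significance, the names dot, merge and cluster are assumptions, significances are assumed to be non-negative (so a scalar product of zero means orthogonal vectors), and the quadratic search for the best pair is only practical for small word sets.</p>
<p><![CDATA[
```python
def dot(u, v):
    """Scalar product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[f] for f, w in u.items() if f in v)


def merge(u, n_u, v, n_v):
    """Weighted average of two feature vectors; weights are the item sizes."""
    total = n_u + n_v
    return {
        f: (n_u * u.get(f, 0.0) + n_v * v.get(f, 0.0)) / total
        for f in set(u) | set(v)
    }


def cluster(vectors):
    """Agglomerative clustering of words by their sparse feature vectors.

    vectors: dict mapping each word to its feature dict.
    Returns (merges, clusters): the merge history as
    (members_a, members_b, similarity) and the remaining items.
    """
    items = [((word,), vec) for word, vec in vectors.items()]
    merges = []
    while len(items) > 1:
        # find the pair of items with the largest scalar product
        sim, i, j = max(
            ((dot(items[i][1], items[j][1]), i, j)
             for i in range(len(items))
             for j in range(i + 1, len(items))),
            key=lambda t: t[0],
        )
        if sim <= 0:  # all remaining vectors are orthogonal: natural stop
            break
        (members_a, vec_a), (members_b, vec_b) = items[i], items[j]
        merged = (members_a + members_b,
                  merge(vec_a, len(members_a), vec_b, len(members_b)))
        items = [it for idx, it in enumerate(items) if idx not in (i, j)]
        items.append(merged)
        merges.append((members_a, members_b, sim))
    return merges, [members for members, _ in items]
```
]]></p>
<p>The similarity recorded at each merge is what the comparison described above would inspect: a sharp drop between the similarity inside A or B and the similarity between A and B hints at a different relation.</p></div>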
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Symmetric clustering</head><p>If we assume that a cluster represents a semantic relation, the cluster should represent the possible symmetry and transitivity of the underlying semantic relation. Symmetry and transitivity ensure that the terms to be clustered will themselves be responsible for the clustering. This in turn implies that the terms found in the cluster will also be found in the feature vector in prominent positions.</p><p>In example 1 (Appendix) the clustering result for January is shown. In the first column we find the terms to be clustered, on the right hand side there are the components of the feature vectors ordered by significance. The clustered items both appear together and share a certain aspect. The names of the months or weekdays as names for periods of time cluster together, just because they are collocates with one another. The same can be shown to be true for teammates, metals, colors or fruit.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Anti-symmetric clustering</head><p>For anti-symmetric relations the situation is different. Again the elements of the original set to be clustered share a certain aspect, but this aspect is described by a distinct set of words. Presumably this second set of words will also cluster. Moreover, it will use the original set as clustering terms. This is shown in example 2 (Appendix). Here we show that the set given by Präsident, Vorsitzender, Vorsitzende, Sprecher, Sprecherin properly clusters using words like sagte, erklärte, teilte (German verbs of utterance). Conversely, in example 3 (Appendix) we find that the set verwies, mitteilte, meinte, bestätigte, betonte properly clusters using terms from the above cluster.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Homogeneous Relations: Iterating the Collocation Process</head><p>The extraction of collocation sets from plain text can be viewed as some kind of information condensation. This process can be iterated if collocation sets themselves are subjected to the collocation analysis again and again. We might expect that some of the collocational relations are strengthened while others will vanish from the iterated sets of collocations which we will call higher order collocations. We describe two experiments for the iteration process: Instead of plain text we start with collocation sets, using sentence collocations for experiment 1 and next neighbor collocations for experiment 2. In the case of a symmetric relation we observe a strengthening while iterating sentence collocations. In the case of an antisymmetric relation we observe the same when iterating next neighbor collocations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiment 1: Iterating Sentence Collocations</head><p>The production of collocations is applied to sets of sentence collocations instead of sentences. E.g., the collection of 500,000 sentence collocations has the following 'sentence' (collocation set) for Hemd (shirt): </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Combining Non-contradictory Partial Results</head><p>In section 3 we have given evidence that collocation sets contain various types of semantic relations without explicitly naming them, while section 4 has introduced a number of methods for relation extraction. This section shows different ways of combining the results of these extraction approaches. Such combinations yield more and/or better results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Identical Results</head><p>Two or more of the above algorithms may suggest a certain relation between two words, for instance, cohyponymy.</p><p>Example: If both the second order collocations introduced in section 4.4, and clustering by feature vectors (section 4.3) independently yield similar sets of words as a result, this may be taken as an indication of cohyponymy between the words, e. g. sagte, betonte, kündigte, wies, nannte, warnte, bekräftigte, meinte […] (German verbs of utterance).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Supporting Second Results</head><p>In the second combination type a known relation given by one method of extraction is verified by an identical but unnamed second result as follows:</p><p>Result 1: In these examples, result 1 is not enough because there are collocations like Woche auf dem Tisch which do not describe a meaningful semantic relation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Combining Three Results</head><p>Result 1: There is relation r between A and B Result 2: B is similar to B' (cohyponymy) Result 3: There is some strong but unknown relation between A and B' Conclusion: There is a relation r between A and B'</p><p>Example: As result 1 we might know that Schwanz (tail) is part of Pferd (horse). Similar terms to Pferd are both Kuh (cow) and Hund (dog) (result 2). Both of them have the term Schwanz in their set of significant collocations (result 3). Hence we might correctly conjecture that both Kuh and Hund have a tail (Schwanz) as part of their body. In contrast, Reiter (rider) is a strong collocation to Pferd and might (incorrectly) be conjectured to be another similar concept, but Reiter is no collocation with respect to Schwanz. Hence, the absence of result 3 prevents us from making an incorrect conclusion. </p></div>
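<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three-result combination rule can be written down as a short Python sketch. The function name infer_relation and the data layout (a set of known pairs, a dictionary of similar terms and a dictionary of collocation sets) are assumptions made for illustration; the toy data in the usage note only restates the Schwanz example from the text.</p>
<p><![CDATA[
```python
def infer_relation(known, similar, collocates):
    """Propagate a named relation r along cohyponymy.

    known:      set of pairs (a, b) for which r(a, b) is known (result 1)
    similar:    dict mapping b to a set of candidate cohyponyms b2 (result 2)
    collocates: dict mapping a word to its set of significant collocates
                (result 3 holds if a is a collocate of b2)
    Returns the set of newly conjectured pairs (a, b2).
    """
    inferred = set()
    for a, b in known:
        for b2 in similar.get(b, ()):
            # only conclude r(a, b2) if a strong relation between a and b2 exists
            if a in collocates.get(b2, ()) and (a, b2) not in known:
                inferred.add((a, b2))
    return inferred
```
]]></p>
<p>With known = {("Schwanz", "Pferd")}, similar = {"Pferd": {"Kuh", "Hund", "Reiter"}} and collocation sets in which only Kuh and Hund contain Schwanz, the function conjectures the relation for Kuh and Hund but not for Reiter.</p></div>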
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Similarity Used to Infer a Strong Property</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Subject Area Inferred from Collocation Sets</head><p>Result 1: A, B, C, ... are collocates of a certain term.</p><p>Result 2: Some of them belong to a certain subject area.</p><p>Conclusion: All of them belong to this subject area.</p><p>Example: Consider the following top entries in the collocation set of carcinoma: patients, cell, squamous, radiotherapy, lung, thyroid, treated, hepatocellular, metastases, adenocarcinoma, cervix, irradiation, breast, treatment, CT, therapy, renal, cases, bladder, cervical, tumor, cancer, metastatic, radiation, uterine, ovarian, chemotherapy, […]</p><p>If we know that some of them belong to the subject area Medicine, we can add this subject area to the other members of the collocation set as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper, we described different approaches for the extraction of named semantic relations from large text corpora. The types of relations are compatible with relations typically used for constructing ontologies (cf. <ref type="bibr">Chandrasekaran 1999:22)</ref>. The combination of different types of input information as well as the application of robust statistical analysis methods guarantees that this approach may be applied to texts from arbitrary domains and different languages. In particular, our results may be used for the automatic generation of semantic relations in order to fill and expand ontology hierarchies.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Simple pattern-based relations can be extracted from text if knowledge about information categories like proper names is used as input. As our corpora include several large lists of classified terms like names of professions and last names, extraction rules may be defined:</figDesc><table><row><cell>i. Extraction of first names: A pattern like (profession) ? (last name) implies (with high probability) that the unknown category ? is in fact a first name. Examples are:</cell></row><row><cell>actress Julia Roberts</cell></row><row><cell>hockey hero Wayne Gretzky</cell></row><row><cell>Senator Jesse Helms</cell></row><row><cell>ii. Extraction of instance-of-relations given the class name: The pattern (class name) like ? implies (with high probability) that the unknown category ? is in fact an instance name. Examples are:</cell></row><row><cell>metals like nickel, arsenic and lead</cell></row><row><cell>rivers like the Ganges</cell></row><row><cell>newspapers like Pravda</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head></head><label></label><figDesc>Result 1: There is a certain relation r between A and B. Result 2: There is some strong (but unknown) relation between A and B (e. g. given by a collocation set). Conclusion: Result 1 holds with more evidence. One can use this support of orthogonal tests in many ways: Without knowing anything about deeper language structure or parsing, we can filter out verbs just by testing whether a string accepts at least two of the endings -(e)s, -ing and -ed/t. The recall is remarkably high. In German we tested only one mechanism of noun formation from a verb and got 70% of all verbs with a precision of 83%.</figDesc><table><row><cell>Word formation mechanisms can be explored further. In German, compound nouns are joined together to form one word. There are several (highly irregular) patterns of gluing letters between the words. Testing all available word tokens whether they could be the compound of two stemmed words from word lists of 93,000 current nouns reveals just under a million compounds in their stemmed form. Here stemming accuracy is supported by the existence of both compounds in the basic list. When eliminating a hundred words which are prone to generate wrong separations, this algorithm achieves an accuracy of 90%.</cell></row><row><cell>Example:</cell></row><row><cell>Result 2: The German compound Entschädigungsgesetz can be divided into Gesetz and Entschädigung with an unknown relation.</cell></row><row><cell>Result 1 is given by the four-word next neighbor collocation Gesetz über die Entschädigung. Similarly, Stundenkilometer is analyzed as Kilometer pro Stunde.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Similarity Used to Infer a Strong Property</head><label></label><figDesc>Let us call a property p important if it is preserved under similarity. This strong feature can be used as follows: Result 1: A has a certain important property p. Result 2: B is similar to A (i. e., B is a cohyponym of A). Conclusion: B has the same property p. Example: We consider A and B as similar if they are in the set of right neighbor collocations of Hafenstadt (port town) (result 2). If we know that Hafenstadt is a property of its typical right neighbors (result 1), we may infer this property for more than 200 cities like Split, Sidon, Durban, Kismayo, Tyrus, Vlora, Karachi, Durres, […].</figDesc><table /></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0" />			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Using Large Corpora</title>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<editor>Armstrong, S.</editor>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">1/2</biblScope>
			<date type="published" when="1993">1993</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
	<note>Special Issue on Corpus Processing, repr. MIT Press 1994</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Ternary Search Trees</title>
		<author>
			<persName><forename type="first">J</forename><surname>Bentley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Sedgewick</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Dr. Dobb's Journal</title>
		<imprint>
			<date type="published" when="1998-04">April 1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">What are Ontologies, and Why Do We Need Them?</title>
		<author>
			<persName><forename type="first">B</forename><surname>Chandrasekaran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Intelligent Systems</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="20" to="26" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Drawing Graphs Nicely Using Simulated Annealing</title>
		<author>
			<persName><forename type="first">R</forename><surname>Davidson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Harel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Graphics</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="301" to="331" />
			<date type="published" when="1996">1996</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Frequency Analysis of English Usage</title>
		<author>
			<persName><forename type="first">W</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kucera</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1982">1982</date>
			<publisher>Houghton Mifflin</publisher>
			<pubPlace>Boston</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Aiding Web Searches by Statistical Classification Tools</title>
		<author>
			<persName><forename type="first">G</forename><surname>Heyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Quasthoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ch</forename><surname>Wolff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Informationskompetenz - Basiskompetenz in der Informationsgesellschaft. Proc. 7. Intern. Symposium f. Informationswissenschaft, ISI 2000</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Knorz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Kuhlen</surname></persName>
		</editor>
		<meeting><address><addrLine>Darmstadt</addrLine></address></meeting>
		<imprint>
			<publisher>UVK</publisher>
			<pubPlace>Konstanz</pubPlace>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="163" to="177" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Distributional and Linguistic Implications of Collocation Identification</title>
		<author>
			<persName><forename type="first">B</forename><surname>Krenn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Collocations Workshop, DGfS Conference</title>
				<meeting>Collocations Workshop, DGfS Conference<address><addrLine>Marburg</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2000-03">March 2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Kollokationen und semantisches Clustering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Läuter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Quasthoff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Multilinguale Corpora. Codierung, Strukturierung, Analyse. Proc. 11. GLDV-Jahrestagung</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Gippert</surname></persName>
		</editor>
		<imprint>
			<publisher>Enigma Corporation</publisher>
			<pubPlace>Prague</pubPlace>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="34" to="41" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Komplexe lexikalische Einheiten in Text und Lexikon</title>
		<author>
			<persName><forename type="first">L</forename><surname>Lemnitzer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Linguistik und neue Medien</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Heyer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ch</forename><surname>Wolff</surname></persName>
		</editor>
		<imprint>
			<publisher>Dt. Universitätsverlag</publisher>
			<pubPlace>Wiesbaden</pubPlace>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="85" to="91" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Foundations of Statistical Natural Language Processing</title>
		<author>
			<persName><forename type="first">Ch</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<publisher>The MIT Press</publisher>
			<pubPlace>Cambridge/MA, London</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Tools for Automatic Lexicon Maintenance: Acquisition, Error Correction, and the Generation of Missing Values</title>
		<author>
			<persName><forename type="first">U</forename><surname>Quasthoff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. First International Conference on Language Resources &amp; Evaluation [LREC]</title>
		<meeting>First International Conference on Language Resources &amp; Evaluation [LREC]<address><addrLine>Granada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1998-05">May 1998</date>
			<biblScope unit="volume">II</biblScope>
			<biblScope unit="page" from="853" to="856" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Projekt der deutsche Wortschatz</title>
		<author>
			<persName><forename type="first">U</forename><surname>Quasthoff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Linguistik und neue Medien</title>
				<editor>
			<persName><forename type="first">G</forename><surname>Heyer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ch</forename><surname>Wolff</surname></persName>
		</editor>
		<imprint>
			<publisher>Dt. Universitätsverlag</publisher>
			<pubPlace>Wiesbaden</pubPlace>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="93" to="99" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">An Infrastructure for Corpus-Based Monolingual Dictionaries</title>
		<author>
			<persName><forename type="first">U</forename><surname>Quasthoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ch</forename><surname>Wolff</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. LREC-2000. Second International Conference On Language Resources and Evaluation</title>
				<meeting>LREC-2000. Second International Conference On Language Resources and Evaluation<address><addrLine>Athens</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2000-05">May/June 2000</date>
			<biblScope unit="volume">I</biblScope>
			<biblScope unit="page" from="241" to="246" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<title level="m" type="main">Corpus, Concordance, Collocation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sinclair</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1991">1991</date>
			<publisher>Oxford University Press</publisher>
			<pubPlace>Oxford</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Retrieving Collocations from Text: Xtract</title>
		<author>
			<persName><forename type="first">F</forename><surname>Smadja</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="143" to="177" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Directions in Corpus Linguistics</title>
	</analytic>
	<monogr>
		<title level="m">Proc. Nobel Symposium 82</title>
				<editor>
			<persName><forename type="first">J</forename><surname>Svartvik</surname></persName>
		</editor>
		<meeting>Nobel Symposium 82<address><addrLine>Stockholm</addrLine></address><date when="1991-08">4-8 August 1991</date></meeting>
		<imprint>
			<publisher>Mouton de Gruyter</publisher>
			<pubPlace>Berlin</pubPlace>
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
	<note>= Trends in Linguistics 65</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Bottom-Up Construction of Ontologies</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">E</forename><surname>van der Vet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J I</forename><surname>Mars</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Knowledge and Data Engineering</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="513" to="526" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
