<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Background Knowledge, Indexing and Matching - Interdependencies of Document Management and Ontology-Maintenance</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Faatz</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Thomas</forename><surname>Kamps</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Ralf</forename><surname>Steinmetz</surname></persName>
						</author>
						<title level="a" type="main">Background Knowledge, Indexing and Matching - Interdependencies of Document Management and Ontology-Maintenance</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E24F50F29EDA5A7B9B106D0AEAA0C076</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:26+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This position paper presents an algorithm which determines similarities between text documents. These text documents are indexed with keywords and with further background-knowledge terms from an ontology. The representation of the documents and the evaluation of the algorithm are used to let the ontology learn. This is shown to be one way of improving the results of the algorithm by improving the background knowledge.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Consider a human being reading texts from domains which are, to a certain extent, familiar to him or her. The reader is capable of grasping the semantics of the text documents. Even if the person is not an expert in any of the domains described in the texts, a minimal comment we expect him or her to make is whether two texts are similar or not. This kind of judgement also covers text documents which are similar although they contain completely different vocabularies or share just a few common terms. Similarities are a part of the intellectual construction of reality <ref type="bibr" target="#b4">[5]</ref> and are generated by the words and phrases the human mind associates with the actual text.</p><p>In a business application, grouping documents by their similarity is subject to restrictions: the job has to be done fast, for instance when managing the continuous flow of short messages coming in to the editors of a newspaper. Moreover, the document base in use by the newspaper is too large for an editor to retrieve all similar texts in time.</p><p>We apply the above situation to a computer instead of a human reader. Our goal is to express similarities of text documents detected by an algorithm. Hence a semantic matching problem is to be solved. The associations and heuristics recognizing similarities beyond equalities of character strings have to be modeled somehow; otherwise we are restricted to plain full text retrieval <ref type="bibr" target="#b9">[10]</ref>, like many of the web-based search engines taking HTML as input.</p><p>This paper makes some propositions about a process in which an algorithm obtains a similarity value from a pair of text documents. Before we describe the algorithm, we take a brief look at how the documents first have to be made readable to the algorithm and in which fashion background knowledge adds further information to the matching process. Then we explain the algorithm: its way of matching documents and the parameters it needs. Finally we give some hints concerning the evaluation and improvement of the algorithm. This is the point where the background knowledge is affected by our results, and we will distinguish objective and subjective influences on it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PREPROCESSING THE DATA</head><p>We consider a corpus of short text documents to be given. Every document D is attached to a vector v(D) including a description of its contents. The vector is the result of abstracting a text into descriptors; this can be done either by a knowledge worker or, keeping in mind the constraints from the business application we referred to in the introduction, by automatic indexing <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b8">9]</ref>. Note that our approach only works in the case of a controlled vocabulary of descriptors. Furthermore, we discuss a type of background knowledge meeting the requirements of an ontology.</p><p>To keep our discourse comprehensible, we define an ontology to be a set of terms and their relationships. An example of building such an ontology in an object-oriented fashion can be found in <ref type="bibr" target="#b7">[8]</ref>; for diverse definitions of an ontology we refer to <ref type="bibr" target="#b10">[11]</ref>.</p><p>To be precise, the possible vector entries (index terms) in v(D) must represent a controlled vocabulary V to keep them computer-readable and comparable. The index terms of the vocabulary V are exactly the concepts of a predefined ontology, connected by the ontological relations. The relations we work with are typed semantic ones such as 'is subconcept of', 'is differential of' or 'is associated with'. As an example of an index vector, imagine a text document D describing the German chancellor Schröder visiting the U.S., where he meets President Clinton and argues with him about the chair of the IMF. The vectorial representation V(D) is: </p><formula xml:id="formula_0">V(D)</formula></div>
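The four keyword collections of such an index vector can be sketched as a small data structure. The following Python snippet is a minimal, hypothetical representation (the paper does not prescribe one), populated with the Schröder/Clinton example from the figure:

```python
# Hypothetical sketch of an index vector V(D): four collections of
# keywords, all drawn from the controlled vocabulary of the ontology.
# The dict layout is an illustrative assumption; lists (not sets) are
# used because repetitions of keywords are allowed and are intended
# to strengthen a keyword's importance.
V_D = {
    "THEMES": ["German foreign policy", "Gerhard Schröder", "IMF"],
    "INDIVIDUAL KEYWORDS": ["Gerhard Schröder", "Bill Clinton",
                            "German government", "U.S. government",
                            "IMF", "Caio Koch-Weser"],
    "THEMATICAL BACKGROUND-KNOWLEDGE": ["Germany", "German government",
                                        "SPD",
                                        "international organizations",
                                        "foreign policy"],
    "INDIVIDUAL BACKGROUND-KNOWLEDGE": ["German government",
                                        "U.S. government",
                                        "international organizations",
                                        "USA", "Germany"],
}
```

The background-knowledge collections are derived from the ontology: the thematic one expands the THEMES entries, the individual one expands the INDIVIDUAL KEYWORDS.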
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">SEMANTIC MATCHING 3.1 The algorithm</head><p>In contrast to classical full text retrieval technology, our method provides more structure. As seen in the last section, we include background knowledge, which delivers more than synonyms. A first version of the matching algorithm performs a type of overlap measurement on the entries of a pair of vectors. We named the measure 'frequency' because of the way its functionality was implemented in the Smalltalk programming language.</p><p>Let us define the frequency measure of the similarity of two sets of words as the number of words appearing in both sets (whereby every repetition of a word is extra-counted) divided by the total number of words. An example: (sun, sun, rain) and (sun, sun, snow) have the frequency 4/6.</p><p>The output S(Q,P) of the matching algorithm is the similarity of a pair of documents. In fact it is a weighted sum of similarities S(a,f),...,S(d,i), where a,...,d are the collections of keywords (i.e. the vectorial entries) from the first index vector V(P) and f,...,i are the collections of keywords from the second vector V(Q). We assume the operation on S(a,f),...,S(d,i) to be linear, which means that a linear regression is able to estimate the participating weights t,u,v,w. An estimation is necessary because we do not know anything about the contribution of each single similarity to the whole. We summarize</p><formula xml:id="formula_1">S(Q,P)=tS(a,f)+uS(b,g)+vS(c,h)+wS(d,i) (1)</formula><p>with the t,u,v,w to be estimated.</p><p>How do we get these weights? We take a collection of pairs like (P,Q), in our case a sample of size 50, and leave it up to a human to assign the respective similarities S(Q,P). The rest is done via a multi-linear regression, minimizing sums of squared errors analogously to the well-known linear regression approach.</p></div>
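The frequency measure and the weight estimation can be sketched as follows, assuming Python in place of the original Smalltalk implementation. The rating data is synthetic and stands in for the human judgements of the 50-pair sample:

```python
from collections import Counter
import numpy as np

def frequency(words_a, words_b):
    """Overlap measure: occurrences of words appearing in both
    collections (repetitions extra-counted) divided by the total
    number of words."""
    ca, cb = Counter(words_a), Counter(words_b)
    overlap = sum(ca[w] + cb[w] for w in set(ca) & set(cb))
    return overlap / (len(words_a) + len(words_b))

# The paper's example: (sun, sun, rain) vs. (sun, sun, snow) -> 4/6
print(frequency(["sun", "sun", "rain"], ["sun", "sun", "snow"]))

# Estimating the weights t,u,v,w of equation (1) by least squares.
# X holds one row per rated document pair with the four component
# similarities S(a,f), S(b,g), S(c,h), S(d,i); y holds the human
# similarity judgements (a sample of 50 pairs in the paper).
rng = np.random.default_rng(0)
X = rng.random((50, 4))                    # illustrative stand-in data
true_w = np.array([0.4, 0.3, 0.2, 0.1])    # assumed 'true' weights
y = X @ true_w + rng.normal(0, 0.01, 50)   # synthetic human ratings
weights, *_ = np.linalg.lstsq(X, y, rcond=None)
print(weights)
```

The recovered `weights` approximate t,u,v,w; with real data, y would come from the testing persons rather than a generative model.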
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Improvement by feedback</head><p>Actually, the following ideas are independent of the estimation of the weights t,u,v,w itself. Let us return to the environment in which the regression was implemented. We already explained that the indexing producing the vectors V(D) strongly depends on how far the ontology is developed. Thus the latter fact also has a qualitative impact on the results of the matching algorithm. We focus on improving the algorithm by improving the ontology.</p><p>First, a sub-optimal<ref type="foot" target="#foot_3">1</ref> approach for judging an S(Q,P) is taking as the value of similarity the percentage of positive answers (given by testing persons) to the question whether Q and P are similar. From now on we apply a way of grouping keywords which is inspired by <ref type="bibr" target="#b2">[3]</ref>, where the authors themselves proposed to include background knowledge in their work. We make use of the 'interestingness' measure. We want to group keywords, as the clusters with a high rate of interestingness should give hints concerning semantic relations between their members. The exact semantics then have to be added by a human.</p><p>Let us define the interestingness <ref type="bibr" target="#b1">[2]</ref> of a set of keywords appearing in the same text document as the ratio of the probability of the set of keywords occurring together to the product of the probabilities of occurrence of the single keywords.</p><p>Two starting points for structuring the documents before extracting interesting clusters, a subjective and an objective one, shall finish our reasoning. A subjective pre-grouping follows from what the testing persons perceive as similar: we only consider clusters of keywords carrying a high average interestingness within a collection C of similar documents. To find C, we must also cluster the documents.</p><p>On the other hand, an objective pre-grouping is introduced by defining C via the thematic entries and clustering with respect to the theme. By objectivity we denote, in this case, selecting a structure given by the themes from the ontology. Here, a theme might consist of several keywords.</p><p>The last step is to present the interesting collections of keywords resulting from either grouping to an ontology engineer and to let him or her decide whether the ontology might be improved by adding the relations he or she associates with the interesting groups of keywords. Note that our approach deals with strictly supervised learning.</p></div>
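The interestingness measure can be sketched as follows, assuming each document is given as a set of its keywords (a simplification that ignores repetitions); the corpus and keyword names are invented for illustration:

```python
def interestingness(keywords, documents):
    """Ratio of the probability that all keywords co-occur in a
    document to the product of the single-keyword occurrence
    probabilities (the correlation measure of Brin et al. [2]).
    A value well above 1 hints at a semantic relation."""
    n = len(documents)
    p_joint = sum(keywords <= doc for doc in documents) / n
    p_product = 1.0
    for k in keywords:
        p_product *= sum(k in doc for doc in documents) / n
    return p_joint / p_product if p_product > 0 else 0.0

# Invented toy corpus of keyword sets:
docs = [{"tax", "reform"}, {"tax", "reform"},
        {"tax", "budget"}, {"sports"}]
# P({tax, reform}) = 0.5; P(tax) * P(reform) = 0.75 * 0.5 = 0.375
print(interestingness({"tax", "reform"}, docs))  # 4/3, above 1
```

Clusters scoring high on this measure would be shown to the ontology engineer, who decides whether a relation should be added.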
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONCLUSIONS</head><p>From our rather optimistic point of view there clearly exist ideas for attaining at least clues for maintaining an ontology by reusing the output and evaluation of a matching algorithm. The feedback of such an algorithm is thus a human contribution to machine learning: detecting related keywords which do not yet have a relation in the ontology. Of course the algorithm using background knowledge has to prove its strength, not only in matching documents, but also in the case of a growing ontology: is it still exact when there are many different relations to a keyword? Which ontologies are suited to master the semantic matching of documents from a special domain properly?</p><p>In further work we would like to confirm our idea about an interplay of automated retrieval and a human editor, for example by experimenting with a certain amount of new vocabulary, which could then be classified into the ontology more easily within our framework.</p><p>Another way of improving the results is refining the indexing process by introducing an additional qualitative tagging of keywords in our vector representation. For example, if it is obvious that a special meaning of an entry is the only interpretation existing in a document, one cuts off background knowledge which does not fit this meaning, and obtains a better preprocessing.</p><p>To end our brief discussion, we mention another field of research, namely the question of how we could derive hints pointing out redundant or even improper ontological relations.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>= {THEMES: German foreign policy, Gerhard Schröder, IMF INDIVIDUAL KEYWORDS: Gerhard Schröder, Bill Clinton, German government, U.S. government, IMF, Caio Koch-Weser THEMATICAL BACKGROUND-KNOWLEDGE: Germany, German government, SPD, international organizations, foreign policy INDIVIDUAL BACKGROUND-KNOWLEDGE: German government, U.S. government, international organizations, USA, Germany} The entries under THEMATICAL BACKGROUND-KNOWLEDGE and INDIVIDUAL BACKGROUND-KNOWLEDGE depend on the modeling of the ontology; usually more keywords are listed. THEMATICAL BACKGROUND-KNOWLEDGE refers to the keywords from THEMES, INDIVIDUAL BACKGROUND-KNOWLEDGE belongs to the INDIVIDUAL KEYWORDS. Repetitions of keywords are possible and are intended to strengthen the importance of a keyword.</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">is at KOM and intelligent views,</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">is at intelligent views,</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">is at KOM and GMD-IPSI KOM, Technical University Of Darmstadt, Merckstr. 25, 64283 Darmstadt, Germany ||| intelligent views GmbH, Julius-Reiber-Str. 17, 64293 Darmstadt, Germany ||| GMD-IPSI, Dolivostr. 15, 64293 Darmstadt, Germany email: 1 afaatz@kom.tu-darmstadt.de, 2 kamps@i-views.de, 3 rst@kom.tu-darmstadt.de</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_3">. 'optimal' settings would be in contrast to quantifying individual and subjective judgements</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Using A large Linguistic Ontology For Internet-Based Retrieval Of Object-Oriented Components</title>
		<author>
			<persName><forename type="first">S</forename><surname>Borgo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Guarino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Masolo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Vetere</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Ninth International Conference on Software Engineering and Knowledge Engineering</title>
				<meeting>the Ninth International Conference on Software Engineering and Knowledge Engineering<address><addrLine>Madrid</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Beyond Market Baskets: Generalizing association rules to correlations</title>
		<author>
			<persName><forename type="first">S</forename><surname>Brin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Motwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Silverstein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1997 ACM SIGMOD Conference on Management of Data</title>
				<meeting>the 1997 ACM SIGMOD Conference on Management of Data</meeting>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">TopCat: Data Mining for Topic Identification in a Text Corpus</title>
		<author>
			<persName><forename type="first">C</forename><surname>Clifton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Cooley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the PKDD 1999</title>
				<meeting>the PKDD 1999</meeting>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Using background knowledge in the aggregation of imprecise evidence in databases</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mcclean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Scotney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shapcott</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Elsevier Journal of Data and Knowledge Engineering</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="issue">2</biblScope>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Piaget</surname></persName>
		</author>
		<title level="m">Biologie und Erkenntnis</title>
				<meeting><address><addrLine>Frankfurt/Main</addrLine></address></meeting>
		<imprint>
			<publisher>Fischer</publisher>
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Automatisches Indexieren als Erkennen abstrakter Objekte</title>
		<author>
			<persName><forename type="first">G</forename><surname>Knorz</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1983">1983</date>
			<publisher>Max Niemeyer Verlag</publisher>
			<pubPlace>Tübingen</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Semantic Information Processing</title>
		<editor>M. Minsky</editor>
		<imprint>
			<date type="published" when="1968">1968</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Weaving A Web: Structure and Creation of an Object Network Representing an Electronic Reference Framework</title>
		<author>
			<persName><forename type="first">L</forename><surname>Rostek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Möhr</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronic Publishing</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Automatische Erzeugung von semantischem Markup in Agenturmeldungen</title>
		<author>
			<persName><forename type="first">L</forename><surname>Rostek</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Möhr/Schmidt, SGML und XML</title>
				<meeting><address><addrLine>Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Introduction To Modern Information Retrieval</title>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Mcgill</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1983">1983</date>
			<publisher>McGraw Hill</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Knowledge Representation: Logical, Philosophical and Computational Foundations</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">F</forename><surname>Sowa</surname></persName>
		</author>
				<imprint>
			<publisher>PWS Publishing Company</publisher>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Schumie: Associative Conceptual Space-based Information Retrieval Systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Van Den Berg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
			<pubPlace>Delft</pubPlace>
		</imprint>
	</monogr>
	<note type="report_type">technical report</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
