=Paper= {{Paper |id=Vol-31/paper-12 |storemode=property |title=Background Knowledge, Indexing and Matching Interdependencies of Document Management and Ontology-Maintenance |pdfUrl=https://ceur-ws.org/Vol-31/AFaatz_11.pdf |volume=Vol-31 |dblpUrl=https://dblp.org/rec/conf/ecai/FaatzKS00 }} ==Background Knowledge, Indexing and Matching Interdependencies of Document Management and Ontology-Maintenance== https://ceur-ws.org/Vol-31/AFaatz_11.pdf
Background Knowledge, Indexing and Matching:
Interdependencies of Document Management and Ontology-Maintenance

Andreas Faatz1, Thomas Kamps2, Ralf Steinmetz3


1 KOM, Technical University of Darmstadt, Merckstr. 25, 64283 Darmstadt, Germany, and intelligent views GmbH; email: afaatz@kom.tu-darmstadt.de
2 intelligent views GmbH, Julius-Reiber-Str. 17, 64293 Darmstadt, Germany; email: kamps@i-views.de
3 KOM, Technical University of Darmstadt, and GMD-IPSI, Dolivostr. 15, 64293 Darmstadt, Germany; email: rst@kom.tu-darmstadt.de

Abstract. This position paper presents an algorithm which determines similarities between text documents. The text documents are indexed with keywords and further background-knowledge terms from an ontology. The representation of the documents and the evaluation of the algorithm are used to let an ontology learn. This is shown to be one way of improving the results of the algorithm by improving the background knowledge.

1 INTRODUCTION

Consider a human being reading texts from domains which are, to a certain extent, familiar to him or her. The reader is capable of grasping the semantics of the text documents. Even if the person is not an expert in any of the domains described in the texts, the minimal comment we expect him or her to make is whether two texts are similar or not. This kind of judgement also covers text documents which possess similarities even though they contain a completely different vocabulary or share just a few common terms. Similarities are a part of the intellectual construction of reality [5] and are generated by the words and phrases the human mind associates with the actual text.

In a business application, grouping documents by their similarity is subject to restrictions: the job has to be done fast, for instance when managing the continuous flow of short messages coming in to the editors of a newspaper. Moreover, the document base in use by the newspaper is too large for an editor to retrieve all similar texts in time.

We apply the above situation to a computer instead of a human reader. Our goal is to express similarities of text documents detected by an algorithm. Hence a semantic matching problem is to be solved. The associations and heuristics recognizing similarities beyond equalities of character strings have to be modeled somehow; otherwise we are restricted to plain full-text retrieval [10], like many of the web-based search engines taking HTML as an input.

The following paper yields some propositions about a process in which an algorithm obtains a value of similarity from a pair of text documents. Before we describe the algorithm, we take a brief look at how the documents first have to be made readable to the algorithm and in which fashion background knowledge adds further information to the matching process. Then we explain the algorithm: its way of matching documents and the parameters it needs. Finally we give some hints concerning the evaluation and improvement of the algorithm. This will be the point where background knowledge gets affected by our results, and we will distinguish objective and subjective influences on the background knowledge.

2 PREPROCESSING THE DATA

We consider a corpus of short text documents to be given. Any document D comes with a vector V(D) containing a description of its contents. The vector is the result of abstracting a text into descriptors; this can be done either by a knowledge worker or, keeping in mind the constraints from the business application we referred to in the introduction, by automatic indexing [6,9]. Note that our approach only works in the case of a controlled vocabulary of descriptors. Furthermore we discuss a type of background knowledge meeting the requirements of an ontology.

To keep our discourse comprehensible we define an ontology to be a set of terms and their relationships. An example of building such an ontology in an object-oriented fashion can be found in [8]; for diverse definitions of an ontology we refer to [11].

To be precise, the possible vector entries (index terms) in V(D) must represent a controlled vocabulary V to keep them computer-readable and capable of comparisons.
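As a minimal sketch of this preprocessing constraint, an indexer might accept only descriptors drawn from V. The vocabulary below is a toy set built from terms used elsewhere in this paper; the function name is our own illustration, not the authors' implementation:

```python
# Toy controlled vocabulary V; any descriptor outside V is rejected,
# so that index vectors stay computer-readable and comparable.
CONTROLLED_VOCABULARY = {
    "German foreign policy", "Gerhard Schröder", "IMF",
    "Bill Clinton", "German government", "U.S. government",
}

def index_document(descriptors):
    """Build an index vector V(D) from a list of descriptors,
    keeping repetitions (they are used later to strengthen the
    importance of a keyword)."""
    unknown = [d for d in descriptors if d not in CONTROLLED_VOCABULARY]
    if unknown:
        raise ValueError("descriptors outside the controlled vocabulary: %r" % unknown)
    return list(descriptors)
```

A descriptor such as "Tony Blair" that is not part of V would raise an error instead of silently entering the vector.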
The index terms of the vocabulary V are exactly the concepts of a predefined ontology, connected by the ontological relations. The relations we work with are typed semantic ones like 'is subconcept of', 'is differential of' or 'is associated with'. As an example of an index vector, imagine a text document D describing the German chancellor Schröder visiting the U.S., where he meets President Clinton and argues with him about the chair of the IMF. The vectorial representation V(D) is:

V(D) =
{THEMES: German foreign policy, Gerhard Schröder, IMF
INDIVIDUAL KEYWORDS: Gerhard Schröder, Bill Clinton, German government, U.S. government, IMF, Caio Koch-Weser
THEMATICAL BACKGROUND KNOWLEDGE: Germany, German government, SPD, international organizations, foreign policy
INDIVIDUAL BACKGROUND KNOWLEDGE: German government, U.S. government, international organizations, USA, Germany}

The entries under THEMATICAL BACKGROUND KNOWLEDGE and INDIVIDUAL BACKGROUND KNOWLEDGE depend on the modeling of the ontology; usually more keywords are listed. THEMATICAL BACKGROUND KNOWLEDGE refers to the keywords from THEMES, INDIVIDUAL BACKGROUND KNOWLEDGE belongs to the INDIVIDUAL KEYWORDS. Repetitions of keywords are possible and intended to strengthen the importance of a keyword.

3 SEMANTIC MATCHING

3.1 The algorithm

In contrast to classical full-text retrieval technology our method provides more structure. As seen in the last paragraph, we include background knowledge, which delivers more than synonyms. A first version of the matching algorithm deals with a type of overlap measuring of the entries of a pair of vectors. We named the measure 'frequency' because of the way its functionality was implemented in the Smalltalk programming language.

Let us define the frequency measure of the similarity of two sets of words as the number of words appearing in both sets (whereby every repetition of a word is counted separately) divided by the total of all words. An example: (sun, sun, rain) and (sun, sun, snow) have the frequency 4/6.

The output S(Q,P) of the matching algorithm is the similarity of a pair of documents. In fact it is a weighted sum of similarities S(a,f),...,S(d,i), where a,...,d are the collections of keywords (i.e. the vectorial entries) from the first index vector V(P) and f,...,i are the collections of keywords from the second vector V(Q). We assume the operation on the S(a,f),...,S(d,i) to be a linear one, which means that a linear regression is able to estimate the participating weights t,u,v,w. An estimation is necessary because we do not know anything about the contribution of each single similarity to the whole. We summarize

S(Q,P) = tS(a,f) + uS(b,g) + vS(c,h) + wS(d,i)    (1)

with the t,u,v,w to be estimated.

How do we get these weights? We have to take a collection of pairs like (P,Q) (in our case we took a sample of size 50) and leave it up to a human to assign the respective similarities S(Q,P). The rest is done via a multi-linear regression, minimizing sums of squared errors analogous to the well-known linear regression approach.

3.2 Improvement by feedback

Actually the following ideas are independent of the guessing of the weights t,u,v,w itself. Let us return to the environment in which the regression was implemented. We already explained that the indexing underlying the vectors V(D) strongly depends on how far the ontology is developed. Thus the latter fact also has a qualitative impact on the results of the matching algorithm. We focus on improving the algorithm by improving the ontology.

First, a sub-optimal¹ approach for judging an S(Q,P) is taking as the value of similarity the percentage of positive answers (given by testing persons) to the question whether Q and P are similar. From now on we apply a way of grouping keywords which is inspired by [3], where the authors themselves proposed to include background knowledge in their work. We make use of the 'interestingness' measure. We want to group keywords, as the clusters with a high rate of interestingness should give hints concerning semantic relations between their participants. The exact semantics then have to be added by a human.

Let us define the interestingness [2] of a set of keywords appearing in the same text document as the ratio of the probability of the set of keywords to the product of the probabilities of occurrence of the single keywords.

Two starting points of structuring the documents before extracting interesting clusters, a subjective and an objective one, shall finish our reasonings.

¹ 'Optimal' settings would be in contrast to quantifying individual and subjective judgements.
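The interestingness measure can be sketched as follows, with each probability estimated as a document frequency over the corpus. This is our own minimal illustration, not the implementation used for the experiments:

```python
def interestingness(keyword_set, documents):
    """Interestingness [2] of a set of keywords: the probability of
    the whole set co-occurring in one document, divided by the
    product of the occurrence probabilities of the single keywords.
    `documents` is a list of keyword sets; probabilities are
    estimated as document frequencies."""
    n = len(documents)
    p_joint = sum(1 for doc in documents if keyword_set <= doc) / n
    p_independent = 1.0
    for keyword in keyword_set:
        p_independent *= sum(1 for doc in documents if keyword in doc) / n
    # A ratio above 1 means the keywords co-occur more often than
    # expected under independence, hinting at a semantic relation.
    return p_joint / p_independent if p_independent > 0 else 0.0
```

On a toy corpus in which two keywords always appear together, the ratio exceeds 1; such clusters are the ones worth presenting to an ontology engineer.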
A subjective pre-grouping follows from what the testing persons perceive as similar: we only consider clusters of keywords carrying a high average of interestingness in a collection C of similar documents. To find C, we must also cluster the documents.

On the other hand, an objective pre-grouping is introduced by defining C via the thematic entries and clustering with respect to the theme. By objectivity in this case we denote selecting a structure given by the themes from the ontology. Here, a theme might consist of several keywords.

The last step is to present the interesting collections of keywords resulting from either grouping to an ontology engineer and to let him or her decide whether there is a reason why the ontology might be improved by filling in relations he or she associates with the interesting groups of keywords. Note that our approach deals with strictly supervised learning.

4 CONCLUSIONS

From our rather optimistic point of view there clearly exist ideas how to attain at least clues for maintaining an ontology by reuse of the output and evaluation of a matching algorithm. So the feedback of such an algorithm is a human contribution to machine learning: detecting related keywords which do not have a relation in the ontology yet. Of course the algorithm using background knowledge has to prove its strength, not only in matching documents but also in the case of a growing ontology: is it still exact when there are many different relations to a keyword? Which ontologies master the semantic matching of documents from a special domain properly?

In further work we would like to confirm our idea about an interplay of automated retrieval and a human editor, for example by experimenting with a certain amount of new vocabulary, which could be classified into the ontology more easily in our framework.

Another way of improving the results is refining the indexing process by introducing an additional qualitative tagging of keywords in our vector representation. For example, if it is obvious that one special meaning of an entry is the only interpretation occurring in a document, one can cut off background knowledge that does not match this meaning and thus obtain a better preprocessing.

To end our brief discussion, we mention another field of research, namely the question of how we could derive hints which point out redundant or even improper ontological relations.

REFERENCES

[1] S. Borgo, N. Guarino, C. Masolo, G. Vetere: Using a Large Linguistic Ontology for Internet-Based Retrieval of Object-Oriented Components, Proceedings of the Ninth International Conference on Software Engineering and Knowledge Engineering, Madrid, 1997
[2] S. Brin, R. Motwani, C. Silverstein: Beyond Market Baskets: Generalizing Association Rules to Correlations, Proceedings of the 1997 ACM SIGMOD Conference on Management of Data, 1997
[3] C. Clifton, R. Cooley: TopCat: Data Mining for Topic Identification in a Text Corpus, Proceedings of PKDD 1999
[4] S. McClean, B. Scotney, M. Shapcott: Using Background Knowledge in the Aggregation of Imprecise Evidence in Databases, Elsevier Journal of Data and Knowledge Engineering, Vol. 32/2, 2000
[5] J. Piaget: Biologie und Erkenntnis, Fischer, Frankfurt/Main, 1992
[6] G. Knorz: Automatisches Indexieren als Erkennen abstrakter Objekte, Max Niemeyer Verlag, Tübingen, 1983
[7] M. Minsky (ed.): Semantic Information Processing, MIT Press, 1968
[8] L. Rostek, D. Fischer, W. Möhr: Weaving a Web: Structure and Creation of an Object Network Representing an Electronic Reference Framework, Electronic Publishing 6, 1994
[9] L. Rostek: Automatische Erzeugung von semantischem Markup in Agenturmeldungen, in: Möhr/Schmidt, SGML und XML, Springer, Heidelberg, 1999
[10] G. Salton, M.J. McGill: Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983
[11] J.F. Sowa: Knowledge Representation: Logical, Philosophical and Computational Foundations, PWS Publishing Company, 1998
[12] J. van den Berg, M. Schumie: Associative Conceptual Space-based Information Retrieval Systems, technical report, Delft, 1999