<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Background Knowledge, Indexing and Matching: Interdependencies of Document Management and Ontology Maintenance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andreas Faatz</string-name>
          <email>afaatz@kom.tu-darmstadt.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Kamps</string-name>
          <email>kamps@i-views.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Steinmetz</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>1997</year>
      </pub-date>
      <abstract>
        <p>This position paper presents an algorithm that determines similarities between text documents. These text documents are indexed with keywords and with further background-knowledge terms from an ontology. The representation of the documents and the evaluation of the algorithm are used to let an ontology learn. This is shown to be one way of improving the results of the algorithm by improving the background knowledge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
Consider a human being reading texts from domains
that are, to a certain extent, familiar to him or her.
The reader is capable of grasping the semantics of the
text documents. Even if the person is not an expert in
any of the domains described in the texts, a minimal
comment we expect him or her to make is whether
two texts are similar or not. This kind of judgement
also covers text documents that are similar despite
using completely different vocabularies or sharing
just a few common terms. Similarities are part of the
intellectual construction of reality [
        <xref ref-type="bibr" rid="ref4">5</xref>
        ] and are generated by the words and
phrases the human mind associates with the actual text.
In a business application, grouping documents by
their similarity is subject to restrictions: the job has to
be done fast, for instance when managing the continuous
flow of short messages coming in to the editors of a
newspaper. Moreover, the document base in use by
the newspaper is too large for an editor to
retrieve all similar texts in time.
      </p>
      <p>
We now apply the above situation to a computer instead
of a human reader. Our goal is to express similarities
of text documents detected by an algorithm; hence a
semantic matching problem has to be solved. The
associations and heuristics that recognize similarities
beyond the equality of character strings have to be
modeled somehow; otherwise we are restricted to
plain full-text retrieval [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ], like many of the
web-based search engines taking HTML as input.
This paper offers some propositions about
a process in which an algorithm obtains a similarity
value from a pair of text documents. Before we
describe the algorithm, we take a brief look at how
the documents first have to be made readable to the
algorithm and in which fashion background
knowledge adds further information to the matching
process. Then we explain the algorithm: its way of
matching documents and the parameters it needs.
Finally, we give some hints concerning the evaluation
and improvement of the algorithm. This is the
point where background knowledge is affected by
our results, and we will distinguish objective and
subjective influences on the background knowledge.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 PREPROCESSING THE DATA</title>
      <p>
We consider a corpus of short text documents to be
given. Each document D comes with a vector
v(D) containing a description of its contents. The
vector is the result of abstracting a text into
descriptors; this can be done either by a knowledge worker
or, keeping in mind the constraints from the business
application we referred to in the introduction, by
automatic indexing [
        <xref ref-type="bibr" rid="ref5 ref8">6,9</xref>
        ]. Note that our approach
only works with a controlled vocabulary of
descriptors. Furthermore, we discuss a type of
background knowledge that meets the requirements of an
ontology.
      </p>
      <p>
To keep our discourse comprehensible, we define an
ontology to be a set of terms and their relationships.
An example of building such an ontology in an
object-oriented fashion can be found in [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ], for
diverse definitions of an ontology we refer to [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ].
To be precise, possible vector entries (index terms) in
v(D) must represent a controlled vocabulary V to
keep them computer-readable and comparable. The
index terms of the vocabulary V are exactly the
concepts of a predefined ontology, connected by the
ontological relations. The relations we work with are
typed semantic ones like 'is subconcept of', 'is
differential of' or 'is associated with'.
      </p>
      <p>As an example of an index vector, imagine a
text document D describing the German chancellor
Schröder visiting the U.S., where he meets President
Clinton and argues with him about the chair of the
IMF. The vectorial representation V(D) is:
V(D) = {THEMES: German foreign policy, Gerhard
Schröder, IMF;
INDIVIDUAL KEYWORDS: Gerhard Schröder,
Bill Clinton, German government, U.S. government,
IMF, Caio Koch-Weser;
THEMATICAL BACKGROUND-KNOWLEDGE:
Germany, German government, SPD, international
organizations, foreign policy;
INDIVIDUAL BACKGROUND-KNOWLEDGE:
German government, U.S. government, international
organizations, USA, Germany}
The entries under THEMATICAL BACKGROUND-KNOWLEDGE
and INDIVIDUAL BACKGROUND-KNOWLEDGE depend on the
modeling of the ontology; usually more keywords are
listed. THEMATICAL BACKGROUND-KNOWLEDGE refers to
the keywords from THEMES, while INDIVIDUAL
BACKGROUND-KNOWLEDGE belongs to the
INDIVIDUAL KEYWORDS. Repetitions of
keywords are possible and are intended to strengthen
the importance of a keyword.</p>
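For concreteness, the example index vector could be represented as follows; this is our sketch, and the dict-of-lists structure (repetitions allowed) is an assumption, not the original implementation:

```python
# Illustrative representation of the index vector V(D) from the
# example above; the category names follow the paper, the data
# structure is our assumption.
v_D = {
    "THEMES": ["German foreign policy", "Gerhard Schröder", "IMF"],
    "INDIVIDUAL KEYWORDS": ["Gerhard Schröder", "Bill Clinton",
                            "German government", "U.S. government",
                            "IMF", "Caio Koch-Weser"],
    "THEMATICAL BACKGROUND-KNOWLEDGE": ["Germany", "German government",
                                        "SPD",
                                        "international organizations",
                                        "foreign policy"],
    "INDIVIDUAL BACKGROUND-KNOWLEDGE": ["German government",
                                        "U.S. government",
                                        "international organizations",
                                        "USA", "Germany"],
}
```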
    </sec>
    <sec id="sec-3">
      <title>3 SEMANTIC MATCHING</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 The algorithm</title>
      <p>In contrast to classical full-text retrieval technology,
our method provides more structure. As seen in the
previous section, we include background
knowledge, which delivers more than synonyms. A
first version of the matching algorithm performs a
type of overlap measuring on the entries of a pair of
vectors. We named the measure 'frequency' because
of the way its functionality was implemented in the
Smalltalk programming language.</p>
      <p>Let us define the frequency measure of the similarity of
two sets of words as the number of words appearing
in both sets (whereby every repetition of a word is
counted separately) divided by the total number of all
words. An example: (sun, sun, rain) and (sun, sun,
snow) have the frequency 4/6.</p>
      <p>The output S(Q,P) of the matching algorithm is the
similarity of a pair of documents. In fact it is a
weighted sum of similarities S(a,f),...,S(d,i), where
a,...,d are the collections of keywords (i.e. the
vectorial entries) from the first index vector V(P) and
f,...,i are the collections of keywords from the second
vector V(Q). We assume the operation on
S(a,f),...,S(d,i) to be a linear one, which means that a
linear regression is able to estimate the participating
weights t,u,v,w. An estimation is necessary because
we do not know anything about the contribution of
each single similarity to the whole. In summary,
S(Q,P) = tS(a,f) + uS(b,g) + vS(c,h) + wS(d,i)
(1)
with t,u,v,w to be estimated.</p>
      <p>How do we get these weights? We take a
collection of pairs like (P,Q), in our case a
sample of size 50, and leave it up to a human to
assign the respective similarities S(Q,P). The rest is
done via a multi-linear regression, minimizing the
sum of squared errors analogous to the well-known
linear regression approach.</p>
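The weight estimation can be sketched as follows. This is our illustrative Python, using plain gradient descent on the squared error as a stand-in for the closed-form regression; the feature values standing in for the partial similarities S(a,f),...,S(d,i) and the "true" weights are made up for demonstration:

```python
import random

def fit_weights(features, targets, lr=0.05, epochs=5000):
    """Estimate the weights t,u,v,w of the linear model
    S(Q,P) = t*S(a,f) + u*S(b,g) + v*S(c,h) + w*S(d,i)
    by minimizing the sum of squared errors (no intercept,
    as in equation (1)). Gradient descent is a simple stand-in
    for the multi-linear regression described in the text."""
    n, k = len(features), len(features[0])
    weights = [0.0] * k
    for _ in range(epochs):
        grad = [0.0] * k
        for x, y in zip(features, targets):
            err = sum(w * xi for w, xi in zip(weights, x)) - y
            for j in range(k):
                grad[j] += 2.0 * err * x[j]
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights

# Hypothetical sample: 50 document pairs, each with four partial
# similarities; the human-assigned overall similarities are
# generated here from assumed true weights for demonstration.
random.seed(0)
true_weights = [0.4, 0.3, 0.2, 0.1]
X = [[random.random() for _ in range(4)] for _ in range(50)]
y = [sum(t * xi for t, xi in zip(true_weights, x)) for x in X]
```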
    </sec>
    <sec id="sec-5">
      <title>3.2 Improvement by feedbacking</title>
      <p>Actually, the following ideas are independent of the
estimation of the weights t,u,v,w itself. Let us return to
the environment in which the regression was
implemented. We already explained that the
indexing yielding the vectors V(D) strongly depends on
how far the ontology is developed. Thus the latter
also has a qualitative impact on the results of the
matching algorithm. We focus on improving the
algorithm by improving the ontology.</p>
      <p>
        First, a sub-optimal approach to judging an S(Q,P) is
to take as the value of similarity the percentage of
positive answers (given by test persons) to the
question whether Q and P are similar ('optimal'
settings would be at odds with quantifying individual
and subjective judgements). From here on we
apply a way of grouping keywords that is inspired
by [
        <xref ref-type="bibr" rid="ref2">3</xref>
        ], where the authors themselves proposed to
include background knowledge in their work. We
make use of the 'interestingness' measure: we want
to group keywords, as clusters with a high rate of
interestingness should give hints concerning
semantic relations between their members. The exact
semantics then have to be added by a human.
Let us define the interestingness [
        <xref ref-type="bibr" rid="ref1">2</xref>
        ] of a set of
keywords appearing in the same text document as the
ratio of the probability of occurrence of the set of
keywords to the product of the probabilities of
occurrence of the single keywords.
      </p>
      <p>Two starting points for structuring the documents
before extracting interesting clusters, a subjective
and an objective one, shall conclude our reasoning. A
subjective pre-grouping follows from what the
test persons perceive as similar: we only regard
clusters of keywords carrying a high average
interestingness within a collection C of similar documents. To
find C, we must also cluster the documents.
On the other hand, an objective pre-grouping is
introduced by defining C via the thematic entries and
clustering with respect to the theme. By objectivity
in this case we mean selecting a structure given by the
themes from the ontology. Here, a theme might
consist of several keywords.</p>
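The interestingness measure defined above can be sketched as follows; this is our illustrative Python, and modeling documents as sets of keywords is an assumption:

```python
def interestingness(keyword_set, documents):
    """Interestingness of a set of keywords over a corpus:
    P(all keywords co-occur in a document) divided by the
    product of the individual keyword probabilities
    (a lift-style measure, following the definition above)."""
    n = len(documents)
    p_joint = sum(1 for d in documents if keyword_set <= d) / n
    p_product = 1.0
    for kw in keyword_set:
        p_product *= sum(1 for d in documents if kw in d) / n
    return p_joint / p_product if p_product else 0.0

# Tiny hypothetical corpus: four documents as keyword sets.
docs = [{"IMF", "Germany"}, {"IMF", "Germany"}, {"USA"}, {"Germany"}]
```

A value above 1 indicates that the keywords co-occur more often than independence would predict, which is exactly the hint used for proposing new ontological relations.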
      <p>The last step is to present the interesting collections
of keywords resulting from either grouping to an
ontology engineer and let him or her decide whether
there is a reason to improve the ontology
by filling in relations he or she associates with the
interesting groups of keywords. Note that our
approach is strictly supervised learning.</p>
    </sec>
    <sec id="sec-6">
      <title>4 CONCLUSIONS</title>
      <p>From our rather optimistic point of view, there clearly
exist ideas for obtaining at least clues for maintaining
an ontology by reusing the output and evaluation of
a matching algorithm. The feedback of such an
algorithm is thus a human contribution to machine
learning: detecting related keywords that do not yet
have a relation in the ontology. Of course, the algorithm
using background knowledge has to prove its
strength, not only in matching documents but also in
the case of a growing ontology: is it still exact when
there are many different relations to a keyword?
Which ontologies are suited to master the semantic
matching of documents from a special domain properly?
In further work we would like to confirm our
idea of an interplay between automated retrieval and a
human editor, for example by experimenting with a
certain amount of new vocabulary, which could be
classified into the ontology more easily in our
framework.</p>
      <p>Another way of improving the results is to refine the
indexing process by introducing additional
qualitative tagging of keywords in our vector
representation. For example, if it is obvious that one special
meaning of an entry is the only interpretation
present in a document, one can cut off background
knowledge that does not match this meaning and
obtain a better preprocessing.</p>
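Such qualitative tagging could look as follows; the sense tags, terms, and filtering function are entirely our assumption, sketched only to illustrate cutting off non-matching background knowledge:

```python
def filter_background(tagged_entries, document_sense):
    """Keep only background-knowledge terms whose sense tag matches
    the single interpretation found in the document (illustrative;
    the text proposes such tagging without fixing a format)."""
    return [term for term, sense in tagged_entries
            if sense == document_sense]

# Hypothetical tagged background knowledge for the keyword 'IMF':
tagged = [("international organizations", "finance"),
          ("foreign policy", "finance"),
          ("instant message format", "computing")]
```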
      <p>To end our brief discussion, we mention another field
of research, namely the question of how we could
derive hints pointing out redundant or even
improper ontological relations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Motwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Silverstein</surname>
          </string-name>
          :
          <article-title>Beyond Market Baskets: Generalizing association rules to correlations</article-title>
          ,
          <source>Proceedings of the 1997 ACM SIGMOD Conference on Management of Data</source>
          ,
          <year>1997</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Clifton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cooley</surname>
          </string-name>
          :
          <article-title>TopCat: Data Mining for Topic Identification in a Text Corpus</article-title>
          ,
          <source>Proceedings of the PKDD 1999</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>McClean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Scotney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shapcott</surname>
          </string-name>
          :
          <article-title>Using background knowledge in the aggregation of imprecise evidence in databases</article-title>
          ,
          <source>Elsevier Journal of Data and Knowledge Engineering</source>
          , Vol.
          <volume>32</volume>
          /2, 2000
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Piaget</surname>
          </string-name>
          : Biologie und Erkenntnis, Fischer, Frankfurt/Main, 1992
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Knorz</surname>
          </string-name>
          <article-title>: Automatisches Indexieren als Erkennen abstrakter Objekte</article-title>
          , Max Niemeyer Verlag, Tübingen,
          <year>1983</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Minsky</surname>
          </string-name>
          (ed.):
          <source>Semantic Information Processing</source>
          , MIT Press,
          <year>1968</year>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rostek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Möhr</surname>
          </string-name>
          :
          <article-title>Weaving A Web: Structure and Creation of an Object Network Representing an Electronic Reference Framework</article-title>
          ,
          <source>Electronic Publishing 6</source>
          ,
          <fpage>1994</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rostek</surname>
          </string-name>
          <article-title>: Automatische Erzeugung von semantischem Markup in Agenturmeldungen</article-title>
          , in: Möhr/Schmidt, SGML und XML, Springer, Heidelberg 1999
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            <surname>McGill</surname>
          </string-name>
          :
          <source>Introduction To Modern Information Retrieval</source>
          , McGraw Hill, New York,
          <year>1983</year>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.F.</given-names>
            <surname>Sowa</surname>
          </string-name>
          <article-title>: Knowledge Representation Logical, Philosophical and Computational Foundations</article-title>
          , PWS Publishing Company,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>van den Berg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schumie</surname>
          </string-name>
          :
          <article-title>Associative Conceptual Space-based Information Retrieval Systems</article-title>
          ,
          <source>technical report, Delft</source>
          ,
          <year>1999</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>