<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Information Extraction Rules for Extending Domain Ontologies</article-title>
        <subtitle>Position Statement for the IJCAI-2001 Workshop on Ontology Learning</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Sintek</string-name>
          <email>sintek@dfki.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Junker</string-name>
          <email>junker@dfki.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ludger van Elst</string-name>
          <email>elst@dfki.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Abecker</string-name>
          <email>aabecker@dfki.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>German Research Center for Artificial Intelligence (DFKI), Knowledge Management Group</institution>
          ,
          <addr-line>P.O. Box 2080, D-67608 Kaiserslautern</addr-line>
          ,
          <country>Germany</country>
          . Phone:
          <phone>49 631 205 3210</phone>
        </aff>
      </contrib-group>
      <abstract>
        <title>Ontologies in the FRODO Project</title>
        <p>In the FRODO project [1] we aim at the development of a “Framework for Distributed Organizational Memories” (OMs). We start from the observation that knowledge and expertise are always heavily distributed in an organization. We accept that this is not an intermediary, imperfect state to be overcome by a central, ontologically structured information system, but rather a natural and meaningful situation: when OM systems are introduced, it is normal to start with small, focused systems which should interoperate later; much expertise is better created, held, and maintained locally; and in interorganizational collaborations or virtual teams a deeper integration of information systems cannot be achieved. Hence, a main goal of the FRODO project is to develop a scalable, extensible OM middleware built for easy integration of new components and linking of collaborating components [2]. FRODO builds upon the KnowMore framework for context-aware, ontology-based OMs [3,4], but relaxes some constraints of the original model, especially the idea of a centralized OM using one overall set of organizational ontologies.</p>
        <p>Besides the technical provisions for such a distributed, highly dynamic environment, we place special emphasis on the considerations and methods needed to realize such a scenario in industrial practice. In every industrial environment, besides the smooth introduction of new technology with regard to human factors and organizational processes, and besides modeling tools and method support for knowledge acquisition (in particular, of ontologies for structuring OMs or parts of OMs), at least two other factors are of utmost importance:</p>
        <p>One is the predominance of informal, i.e. essentially text-based, representations of knowledge. This is not merely a matter of fact but actually useful: the cost of formalization is often out of proportion to its potential benefits, so keeping many parts of the scenario informal is economically reasonable [5]. One implication is that methods for building formal models must also be affordable.</p>
        <p>The other is the fact that ontologies are not a stand-alone component built once and then left untouched, but a living element of the overall scenario, used for different purposes, communicating with other system parts, and representing knowledge about a continuously changing world [10].</p>
        <p>These two assumptions lead to two characteristics of our approach:</p>
        <p>Learning ontological information from text documents should be a main component of the overall scenario. We already set this goal in [3]. In the meantime we have sketched a method for business-process-oriented knowledge modeling in the company, realized as an amalgamation of the CommonKADS [6] and the IDEF5 [7] suites of methods [2]. We build upon the Protégé-2000 knowledge acquisition and modeling tool [8], which we have already extended with some modules for modeling, reasoning, and visualization (see [1]). We are currently working on integrating the commercial MindAccess® text analysis workbench [9], which offers a number of statistical document feature extraction and document analysis functions.</p>
        <p>In order to cope with the complexity and dynamics of real-world usage scenarios for ontologies in a distributed OM, we are developing a methodological framework for understanding and organizing the roles, responsibilities, rights, and obligations of the actors constituting an ontology society in a complex, agent-based OM system [10].</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Ontology Learning with Information Extraction Rules</title>
      <p>1. An initial, hand-crafted seed ontology of reasonable quality which already contains the relevant types of relationships between ontology concepts in the given domain.</p>
      <p>2. An initial set of documents which exemplarily represent (informally) substantial parts of the knowledge represented formally in the seed ontology.</p>
      <p>Now we assume that similar ontological phenomena (e.g., the fact that relationship R holds between concept A and concept B) are expressed in the text in similar ways. Consider, e.g., a medical domain with the fact that Disease A can be treated with Cure B (this is the relationship R). Such A-R-B instances of relationship R could, for instance, look like:</p>
      <p>My headache was cured by medication with Aspirin.</p>
      <p>Sue’s headache was addressed with acupuncture.</p>
    </sec>
    <sec id="sec-2">
      <p>Cancer can be treated with chemotherapy.</p>
    </sec>
    <sec id="sec-3">
      <p>Cancer is often treated with surgery.</p>
      <p>Our main idea is that, (i) given that texts explaining the ontological knowledge are available, and (ii) given that these texts express similar factual statements in sufficiently similar ways, it should be possible:</p>
      <p>1. To take the pairs of (ontological statement, one or more textual representations) as positive examples of how specific ontological statements can be reflected in texts. There are two ways to extract such examples:</p>
      <p>Based on the seed ontology, the system looks up the signature of a certain relation (e.g., R links a Disease with a Cure), searches for all occurrences of instances of the concept classes Disease and Cure that lie within a certain maximum distance of each other, and regards these co-occurrences as positive examples for relationship R. This approach presupposes that the seed documents have some “definitional” character, like domain-specific lexica or textbooks.</p>
      <p>The user goes through the seed documents with a marker and manually highlights all interesting passages as instances of some relationship. This approach is more work-intensive, but promises faster learning and more precise results. We have already employed this approach successfully in an industrial information extraction project [<xref ref-type="bibr" rid="ref11">12</xref>].</p>
      <p>2. Employ a pattern learning algorithm to automatically construct information extraction rules which abstract from the specific examples, thus creating general statements about which text patterns are evidence for a certain ontological relationship. In the example above, such an information extraction rule could have the form:</p>
    </sec>
    <sec id="sec-4">
      <p>In order to detect an instance of the “Method B is a possible Cure for Disease A” relationship, search for an instance of the concept Disease, check whether there is a synonym of the word (stem) “treat” within a distance of at most two words, then search for the word “with” within a distance of at most two words, directly followed by an instance of the concept Cure.</p>
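      <p>A minimal sketch of how the rule quoted above might be checked against a single sentence is given here. It is not the rule learner of [13,14]: the concept instance sets, the synonym stems, the crude stemmer, and the reading of “a distance of at most two words” as “at most two tokens in between” are all assumptions made for illustration, standing in for the seed ontology and for WordNet.</p>
      <preformat>
```python
# Hypothetical stand-ins for the seed ontology and for a WordNet synset lookup.
DISEASES = {"headache", "cancer"}
CURES = {"aspirin", "acupuncture", "chemotherapy", "surgery"}
TREAT_SYNONYM_STEMS = {"treat", "address", "handle"}
MAX_GAP = 2  # assumption: "at most two words" means at most two tokens in between


def tokenize(sentence):
    """Lowercase and strip punctuation; list indices serve as word positions."""
    return [word.strip(".,;:!?").lower() for word in sentence.split()]


def stem(word):
    """Crude suffix stripping for illustration only, not a real stemmer."""
    for suffix in ("ed", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: len(word) - len(suffix)]
    return word


def match_treatment_rule(sentence):
    """Return the (disease, cure) pairs that the rule detects in one sentence."""
    tokens = tokenize(sentence)
    n = len(tokens)
    matches = []
    for i, token in enumerate(tokens):
        if token not in DISEASES:
            continue
        # A synonym of "treat" may follow after at most MAX_GAP other tokens.
        for j in range(i + 1, min(i + 2 + MAX_GAP, n)):
            if stem(tokens[j]) not in TREAT_SYNONYM_STEMS:
                continue
            # "with" may follow after at most MAX_GAP other tokens ...
            for k in range(j + 1, min(j + 2 + MAX_GAP, n - 1)):
                # ... and must be directly followed by an instance of Cure.
                if tokens[k] == "with" and tokens[k + 1] in CURES:
                    matches.append((token, tokens[k + 1]))
    return matches


if __name__ == "__main__":
    for sentence in (
        "My headache was cured by medication with Aspirin.",
        "Sue's headache was addressed with acupuncture.",
        "Cancer can be treated with chemotherapy.",
        "Cancer is often treated with surgery.",
    ):
        print(sentence, match_treatment_rule(sentence))
```
      </preformat>
      <p>Under these assumptions the sketch matches the last three example sentences but not the first one, because “cured” is not among the assumed synonyms of “treat”. A learned rule set would presumably need further patterns to cover such variants.</p>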
      <p>In order to learn such information extraction rules, we need some prerequisites:</p>
      <p>(a) A sufficiently detailed representation of documents (in particular, including word positions, which conventional vector-based learning algorithms usually omit, as well as WordNet synsets and part-of-speech tags).</p>
      <p>(b) A sufficiently powerful representation formalism for extraction patterns.</p>
      <p>(c) A learning algorithm which has direct access to background knowledge sources, like the already available seed ontology containing statements about known concept instances, or like the WordNet database of lexical knowledge, which links words to their synonym sets and gives access to sub- and superclasses of synonym sets, etc.</p>
      <p>In [<xref ref-type="bibr" rid="ref12 ref13">13,14</xref>] we present an ILP-like (inductive logic programming) rule learner which fulfills these requirements. It is specifically adapted to the task of pattern-based text classification, which can be solved with the same methods as the information extraction task used in the ontology learning application. In particular, this rule learner relies on a document representation in which the order of words is preserved. Thus, learned text patterns can test the order and distance of specific words. In [<xref ref-type="bibr" rid="ref15">16</xref>] it is shown how its implementation concepts can be mapped to standard ILP approaches, and hence how its expressive power with respect to pattern representation can even be extended towards full LP formalisms, including recursive rules. In [<xref ref-type="bibr" rid="ref14">15</xref>] we elaborate on the integration of background knowledge sources, especially WordNet.</p>
      <p>3. Apply these learned information extraction rules to other, new text documents to discover new or not yet formalized instances of relationship R in the given application domain.</p>
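      <p>The first way of obtaining positive examples in step 1, looking up the signature of a relation in the seed ontology and collecting co-occurrences of instances of its two concept classes, might be sketched as follows. The instance sets, the signature, the function names, and the default maximum distance of six tokens are hypothetical; the text gives no concrete value for the distance.</p>
      <preformat>
```python
from itertools import product

# Hypothetical seed ontology: concept classes with their known instances.
SEED_INSTANCES = {
    "Disease": {"headache", "cancer"},
    "Cure": {"aspirin", "acupuncture", "chemotherapy", "surgery"},
}
# Hypothetical signature of relation R: it links a Disease with a Cure.
SIGNATURE_R = ("Disease", "Cure")


def tokenize(text):
    """Lowercase and strip punctuation; list indices serve as word positions."""
    return [word.strip(".,;:!?").lower() for word in text.split()]


def harvest_candidates(text, signature=SIGNATURE_R, max_distance=6):
    """Collect co-occurrences of instances of the two signature classes.

    Returns (domain instance, range instance, covering text span) triples for
    every pair whose token positions are at most max_distance apart.
    """
    tokens = tokenize(text)
    domain_class, range_class = signature
    domain_hits = [i for i, t in enumerate(tokens) if t in SEED_INSTANCES[domain_class]]
    range_hits = [i for i, t in enumerate(tokens) if t in SEED_INSTANCES[range_class]]
    candidates = []
    for i, j in product(domain_hits, range_hits):
        if max_distance >= abs(i - j):
            start, end = min(i, j), max(i, j)
            span = " ".join(tokens[start:end + 1])
            candidates.append((tokens[i], tokens[j], span))
    return candidates


if __name__ == "__main__":
    text = ("Cancer can be treated with chemotherapy. "
            "Sue's headache was addressed with acupuncture.")
    for candidate in harvest_candidates(text):
        print(candidate)
```
      </preformat>
      <p>On the two-sentence text in the usage example, the sketch also returns the pair (headache, chemotherapy), which spans a sentence boundary and does not express relationship R. Such spurious pairs illustrate why this way of collecting examples relies on seed documents with a “definitional” character.</p>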
    </sec>
    <sec>
      <title>Status</title>
      <p>The algorithm described has not yet been implemented or tested. However, all required prerequisites are available as described above and in [<xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">13,14,15,16</xref>]. Furthermore, we are in contact with several application projects (in the nuclear and the chemical industries) in order to obtain significant test data. A critical factor for the success of the approach will be how typical the textual representations of specific (kinds of) statements in the seed documents turn out to be.</p>
      <p>Compared to other ontology learning approaches, it should be noted that our technique is not restricted to learning taxonomic relationships, but can learn arbitrary relationships in an application domain. We expect that, in contrast to more statistically oriented approaches, which tend to produce too many candidate results (because of the many possibly relevant word co-occurrences), our approach needs more input and assumes more prerequisites, but the relationship candidates it finds will be correct with a higher probability.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>1. FRODO project homepage: http://www.dfki.uni-kl.de/frodo/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          2.
          <string-name>
            <surname>Abecker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernardi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Elst</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lauer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maus</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwarz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sintek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>FRODO: A Framework for Distributed Organizations - Milestone M1: Requirements Analysis and System Architecture</article-title>
          .
          <source>DFKI Document D-01-01</source>
          . In preparation. Partially in German.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          3.
          <string-name>
            <surname>Abecker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernardi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinkelmann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kühn</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sintek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Towards a Technology for Organizational Memories</article-title>
          .
          <source>IEEE Intelligent Systems</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ), May/June.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          4.
          <string-name>
            <surname>Abecker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernardi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinkelmann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kühn</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sintek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Context-Aware, Proactive Delivery of Task-Specific Knowledge: The KnowMore Project</article-title>
          .
          <source>International Journal on Information System Frontiers</source>
          , Kluwer,
          <volume>2</volume>
          (
          <issue>3</issue>
          /4).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buckingham Shum</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Balancing Formality with Informality: User-Centred Requirements for Knowledge Management Technologies</article-title>
          .
          <source>AIKM'97: AAAI Spring Symposium on Artificial Intelligence in Knowledge Management</source>
          , Stanford University, Palo Alto, CA. AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          6.
          <string-name>
            <surname>Schreiber</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akkermans</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anjewierden</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Hoog</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shadbolt</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van de Velde</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Wielinga</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Knowledge Engineering and Management: The CommonKADS Methodology</article-title>
          . MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          7.
          <collab>Information Integration for Concurrent Engineering</collab>
          (
          <year>1994</year>
          ).
          <article-title>IDEF5 Method Report</article-title>
          . URL: http://www.idef.com/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          8.
          <string-name>
            <surname>Grosso</surname>
            ,
            <given-names>W.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eriksson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fergerson</surname>
            ,
            <given-names>R.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennari</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>S.W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Musen</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Knowledge Modeling at the Millennium (The Design and Evolution of Protégé-2000)</article-title>
          . SMI-1999-0801. Stanford Medical Lab. URL: protege.stanford.edu
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          9. MindAccess product description (
          <year>2000</year>
          ).
          <article-title>Insiders information management GmbH, Kaiserslautern</article-title>
          . URL: http://www.im-insiders.de/html/infomaterial.html. In German.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          10.
          <string-name>
            <surname>van Elst</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Abecker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Ontology-Related Services in Agent-Based Distributed Information Infrastructures</article-title>
          . Submitted to:
          <source>SEKE'01, The Thirteenth International Conference on Software Engineering &amp; Knowledge Engineering</source>
          , June 13-15,
          <year>2001</year>
          , Buenos Aires - Argentina
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          11.
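          <string-name>
            <surname>Lavrac</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dzeroski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (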
          <year>1994</year>
          ).
          <article-title>Inductive Logic Programming: Techniques and Applications</article-title>
          . Chichester, UK: Ellis Horwood.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          12.
          <article-title>ANNOCLASS project description</article-title>
          . http://www.dfki.de/pas/f2w.cgi?daimc/annoclass-e
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          13.
          <string-name>
            <surname>Junker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Heuristisches Lernen von Regeln für die Textkategorisierung</article-title>
          .
          <source>Dissertation</source>
          . Fachbereich Informatik. Universität Kaiserslautern. In German.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          14.
          <string-name>
            <surname>Junker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Abecker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Learning Complex Pattern for Document Categorization</article-title>
          . In: AAAI98/ICML Workshop on Learning for Text Categorization. Madison, Wisconsin, USA.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          15.
          <string-name>
            <surname>Junker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Abecker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>1997</year>
          ).
          <article-title>Exploiting Thesaurus Knowledge in Rule Induction for Text Classification</article-title>
          . In: RANLP'97 - Recent Advances in NLP, pp.
          <fpage>202</fpage>
          -
          <lpage>207</lpage>
          , Tzigov Chark, Bulgaria.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          16.
          <string-name>
            <surname>Junker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sintek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Rinck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2000</year>
          ).
          <article-title>Learning for Text Categorization and Information Extraction with ILP</article-title>
          .
          In:
          <source>Learning Language in Logic</source>
          , Springer, LNCS 1925.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>