<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Effective Ontology Learning:</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Khalida Ben Sidi Ahmed</string-name>
          <email>send.to.khalida@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adil Toumouh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mimoun Malki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Djillali Liabes University</institution>
          ,
          <addr-line>Sidi Bel Abbes</addr-line>
          ,
          <country country="DZ">Algeria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <fpage>170</fpage>
      <lpage>178</lpage>
      <abstract>
        <p>Ontologies stand at the heart of the Semantic Web. Nevertheless, engineering heavyweight or formal ontologies is commonly judged to be a tough exercise that requires much time and heavy costs. Ontology learning is thus a response to this exigency and an approach to the `knowledge acquisition bottleneck'. Since texts are massively available everywhere, embodying experts' knowledge and know-how, it is of great value to capture the knowledge they contain. Our approach addresses the challenge of creating concepts' hierarchies from textual data. The significance of the solution stems from its exploitation of the Wikipedia encyclopedia to achieve good-quality results.</p>
      </abstract>
      <kwd-group>
        <kwd>domain ontologies</kwd>
        <kwd>ontology learning from texts</kwd>
        <kwd>concepts' hierarchy</kwd>
        <kwd>Wikipedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction: Ontology Learning</title>
      <p>
        However, ontology engineering is a tough exercise that can involve a great
deal of time and considerable cost. The need for (semi-)automatic extraction of
domain ontologies has thus quickly been felt by the research community; ontology
learning is the research field that answers it. This field provides automatic or
semi-automatic support for ontology engineering, and it has
the potential to reduce both the time and the cost of creating an ontology.
For this reason, a plethora of ontology learning techniques has been adopted
and various frameworks have been integrated with standard ontology engineering
tools [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Since full automation of these techniques remains in the distant
future, the ontology learning process is argued to be semi-automatic, with an
insistent need for human intervention.
      </p>
      <p>
        Most of the knowledge available on the Web is expressed as natural language text
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Establishing the Semantic Web therefore depends heavily on developing ontologies
for this category of input. This is why this paper focuses especially
on ontology learning from texts. One of the still-thorny issues of domain ontology
learning is building concepts' hierarchies. In this paper, we are primarily concerned
with creating domain concepts' hierarchies from texts, and we use Wikipedia
to foster the quality of our results. From this viewpoint, the literature reports
few research works dealing with this issue, and none harnesses Wikipedia
in the way our approach does.
      </p>
      <p>
        In fact, Wikipedia has recently shown new potential as a lexical semantic
resource [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. While this collaboratively constructed resource is used to compute
semantic relatedness [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] through its category system, the same system is also
used to derive large-scale taxonomies [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or to achieve knowledge acquisition
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The idea of harnessing the plain text of Wikipedia articles to acquire
knowledge is quite promising. Our approach capitalizes on the well-organized
Wikipedia articles to retrieve the most useful piece of information of all, namely the
definition of a concept.
      </p>
      <p>First, Section 2 describes the ontology learning layer cake. Section 3
moves on to the explanation of our approach, which is followed by the
corresponding evaluation in Section 4. Finally, Section 5 sheds light
on some conclusions and research perspectives.</p>
    </sec>
    <sec id="sec-2">
      <title>Ontology Learning Layer Cake</title>
      <p>
        The process of extracting a domain ontology can be decomposed into a set of
steps, summarized by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and commonly known as the "ontology learning layer
cake". The figure on the following page illustrates these steps.
      </p>
      <p>
        The first step of the ontology learning process is to extract the terms that
are important for describing a domain. A term is a basic semantic unit,
which can be simple or complex. Next, synonyms among this set of
terms should be extracted. This allows associating different words with the same
concept, whether in one language or across languages. These two layers
are called the lexical layers of the ontology learning cake. The third step is
to determine which of the extracted terms are concepts. According
to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a term can represent a concept if we can define its intension (the
definition, formal or otherwise, that encompasses all the objects the concept
describes), its extension (all the objects or instances of the given concept), and
its lexical realizations (a set of synonyms in different languages).
The extraction of concepts' hierarchies, our key concern, consists in finding the
'is-a' relationship, i.e. classes and subclasses, or hyperonyms. This phase is followed by
the extraction of non-taxonomic relations, which consists in seeking any
relationship that does not fit the previously described taxonomic framework. The
extraction of axioms is the final level of the learning process, and it is argued to
be the most difficult one. To date, few projects have attacked the discovery of
axioms and rules from text.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Concepts' Hierarchy Building Approach</title>
      <p>
        Our approach tackles primarily the construction of concepts' hierarchies from
text documents. Terminology extraction is performed with a tool dedicated
to this task, TermoStat [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The initial terms are then the subject of a
search for definitions within Wikipedia. By adapting the
lexico-syntactic patterns defined by [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] to our case, the hyperonyms of these terms are
learned. The process is iterative and comes to its end when a
predefined maximum number of iterations is reached. In parallel, our algorithm generates
a graph which unfortunately contains cycles and whose nodes may have
more than one hyperonym. The hierarchy we promise to build is the result of
transforming this graph into a forest that respects the hierarchic structure of a
taxonomy. The figure on the following page gives the overall idea of the proposed
approach.
In order to carry out the approach, we must first go through the two lexical
layers of ontology learning. The tool used to retrieve the domain
terminology is TermoStat. This web application was favored for specific
reasons: TermoStat takes a corpus of textual data and, by juxtaposing
it with a general corpus such as the BNC (British National Corpus), yields
the list of domain terms needed for the following step. Afterwards, we
look for synonyms among this list of candidate terms; the use of
thesaurus.com for selecting synonyms proved efficient. The third
layer can be skipped in our context, since concepts' hierarchy construction does not
depend on the concepts' definitions. In other words, our algorithm mainly needs,
for each set of synonyms (synset), the candidate term elected to represent it.
The set of initial candidate terms is named CO.
      </p>
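<p>As a toy illustration of these two lexical layers (the prototype described later is written in Java; this Python sketch uses an invented term list and synonym map, not TermoStat output, and the rule of electing the first-seen term as a synset's representative is an assumption):</p>

```python
def build_synsets(terms, synonyms):
    """Group candidate terms into synsets; the first term seen in a synset
    is elected as its representative."""
    representative = {}   # term -> elected representative of its synset
    c0 = []               # the set CO of initial candidate concepts
    for term in terms:
        # reuse the representative of an already-registered synonym, if any
        rep = next((representative[s] for s in synonyms.get(term, [])
                    if s in representative), None)
        if rep is None:
            rep = term
            c0.append(term)
        representative[term] = rep
    return c0, representative

# toy data: "helmet" and "hard hat" are synonyms
terms = ["helmet", "hard hat", "glove"]
synonyms = {"hard hat": ["helmet"], "helmet": ["hard hat"]}
c0, rep = build_synsets(terms, synonyms)
print(c0)               # ['helmet', 'glove']
print(rep["hard hat"])  # helmet
```

<p>Here "helmet" and "hard hat" collapse into one synset represented by "helmet", so CO keeps exactly one entry per synset.</p>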
      <p>Concepts' Hierarchy
The approach we propose belongs to two research paradigms, namely
concepts' hierarchy construction for ontology learning and, secondly, the use of
Wikipedia for knowledge extraction. The achievement of our solution relies
heavily on concepts from graph theory.
a. Hyperonyms' Learning using Wikipedia</p>
      <p>At the beginning of our algorithm, we have the following input data:
- G = (N, A) is a directed graph, where N is the set of nodes and A is
the set of arcs, with N = CO initially. Our objective is to extend the initial graph
with new nodes and arcs; the former are the hyperonyms and the latter
are the subsumption links. The extension of Ci, where i is the iteration index,
is done using the concepts' definitions extracted from Wikipedia.
- Cgen is a set of general concepts for which we will not look for hyperonyms.</p>
      <p>These elements are defined by the domain experts and include, for example,
object, element, human being, etc.</p>
      <p>S1 For each cj ∈ Ci, we check whether cj ∈ Cgen. If so, the concept is
skipped. Otherwise, we look for its definition in Wikipedia. The definition
of a given term is always the first sentence of the paragraph preceding the
table of contents of the corresponding article. Three cases may occur:
1. The term exists in Wikipedia and its article is accessible; we then
pass to the following step.
2. The concept is so ambiguous that the query leads to a Wikipedia
disambiguation page; in this situation, we ignore the word.
3. Finally, the word for which we seek a hyperonym does not exist in
Wikipedia's database; here again, we skip the element.</p>
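<p>Step S1 can be sketched as follows; the article store is a plain dict standing in for Wikipedia (the prototype accesses the live site through Jsoup), and the way disambiguation pages are flagged here is purely illustrative:</p>

```python
def first_sentence(text):
    """The definition is taken as the first sentence of the lead paragraph."""
    return text.split(". ")[0].rstrip(".") + "."

def lookup_definition(term, articles, general_concepts):
    """Return the definition of `term`, or None when the term must be
    skipped: too general (Cgen), missing, or a disambiguation page."""
    if term in general_concepts:
        return None                       # Cgen: no hyperonym is sought
    page = articles.get(term)
    if page is None:
        return None                       # case 3: not in Wikipedia
    if page.startswith("DISAMBIGUATION"):
        return None                       # case 2: ambiguous, ignore
    return first_sentence(page)           # case 1: article accessible

articles = {
    "helmet": "A helmet is a form of protective gear worn to protect the head. ...",
    "mercury": "DISAMBIGUATION: Mercury may refer to ...",
}
print(lookup_definition("helmet", articles, {"object"}))
# A helmet is a form of protective gear worn to protect the head.
print(lookup_definition("mercury", articles, {"object"}))   # None
print(lookup_definition("derrick", articles, {"object"}))   # None
```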
      <p>S2 To the definition of the given concept, we apply the principle of Hearst's
patterns, attempting to collect an exhaustive list of the key expressions
we need. For instance, the definition may contain: is a, refers to, is
a form of, consists of, etc. This procedure permits us to retrieve the
hyperonym of the concept cj. The new set of concepts is the input
for the following iteration.</p>
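<p>A minimal sketch of step S2, assuming a small cue-phrase list (the text names is a, refers to, is a form of, consists of; this list and the crude two-word head capture are illustrative, not the authors' exhaustive pattern set):</p>

```python
import re

# Longer cues come first so "is a form of" wins over the bare "is a".
CUES = r"(?:is a form of|is a kind of|refers to|consists of|is an|is a)"

def extract_hyperonym(definition):
    """Return up to two words following the first cue phrase, or None
    when no cue matches the definition sentence."""
    m = re.search(CUES + r"\s+(?:the\s+|a\s+|an\s+)?(\w+(?:\s+\w+)?)",
                  definition.lower())
    return m.group(1) if m else None

print(extract_hyperonym("A helmet is a form of protective gear worn on the head."))
# protective gear
print(extract_hyperonym("No cue phrase here."))  # None
```

<p>The new hyperonym ("protective gear") becomes a node for the next iteration; definitions with no recognized cue yield nothing and the concept is left as is.</p>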
      <p>S3 Add to the graph G the nodes corresponding to the hyperonyms and
the arcs that link these nodes.
b. From Graph to Forest</p>
      <p>
        The main idea shaping the following stage shares a lot with [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In fact,
the graph resulting from the preceding step has two imperfections. The
first is that many concepts are connected to more than one hyperonym.
In addition, the structure of the resulting graph is patently cyclic, which
does not accord with the definition of a hierarchy. An adequate treatment
is paramount in order to clean the graph of circuits as well as multiple
subsumption links. We thus obtain, at the end, a forest respecting the
structure of a hierarchy.
      </p>
      <p>The following illustrative graph is a piece of the whole graph that
we obtained during the evaluation of our approach. It represents a part of
drilling wells' HSE, namely PPE (Personal Protective Equipment). The
green rectangles are the initial candidate concepts.</p>
      <p>Resolving the first imperfection obviously implies resolving
the second one. We therefore use the following solution:
1. Weight the arcs so as to favor long paths within the graph: the value
assigned to an arc increases with its depth (as already done in fig. 3).
2. Apply Kruskal's algorithm [1956], which creates a maximum-weight
spanning forest from the graph (fig. 3).</p>
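<p>The two steps above can be sketched as follows, with illustrative depth weights; Kruskal's algorithm with a union-find keeps the heaviest arcs that close no cycle, yielding a maximum-weight spanning forest of the underlying undirected graph:</p>

```python
def kruskal_max_forest(nodes, weighted_arcs):
    """weighted_arcs: (weight, hyperonym, concept) triples; heavier = deeper."""
    parent = {n: n for n in nodes}

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    # heaviest arcs first -> maximum-weight spanning forest
    for w, u, v in sorted(weighted_arcs, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:                  # keep the arc only if it closes no cycle
            parent[ru] = rv
            forest.append((u, v))
    return forest

nodes = ["equipment", "protective gear", "helmet"]
arcs = [(1, "equipment", "protective gear"),
        (2, "protective gear", "helmet"),
        (1, "equipment", "helmet")]          # redundant shortcut arc
print(kruskal_max_forest(nodes, arcs))
# [('protective gear', 'helmet'), ('equipment', 'protective gear')]
```

<p>The lightly weighted shortcut arc from equipment directly to helmet is discarded, so helmet keeps a single hyperonym through the deeper path.</p>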
      <p>Finally we have reached the aim we have planned.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Our Approach's Evaluation</title>
      <p>Our evaluation corpus is a set of texts collected at the Algerian/British/Norwegian
joint venture Sonatrach / British Petroleum / Statoil. This specialized corpus
deals with the field of wells' drilling HSE. Throughout our approach,
interventions from the experts are inevitable.</p>
      <p>Tex2Tax is the prototype we developed in Java. Jsoup is the API that
allows us to access Wikipedia online; the same result is reached using JWPL
with the encyclopedia's dump. JUNG is the API used for managing
our graphs. The figure on the following page shows the GUI of our prototype.</p>
      <p>The terminology extraction phase and the synonym retrieval yielded a
collection of 259 domain concepts. The final graph is formed of 516 nodes and
893 arcs. After cleaning, the concepts' forest holds 323 nodes, among them
211 initial candidate terms, and 322 arcs remain. In order to study the
taxonomy's structure, we calculate the compression ratio of the nodes, which is
0.63 (323/516), and that of the arcs, which equals 0.36 (322/893).</p>
      <p>Fig. 4. Tex2Tax prototype's GUI</p>
      <p>LP = 0.63 (323/516).</p>
      <p>LR = 0.36 (322/893).</p>
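<p>These figures can be checked by direct computation; 322 arcs for 323 nodes is consistent with a forest structure, where each non-root node has exactly one parent:</p>

```python
# Node and arc counts reported above for the raw graph and the cleaned forest.
graph_nodes, graph_arcs = 516, 893
forest_nodes, forest_arcs = 323, 322
node_ratio = round(forest_nodes / graph_nodes, 2)
arc_ratio = round(forest_arcs / graph_arcs, 2)
print(node_ratio, arc_ratio)  # 0.63 0.36
```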
      <p>The precision of our taxonomy is relatively low. This is mainly
due to terms that do not exist in Wikipedia's database. The graph's
lopping is also responsible for some loss of nodes carrying appropriate domain
vocabulary.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Despite all the work done in the field of ontology learning, much
cooperation and many contributions and resources are still needed to truly
automate this process. Our approach is one of the few works that harness the
collaboratively constructed resource that is Wikipedia. The results achieved,
based on exploiting the idea of Hearst's lexico-syntactic patterns and on
graph pruning, are seen to be very promising. We intend to
improve our work by addressing further issues, such as enriching the research base
with the Web and exploiting Wikipedia's category system in order to attack
higher levels of the ontology learning process such as non-taxonomic relations.
Dealing with Wikipedia's disambiguation pages would be of great value, and
multilingual ontology learning is, in addition, an active research area that has
so far been only timidly evoked.</p>
      <p>Acknowledgement We are thankful to the Sonatrach / British Petroleum /
Statoil joint venture's President and its Business Support Manager for giving us
the approval to access the wells' drilling HSE corpus.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Cimiano</surname>, <given-names>P.</given-names></string-name>,
          <string-name><surname>Maedche</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Staab</surname>, <given-names>S.</given-names></string-name>, and
          <string-name><surname>Völker</surname>, <given-names>J.</given-names></string-name>.
          Ontology Learning. In: S. Staab and
          <string-name><given-names>R.</given-names> <surname>Studer</surname></string-name>
          (eds.),
          <source>Handbook on Ontologies, 2nd revised edition</source>. Springer,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>Proceedings of the Second Workshop on Ontology Learning OL'2001 (IJCAI'2001 Workshop on Ontology Learning)</source>, Seattle, USA, August 4,
          <year>2001</year>. CEUR Workshop Proceedings,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Maedche</surname>, <given-names>A.</given-names></string-name>.
          <source>Ontology Learning for the Semantic Web</source>. Kluwer Academic Publishing,
          <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Zouaq</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Nkambou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>A Survey of Domain Ontology Engineering: Methods and Tools</article-title>
          , In Nkambou, Bourdeau and Mizoguchi (Eds):
          <source>'Advances in Intelligent Tutoring Systems'</source>
          , Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><surname>Zesch</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Müller</surname>, <given-names>C.</given-names></string-name>, and
          <string-name><surname>Gurevych</surname>, <given-names>I.</given-names></string-name>.
          <article-title>Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary</article-title>. In:
          <source>Proceedings of the Conference on Language Resources and Evaluation (LREC)</source>. European Language Resources Association,
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ponzetto</surname>
            ,
            <given-names>S.P.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Strube</surname>
          </string-name>
          .
          <article-title>Knowledge Derived from Wikipedia for Computing Semantic Relatedness</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>30</volume>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><surname>Strube</surname>, <given-names>M.</given-names></string-name>
          and
          <string-name><surname>Ponzetto</surname>, <given-names>S. P.</given-names></string-name>.
          <article-title>WikiRelate! Computing semantic relatedness using Wikipedia</article-title>.
          <source>Proceedings of the National Conference on Artificial Intelligence (AAAI)</source>,
          <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Ponzetto</surname>, <given-names>S. P.</given-names></string-name>
          and
          <string-name><surname>Strube</surname>, <given-names>M.</given-names></string-name>.
          <article-title>Deriving a Large Scale Taxonomy from Wikipedia</article-title>.
          <source>AAAI '07</source>,
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Nastase</surname>, <given-names>V.</given-names></string-name>
          and
          <string-name><surname>Strube</surname>, <given-names>M.</given-names></string-name>.
          <article-title>Decoding Wikipedia Categories for Knowledge Acquisition</article-title>.
          <source>AAAI '08</source>,
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Ontology learning from text: An overview</article-title>. In:
          <source>Ontology Learning from Text: Methods, Evaluation and Applications, Frontiers in Artificial Intelligence and Applications Series</source>
          <volume>123</volume>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Drouin</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <article-title>Acquisition automatique des termes : l'utilisation des pivots lexicaux spécialisés</article-title>, PhD thesis, Montréal: Université de Montréal,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><surname>Hearst</surname>, <given-names>M. A.</given-names></string-name>
          and
          <string-name><surname>Schütze</surname>, <given-names>H.</given-names></string-name>.
          <article-title>Customizing a lexicon to better suit a computational task</article-title>.
          <source>Proceedings of the ACL SIGLEX Workshop on Acquisition of Lexical Knowledge from Text</source>,
          <year>1993</year>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Velardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Faralli</surname>
          </string-name>
          .
          <article-title>A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch</article-title>
          .
          <source>Proc. of the 22nd International Joint Conference on Artificial Intelligence</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>