<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Knowledge extraction from webpages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sylvain Tenier</string-name>
          <email>tenierg@inist.fr</email>
          <email>tenierg@loria.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amedeo Napoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xavier Polanco</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yannick Toussaint</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut National de l'Information Scientifique et Technique, 54514 Vandoeuvre-lès-Nancy</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratoire Lorrain de Recherche en Informatique et ses Applications BP 239</institution>
          ,
          <addr-line>54506 Vandoeuvre-lès-Nancy Cedex</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>101</fpage>
      <lpage>104</lpage>
      <abstract>
        <p>This article presents a system that extracts knowledge from webpages by producing semantic annotations. Taking into account semantic information from the domain to annotate an element in a webpage implies solving two problems: (1) identifying the syntactic structure of this element in the webpage and (2) identifying the most specific concept (in terms of subsumption) of the ontology that will be used to annotate this element. Our approach relies on a wrapper-based machine learning algorithm combined with reasoning that makes use of the formal structure of the ontology.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>Extracting knowledge</title>
      <sec id="sec-2-1">
        <title>Manual annotation</title>
        <p>The system is based on machine learning techniques (fig. 1). It learns from a few
examples which data are to be extracted. Here, the examples are webpages annotated
by hand according to an ontology. This task is performed in a dedicated
environment in which the user is presented with a page to annotate and the concepts of the
ontology in a browser-like interface. The user then annotates a few occurrences
of the concepts of the ontology that the system should identify. The number of
annotations needed depends on the regularity of the page. If the data are strongly
structured, as in tables, only two or three examples are needed. For example, to
extract knowledge about research teams, the SWRC ontology (fig. 2) is loaded
along with a page presenting the persons working in a team and the projects
they are involved in. The user then annotates some data that are instances of
the concept Person and some that are instances of the concept Project. The output
is a marked document in which the annotations are embedded.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Wrapper induction using the tree structure of the page</title>
        <p>
          The learning algorithm is derived from Kushmerick's work on wrapper induction,
which identifies classes of wrappers that can be learnt using a deterministic
algorithm with low complexity [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. A wrapper is a procedure that uses the syntactic
regularities of a document to extract data. This is particularly suited to
semistructured documents like webpages. We have adapted Kushmerick's work to
make use of the tree structure of a webpage provided by the W3C's Document
Object Model (DOM). This model defines a path leading to each data item in the
document. The marked page is taken as input, and the similarities between paths
leading to data that are instances of the same concept are learnt. The output
is a wrapper that is applied to pages whose tree structure is similar to
that of the example page in order to extract the data and their relationships.
        </p>
        <p>
In step 3, the relevant wrappers are applied to the documents to extract data.
For each extracted data item, an instance of the concept it belongs to is added to the
knowledge base (KB) together with its relationships with other data. The resulting KB is a graph
implemented using the Resource Description Framework (RDF), which connects
each individual to the ontology and to the individuals it is related to. At that point,
one problem is that the extracted knowledge is not specific enough. The reason is
that a wrapper must be as generic as possible in order to extract all the relevant
data in a document. Therefore, it is induced using the most general concept
available. For example, to extract people from a research team according to the
SWRC ontology, the wrapper will be designed to recognize any instance of the
concept Person instead of more specific concepts like AcademicStaff or Student.
The second limitation is that the semantics of the relationships is unknown, since
they are extracted using syntactic properties (for instance, two data items are related
if they have a common parent node).
        </p>
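        <p>The path-based induction described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sample page, the function names (paths_with_text, induce, matches, extract, related, cell) and the rule of replacing differing child indexes with a wildcard are all hypothetical, and Python's xml.etree.ElementTree stands in for a real DOM on a well-formed page.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page: one table row per team member.
PAGE = """
<html><body><table>
<tr><td>Person A</td><td>Project X</td></tr>
<tr><td>Person B</td><td>Project Y</td></tr>
<tr><td>Person C</td><td>Project X</td></tr>
</table></body></html>
""".strip()


def paths_with_text(root):
    """Map the path of every text-bearing element to its text.

    A path is a tuple of (tag, index among the parent's children) steps.
    """
    found = {}

    def walk(node, path):
        if node.text and node.text.strip():
            found[path] = node.text.strip()
        for index, child in enumerate(node):
            walk(child, path + ((child.tag, index),))

    walk(root, ((root.tag, 0),))
    return found


def induce(example_paths):
    """Generalise example paths: keep shared steps, wildcard differing indexes."""
    if len({len(path) for path in example_paths}) != 1:
        raise ValueError("examples must have the same depth")
    pattern = []
    for steps in zip(*example_paths):
        tags = {tag for tag, _ in steps}
        if len(tags) != 1:
            raise ValueError("examples must follow the same tags")
        indexes = {index for _, index in steps}
        pattern.append((tags.pop(), indexes.pop() if len(indexes) == 1 else None))
    return tuple(pattern)


def matches(pattern, path):
    """True if the path has the pattern's tags and its non-wildcard indexes."""
    return len(pattern) == len(path) and all(
        p_tag == tag and p_index in (None, index)
        for (p_tag, p_index), (tag, index) in zip(pattern, path)
    )


def extract(pattern, page_paths):
    """Return the (path, text) pairs of a page that the wrapper matches."""
    return [(path, text) for path, text in page_paths.items() if matches(pattern, path)]


def related(left, right):
    """Pair texts whose elements share a parent node (a purely syntactic link)."""
    return [(a, b) for pa, a in left for pb, b in right if pa[:-1] == pb[:-1]]


def cell(row, col):
    """Path of a table cell in the sample page (stands in for a manual annotation)."""
    return (("html", 0), ("body", 0), ("table", 0), ("tr", row), ("td", col))


page_paths = paths_with_text(ET.fromstring(PAGE))

# Two hand-annotated examples per concept, as for strongly structured data.
person_wrapper = induce([cell(0, 0), cell(1, 0)])
project_wrapper = induce([cell(0, 1), cell(1, 1)])

people = extract(person_wrapper, page_paths)
projects = extract(project_wrapper, page_paths)
pairs = related(people, projects)

print([text for _, text in people])
print(pairs)
```
]]></preformat>
        <p>In this sketch, two annotated rows are enough to generalise the row index, so the third row is extracted without having been annotated. The pairs returned by related carry no semantics: they only record that two items share a parent node, which is the limitation noted above.</p>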
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Refining the knowledge</title>
      <p>With an ontology implemented in the knowledge representation language OWL,
reasoning mechanisms are provided. One of them is instantiation, which, given
an individual and a set of concepts, finds the most specific concept the
individual belongs to with respect to the subsumption ordering. For example, in the
SWRC ontology, two concepts are subsumed by the Person concept and have
a role whose filler must be an instance of the Project concept. Therefore, the
individuals in the KB that are instances of Person and have a relationship
to a Project must be instances of either AcademicStaff or PhdStudent, and the
relationship is identified as being the Project role. The new knowledge is inserted
into the KB.</p>
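      <p>The refinement step can be illustrated with a small sketch. It is a plain-Python stand-in for the instantiation service of an OWL reasoner, not the system's actual code: the concept names come from the text, while the shape of the hierarchy, the role name "project", the predicate names "relatedTo" and "candidateType", and the identifiers p1, p2 and pr1 are assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Toy ontology fragment. Concept names come from the text; the hierarchy shape
# and the role name "project" are assumptions made for this sketch.
SUBCLASS_OF = {
    "AcademicStaff": "Person",
    "Student": "Person",
    "PhdStudent": "Student",
}

# (domain concept, role, range concept): the role filler must be in the range.
ROLE_RESTRICTIONS = [
    ("AcademicStaff", "project", "Project"),
    ("PhdStudent", "project", "Project"),
]


def is_subsumed_by(concept, ancestor):
    """True if `ancestor` subsumes `concept` (reflexive and transitive)."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def refine(triples):
    """Add candidate concepts and name untyped relationships where possible."""
    types = {s: o for s, p, o in triples if p == "type"}
    refined = []
    for s, p, o in triples:
        if p != "relatedTo":
            refined.append((s, p, o))
            continue
        # Restrictions whose domain specialises the subject's concept and
        # whose range subsumes the object's concept.
        hits = [
            (domain, role)
            for domain, role, range_ in ROLE_RESTRICTIONS
            if is_subsumed_by(domain, types.get(s))
            and is_subsumed_by(types.get(o), range_)
        ]
        roles = {role for _, role in hits}
        # Name the relationship only when every matching restriction agrees.
        refined.append((s, roles.pop() if len(roles) == 1 else p, o))
        refined.extend((s, "candidateType", domain) for domain, _ in hits)
    return refined


# KB as produced by the wrappers: general concepts and an unnamed relationship.
kb = [
    ("p1", "type", "Person"),
    ("pr1", "type", "Project"),
    ("p1", "relatedTo", "pr1"),
    ("p2", "type", "Person"),
]
refined_kb = refine(kb)
for triple in refined_kb:
    print(triple)
```
]]></preformat>
      <p>Here p1, which is related to a Project, gains the two candidate concepts AcademicStaff and PhdStudent and its relationship is renamed to the role, whereas p2, which has no such relationship, stays a plain Person.</p>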
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        We have presented a system that integrates semantics into the annotation
generation process by making use of the reasoning mechanisms provided by the
ontology according to which webpages are annotated. This requires not only concept
instances but also role instances to be extracted. Early work on webpage
annotation, such as Annotea [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], aimed at enabling collaborative work between people.
The need for annotations that machines could understand and reason with led to
systems such as S-CREAM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which produce semantic annotations according to
ontologies based on Description Logics. Since manual annotation is a tedious and
error-prone task, machine learning systems have been proposed; S-CREAM and
MnM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] integrate Amilcare [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a semi-automatic tool that produces
extraction rules from a corpus to generate concept instances; however, role instances
are not dealt with. Recently, fully automatic systems like Armadillo [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or
C-PANKOW [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have been presented that make use of the redundancy of data on the
Web. Our system relies on the hypothesis of a mapping between the syntax
and the semantics of a webpage. Since its efficiency depends on the presence of
regularities in the structure, pages from research team websites are well suited
for this task.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kiryakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ognyanoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirilov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goranov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Semantic annotation, indexing, and retrieval</article-title>
          . In: International Semantic Web Conference. (
          <year>2003</year>
          )
          <fpage>484</fpage>
          –
          <lpage>499</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tobies</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Practical reasoning for very expressive description logics</article-title>
          .
          <source>CoRR cs.LO/0005013</source>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kushmerick</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doorenbos</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          :
          <article-title>Wrapper induction for information extraction</article-title>
          .
          <source>In: IJCAI (1)</source>
          . (
          <year>1997</year>
          )
          <fpage>729</fpage>
          –
          <lpage>737</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kahan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koivunen</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          :
          <article-title>Annotea: an open RDF infrastructure for shared Web annotations</article-title>
          .
          <source>In: WWW '01: Proceedings of the 10th international conference on World Wide Web</source>
          , New York, NY, USA, ACM Press (
          <year>2001</year>
          )
          <fpage>623</fpage>
          –
          <lpage>632</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Handschuh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciravegna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>S-CREAM: semi-automatic creation of metadata</article-title>
          .
          <source>Proc. of the European Conference on Knowledge Acquisition and Management</source>
          (
          <year>2002</year>
          ) Springer Verlag (submitted version).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Vargas-Vera</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lanzoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stutt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciravegna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>MnM: Ontology driven semi-automatic and automatic support for semantic markup</article-title>
          .
          <source>In: EKAW</source>
          . (
          <year>2002</year>
          )
          <fpage>379</fpage>
          –
          <lpage>391</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ciravegna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dingli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilks</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petrelli</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Adaptive information extraction for document annotation in Amilcare</article-title>
          .
          <source>In: SIGIR '02: Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , New York, NY, USA, ACM Press (
          <year>2002</year>
          )
          <fpage>451</fpage>
          –
          <lpage>451</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ciravegna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chapman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dingli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilks</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning to harvest information for the semantic web</article-title>
          .
          <source>In: ESWS</source>
          . (
          <year>2004</year>
          )
          <fpage>312</fpage>
          –
          <lpage>326</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ladwig</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Gimme' the context: context-driven automatic semantic annotation with C-PANKOW</article-title>
          .
          <source>In: WWW '05: Proceedings of the 14th international conference on World Wide Web</source>
          , New York, NY, USA, ACM Press (
          <year>2005</year>
          )
          <fpage>332</fpage>
          –
          <lpage>341</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>