<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Use of Ontologies in Wrapper Induction</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information and Knowledge Engineering</institution>
          ,
          <addr-line>University of Economics, Prague, Winston Churchill Sq. 4, 130 67, Prague 3, Czech Republic</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <fpage>132</fpage>
      <lpage>135</lpage>
      <abstract>
        <p>The purpose of this paper is to introduce an extension of ontologies so that they can be used for automated information extraction from web documents. The major part of it is dedicated to proposing and deriving an inference model for evaluating pattern matches and their combination. We then propose a simple, naïve method of wrapper induction that is able to use the results of the first part.</p>
      </abstract>
      <kwd-group>
        <kwd>ontology</kwd>
        <kwd>automatic annotation</kwd>
        <kwd>information extraction</kwd>
        <kwd>wrapper</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>patterns, both atomic and composite. The composite patterns can be hierarchically
combined, which can matter considerably in some specific situations.
As it makes sense to assign only one pattern per datatype property, multiple patterns
have to be joined by including them in a composite pattern. Every part of a
document that matches a given pattern, i.e. on which the pattern rule evaluates positively,
will be considered a suspected partial candidate for the occurrence of the
value of the datatype property the pattern is assigned to. If the given pattern is the one
assigned directly to the datatype property, every matching part of the document
will be considered a suspected candidate for the occurrence of the value.</p>
      <sec id="sec-1-1">
        <title>Atomic Patterns</title>
        <p>When evaluating the match of atomic patterns we encounter the problem of deriving
the certainty degree of marking the candidate. We regard a pattern as an algorithm that
can tell, for every place in the document, to what extent the rule is satisfied depending
on its parameters. Here we have two distinct terms: the degree of pattern match, which
represents the certainty with which the pattern’s algorithm marked the given place in
the document, and the certainty degree of marking the partial candidate for the value
of a certain datatype property, which represents the certainty that the given place in
the document really is an occurrence of the value, given solely by the observation of
the single pattern and independently of any other patterns. We will refer to marking
the partial candidate as the pattern evidence; the second term is therefore
equivalent to the degree of pattern evidence.</p>
        <p>The degree of pattern match and the degree of pattern evidence should intuitively
correspond. If we denote the pattern match as A and the pattern evidence as E we can
write down this inference rule:</p>
        <p>A → E (1)</p>
        <p>For our purposes we have chosen a fuzzy logic inference model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], but it should be
possible to use any other. In fuzzy logic we can define A and E as the propositions
          <list list-type="bullet">
            <list-item><p>A: “The pattern has marked the given place in the document.”</p></list-item>
            <list-item><p>E: “The marked place really is a pattern evidence.”</p></list-item>
          </list>
and the corresponding degrees as their truth values (i.e. the degree of pattern match a = val(A)
and the degree of pattern evidence e = val(E)). We also introduce two universal
parameters for every pattern, namely precision and cover, and define them as follows:</p>
        <p>p = val(A → E), the pattern precision (2)</p>
        <p>c = val(E → A), the pattern cover (3)</p>
        <p>Using these parameters we can derive, in Łukasiewicz fuzzy logic, the following form of
the inference rule (the detailed derivation is available in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]):
        </p>
        <p>((A &amp; (A → E)) ∨ ¬(E → A)) ⇒ E (4)</p>
        <p>If we take into account the definitions of p and c and do not overestimate the degree
of pattern evidence e, we get:</p>
        <p>e = max(a + p − 1, 1 − c) (5)</p>
        <p>The parameter p sets an upper limit on the degree of evidence contributed by the pattern
match: since a ≤ 1, the first term never exceeds p. The pattern precision thus expresses the certainty with which a high
degree of pattern match leads to a high degree of pattern evidence. The parameter c
sets the lowest possible value of the degree of pattern evidence, namely 1 − c.</p>
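        <p>As an illustration, equation (5) can be evaluated in a few lines of Python. This is only a minimal sketch and not the implementation described in the cited thesis; the function names lukasiewicz_conjunction and evidence_degree are hypothetical, and the sketch assumes that a, p and c are supplied as numbers in the interval [0, 1].</p>
        <preformat><![CDATA[
```python
def lukasiewicz_conjunction(x: float, y: float) -> float:
    """Strong conjunction of Lukasiewicz logic: max(0, x + y - 1)."""
    return max(0.0, x + y - 1.0)


def evidence_degree(a: float, p: float, c: float) -> float:
    """Degree of pattern evidence e = max(a + p - 1, 1 - c), equation (5).

    a -- degree of pattern match, val(A)
    p -- pattern precision, val(A -> E)
    c -- pattern cover, val(E -> A)
    """
    for name, value in (("a", a), ("p", p), ("c", c)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # Because 1 - c is never negative, clipping the first term at 0
    # does not change the maximum required by equation (5).
    return max(lukasiewicz_conjunction(a, p), 1.0 - c)


# Hypothetical parameter values, chosen only for illustration.
strong_match = evidence_degree(a=0.9, p=0.8, c=0.9)
no_match = evidence_degree(a=0.0, p=0.8, c=0.9)
```
]]></preformat>
        <p>With these illustrative values the strong match is governed by the first term of equation (5), while the place with no match still receives the floor 1 − c given by the cover.</p>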
      </sec>
      <sec id="sec-1-2">
        <title>Composite patterns</title>
        <p>Combining the evidences of multiple patterns is a task that takes the form
of a set operation. Only two set operations come into play here, namely
union and intersection, which in combination with the complement operation can
express any other set operation. Determining the degree of composite
pattern evidence is then straightforward. The degree of composite pattern match
is obtained by combining the degrees of evidence of the partial
patterns with the appropriate logical operation; thus there will be conjoint patterns and
disjoint patterns (and possibly negating patterns). From the pattern match defined in
this way we obtain the composite pattern evidence in the same way as for
atomic patterns.</p>
        <p>It is interesting to discuss the meaning of the precision and cover parameters of the partial
patterns with respect to the different kinds of composite patterns. In the case
of a disjoint composite pattern, a high value of p implies that the partial pattern is a
sufficient condition, and in the case of a conjoint composite pattern, a high value of c
means that the partial pattern forms a necessary condition.</p>
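        <p>The combination of partial evidences can be sketched as follows. The text does not fix which fuzzy operations realize intersection, union and complement, so this sketch assumes the common choice of minimum, maximum and 1 − x; the function names and all parameter values are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import Sequence


def evidence_degree(a: float, p: float, c: float) -> float:
    """Equation (5): e = max(a + p - 1, 1 - c)."""
    return max(a + p - 1.0, 1.0 - c)


def conjoint_match(partial_evidences: Sequence[float]) -> float:
    """Match degree of a conjoint pattern (intersection, assumed to be min)."""
    return min(partial_evidences)


def disjoint_match(partial_evidences: Sequence[float]) -> float:
    """Match degree of a disjoint pattern (union, assumed to be max)."""
    return max(partial_evidences)


def negating_match(partial_evidence: float) -> float:
    """Match degree of a negating pattern (complement, assumed to be 1 - x)."""
    return 1.0 - partial_evidence


# Evidence degrees of two partial patterns at the same place in the document.
partials = [
    evidence_degree(a=0.9, p=0.9, c=0.5),
    evidence_degree(a=0.4, p=1.0, c=0.8),
]

# The composite match is fed into equation (5) again, this time with the
# precision and cover of the composite pattern itself.
composite_evidence = evidence_degree(a=disjoint_match(partials), p=0.9, c=1.0)
```
]]></preformat>
        <p>Because the composite match is itself passed through equation (5), composite patterns can be nested to any depth with the same two functions.</p>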
      </sec>
      <sec id="sec-1-3">
        <title>Designing the patterns</title>
        <p>
          We will denote the patterns in the extraction ontology as XML elements from a
special namespace, nested in the elements of the datatype properties. The patterns
can range from ones that recognize time values or named entities to ones that
evaluate the context or format of the document. A few basic patterns are proposed in
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]; however, many others are possible. When designing a new pattern, one needs to
keep in mind how it is evaluated and to consider carefully how it can be
combined with other existing patterns.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>A simple wrapper induction method</title>
      <p>By applying the pattern rules to the content of a document we get a set of
evidences along with their certainty degrees for every datatype property in the
extraction ontology that has a pattern assigned. If we rely on a tabular structure of the data, we
can try to separate the evidences into a few segments according to the similarity of
their XPath expressions. We can purge the sets of evidences if we realize that the precision
attribute specifies the mean ratio of evidences that are marked correctly by the
pattern. Therefore up to 1 − p of the evidences supplied by this pattern can be false, and
hence we can remove that share of the worst segments. If the data are stored in a
tabular structure, the relevant parts of the text are generally contained in the same
structure of elements, which does not change throughout the segment. At the level of
the XPath expression this shows up as a single changing index in the absolute path; by
omitting it we get a set of elements that would ideally all contain the value of the
respective property.</p>
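      <p>The segmentation and pruning steps might be sketched as below. This is a simplified illustration under several assumptions that the text leaves open: absolute XPaths carry an explicit index on every step, a segment is any group of at least two evidences whose paths differ in one index, and a segment counts as worse the lower its summed certainty is. The names generalize, segment and prune and the sample paths are hypothetical.</p>
      <preformat><![CDATA[
```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Evidence = Tuple[str, float]  # (absolute XPath, certainty degree)

# One location step with a positional index, e.g. "tr[3]".
STEP = re.compile(r"^(?P<name>[^\[\]]+)\[(?P<index>\d+)\]$")


def generalize(xpath: str, position: int) -> Optional[str]:
    """Omit the positional index of the step at `position`."""
    steps = xpath.strip("/").split("/")
    match = STEP.match(steps[position])
    if match is None:
        return None  # this step carries no index that could be omitted
    steps[position] = match.group("name")
    return "/" + "/".join(steps)


def segment(evidences: List[Evidence]) -> Dict[str, List[Evidence]]:
    """Group evidences whose absolute XPaths differ in a single index."""
    groups: Dict[str, List[Evidence]] = defaultdict(list)
    for xpath, certainty in evidences:
        n_steps = len(xpath.strip("/").split("/"))
        for position in range(n_steps):
            key = generalize(xpath, position)
            if key is not None:
                groups[key].append((xpath, certainty))
    # A group with one member shows no changing index, so it is no segment.
    return {key: members for key, members in groups.items() if len(members) > 1}


def prune(segments: Dict[str, List[Evidence]],
          precision: float) -> Dict[str, List[Evidence]]:
    """Drop the worst segments, removing at most a (1 - p) share of evidences."""
    total = sum(len(members) for members in segments.values())
    budget = (1.0 - precision) * total
    # Assumption: lower summed certainty means a worse segment.
    ranked = sorted(segments.items(),
                    key=lambda item: sum(certainty for _, certainty in item[1]))
    kept = dict(segments)
    removed = 0
    for key, members in ranked:
        if removed + len(members) > budget:
            break
        removed += len(members)
        del kept[key]
    return kept


# Hypothetical evidences of one pattern: three table cells and two paragraphs.
evidences = [
    ("/html[1]/body[1]/table[1]/tr[1]/td[2]", 0.9),
    ("/html[1]/body[1]/table[1]/tr[2]/td[2]", 0.8),
    ("/html[1]/body[1]/table[1]/tr[3]/td[2]", 0.7),
    ("/html[1]/body[1]/div[1]/p[1]", 0.3),
    ("/html[1]/body[1]/div[1]/p[2]", 0.2),
]
segments = segment(evidences)
kept = prune(segments, precision=0.5)
```
]]></preformat>
      <p>On this sample the table cells form one segment and the paragraphs another, and pruning with p = 0.5 removes the weaker paragraph segment. An evidence may fall into several candidate segments in this sketch; resolving such overlaps is left out.</p>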
      <p>The cover parameter is the mean ratio of the evidences that the pattern identifies to the
total real number of occurrences of the respective property. Since the generalized
XPath expression should identify all possible occurrences of the property, we can
calculate the proportion of evidences of the pattern to this “complete” set; its
difference from the parameter c represents the error caused by generalizing the paths.
Based on the number of evidences in the segments and the respective absolute XPath expressions, we
can match the corresponding segments of different properties to one another and form the instances
of the extracted class.</p>
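      <p>The error estimate described above amounts to a one-line calculation, sketched here. The function name generalization_error is hypothetical, and taking the absolute value of the difference is an assumption; the text only states that the difference from c represents the error.</p>
      <preformat><![CDATA[
```python
def generalization_error(n_evidences: int, n_generalized: int, cover: float) -> float:
    """Difference between the observed evidence ratio and the cover c.

    n_evidences   -- elements of the segment marked by the pattern
    n_generalized -- elements selected by the generalized XPath expression
    cover         -- pattern cover c, the expected share of marked occurrences
    """
    if n_generalized <= 0:
        raise ValueError("the generalized XPath must select at least one element")
    observed_ratio = n_evidences / n_generalized
    # Assumption: the error is reported as an absolute difference.
    return abs(observed_ratio - cover)


# Hypothetical counts: the pattern marked 8 of the 10 elements selected
# by the generalized path, while its cover is set to 0.9.
error = generalization_error(n_evidences=8, n_generalized=10, cover=0.9)
```
]]></preformat>
      <p>A small value suggests that the generalized path selects roughly the set of occurrences the pattern was expected to cover; a large value hints that the path was generalized too far or not far enough.</p>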
      <p>To sum up, this approach is just a simple method and has many limitations. Besides
the fact that it can extract only properties with cardinality 1 (the tabular structure), it
is also limited in its tolerance to irregularities in the structure of the document; on
the other hand, it is rather resistant to irregularities in the extracted values.</p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion and future work</title>
      <p>
        The proposed pattern notation allows partial patterns to be combined hierarchically
and is open to the design of additional patterns according to
one’s needs. A similar approach is taken by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; unlike them, however, we do not
design proprietary ontology formats but try to start from the OWL standard.
      </p>
      <p>The limitation of the proposed wrapper induction method is that it relies on
a tabular structure of the extracted data; on the other hand, the extraction is completely automatic and,
with a proper setting of the attributes, allows the extraction error to be estimated.</p>
      <p>Proposing a way to learn the patterns, or at least their parameters, automatically
could be an interesting subject of future work.</p>
      <sec id="sec-3-1">
        <title>Acknowledgement</title>
        <p>The research leading to this paper was supported by the European Commission under
contract FP6-027026, Knowledge Space of semantic inference for automatic
annotation and retrieval of multimedia content, K-Space.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anton</surname>
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>XPath-Wrapper Induction by generalizing tree traversal patterns</article-title>
          , in:
          <string-name>
            <surname>Antoniou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Harmelen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <source>A Semantic Web Primer</source>
          , Cambridge, MA: MIT Press,
          <year>2004</year>
          , ISBN 0-262-01210-3
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hájek</surname>
            <given-names>P.</given-names>
          </string-name>
          : Metamathematics of fuzzy logic, Dordrecht: Kluwer,
          <year>1998</year>
          , ISBN 0-792-35238-6
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kushmerick</surname>
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Wrapper induction for information extraction</article-title>
          ,
          <source>PhD thesis</source>
          , University of Washington,
          <year>1997</year>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Labský</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Svátek</surname>
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>On the Design and Exploitation of Presentation Ontologies for Information Extraction</article-title>
          ,
          <source>ESWC'06 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation</source>
          , Budva, Montenegro,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Muslea</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knoblock</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A Hierarchical Approach to Wrapper Induction</article-title>
          , 3rd Conference on Autonomous Agents,
          <year>1999</year>
          , http://www.isi.edu/~muslea/papers.html
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Nekvasil</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <article-title>Využití ontologií při indukci wrapperů, diplomová práce</article-title>
          , VŠE, Praha 2006
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>