<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Web Information Extraction with Lixto</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robert Baumgartner</string-name>
          <email>baumgart@dbai.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Flesca</string-name>
          <email>flesca@deis.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georg Gottlob</string-name>
          <email>gottlob@dbai.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DBAI, TU Wien</institution>
          <addr-line>Favoritenstr. 9, 1040 Vienna</addr-line>
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DEIS, Università della Calabria</institution>
          <addr-line>Via Pietro Bucci 41C-42C, 87030 Rende (CS)</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present new techniques for supervised wrapper generation and automated web information extraction, and a system called Lixto implementing these techniques [6]. Our system can generate wrappers which translate relevant pieces of HTML pages into XML. Lixto, of which a working prototype has been implemented, assists the user to semi-automatically create wrapper programs by providing a fully visual and interactive user interface. In this convenient user interface very expressive extraction programs can be created. Internally, this functionality is reflected by the new logic-based declarative language Elog. Users never have to deal with Elog and even familiarity with HTML is not required. Lixto can be used to create an "XML-Companion" for an HTML web page with changing content, containing the continually updated XML translation of the relevant information.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION AND MOTIVATION</title>
      <p>
        Nowadays web content is mainly formatted in HTML.
This is not expected to change soon, even if more flexible
languages such as XML are attracting a lot of attention.
While both HTML and XML are languages for
representing semistructured data, the first is mainly
presentation-oriented and is not really suited for database applications.
XML, on the other hand, separates data structure from
layout and provides a much more suitable data representation
(cf. e.g. [
        <xref ref-type="bibr" rid="ref1 ref17">1, 17</xref>
        ]). A set of XML documents can be regarded
as a database and can be directly processed by a database
application or queried via one of the new query languages
for XML, such as XML-GL [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], XML-QL [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and XQuery
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. As the following example shows, the lack of
accessibility of HTML data for querying has dramatic consequences
on the time and cost spent to retrieve relevant information
from web pages.
      </p>
      <p>Imagine you would like to monitor interesting eBay offers
(www.ebay.com) of notebooks, where an interesting offer is,
for example, defined by an auction item which contains the
word "notebook", has a current value between GBP 1500 and
3000, and which has received at least three bids so far. The
eBay site does not offer the possibility to formulate such
complex queries. Similar sites do not even give restricted
query possibilities and leave you with a large number of
result records organised in a huge table split over many web
pages. You have to wade through all these records
manually, because there is no possibility to further restrict the result.
Another drawback is that you cannot directly collect
information from different auction sites (e.g. onetwosold and eBay
items together) into a single structured file, a difficult task
of web information integration due to the very different
presentation on each site.</p>
      <p>All new methods and algorithms presented in this paper
are covered by a pending patent. Future developments of
Lixto will be reported at www.lixto.com. This paper also
appeared in the proceedings of VLDB 2001.</p>
      <p>The solution is thus to use wrapper technology to extract
the relevant information from HTML documents and
translate it into XML which can be easily queried or further
processed. Based on a new method of identifying and extracting
relevant parts of HTML documents and translating them
to XML format, we designed and implemented the e cient
wrapper generation tool Lixto, which is particularly
wellsuited for building HTML/XML wrappers and introduces
new ideas and programming language concepts for wrapper
generation. Once a wrapper is built, it can be applied
automatically to continually extract relevant information from a
permanently changing web page.</p>
      <p>
        The Lixto method and system fulfills the requirements
specified in a very recent paper on e-commerce tools [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]:
"These tools must be targeted at typical, non-technical
content managers. In order to be usable, the tools must be
graphical and interactive, so that content managers see data
as it is being mapped." Lixto's distinctive features are
summarised in the following. Lixto is easy to learn and use
because a fully visual and interactive user interface is provided.
Neither manual fine-tuning nor knowledge of the internal
language is necessary. Lixto uses straightforward region
marking and selection procedures that allow even those users
not familiar with HTML to work with the wrapper
generator. Lixto lets a wrapper designer work directly and solely
on browser-displayed example pages, unlike other tools (see
Section 6) that force the designer to work with other
document views such as, e.g., table-views of the document,
displayed HTML parse trees, or even HTML sources.
After selecting example targets in the browser display, Lixto
responds with highlighted targets in the same display (see
Section 3). With Lixto, very expressive visual wrapper
generation is possible: It allows for extraction of target patterns
based on surrounding landmarks, on the contents itself, on
HTML attributes, on the order of appearance and on
semantic and syntactic concepts. Extraction is not limited to
tokens of some document object model, but is also possible
from flat strings. Multiple and single targets are treated in
a uniform way. Lixto even allows for more advanced features
such as disjunctive pattern definitions, crawling to other
pages during extraction, and recursive wrapping. Moreover, the
extracted data structures do not have to strictly obey the
input HTML structure. Preliminary results on
representative web pages using the current Lixto prototype show
good performance (see Section 5).
      </p>
      <p>The above mentioned features are internally reflected by
a declarative extraction language called Elog (see Section 4),
which uses a datalog-like logical syntax and semantics. Elog
is invisible to the user. It is ideally suited for representing
and successively incrementing the knowledge about patterns
described by users. This knowledge is generated in an
interactive process consisting of successive narrowing (logical
and) and broadening (logical or) steps. An Elog program
is a collection of datalog-like rules containing special
extraction conditions in their bodies. Elog is flexible, intuitive and
easily extensible.</p>
      <p>This paper is structured as follows. In the next section
the system architecture is described, in Section 3 we give an
overview of the interactive pattern generation and visual
UI, whereas Section 4 is devoted to the theory of the Elog
extraction language. Section 5 presents empirical results
of using the Lixto wrapper generator, Section 6 discusses
related approaches and Section 7 highlights future research
directions.</p>
    </sec>
    <sec id="sec-2">
      <title>ARCHITECTURE / IMPLEMENTATION</title>
      <p>
        A working prototype of Lixto has already been
implemented in Java using Swing, OroMatcher [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] and JDOM
[
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The Lixto Toolkit (Figure 1) consists of the following
modules:
      </p>
      <p>The Interactive Pattern Builder provides the visual UI
that allows a user to specify the desired extraction patterns
and the basic algorithm for creating a corresponding Elog
wrapper as output.</p>
      <p>The Extractor is the Elog program interpreter that
performs the actual extraction based on a given Elog program.
The extractor, provided with an HTML document and a
previously constructed program, generates as its output a
pattern instance base, a data structure encoding the extracted
instances as hierarchically ordered trees and strings. One
program as input of the extractor can be used for
continual extraction on changing pages, or to extract from several
HTML pages of similar structure.</p>
      <p>With the controller of the XML Generator, the user chooses
how to map extracted information to XML. Its transformer
module performs the actual translation from the extracted
pattern instance base to XML.</p>
    </sec>
    <sec id="sec-3">
      <title>WRAPPER GENERATION</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 Creating Wrappers</title>
      <p>A Lixto wrapper is created interactively by creating
patterns in a hierarchical order. For example, one can first
define a pattern &lt;item&gt; and then define a subpattern &lt;price&gt;.
The subpattern relationship in this case expresses that each
extracted instance of &lt;price&gt; must occur within one
instance of &lt;item&gt;. Pattern names act as default XML
element names. Each pattern characterises one kind of
information. The set of extracted instances of a pattern, which
are either HTML elements, lists of elements, or strings,
depends on the current page. Each pattern is defined by one
or more filters. A filter e.g. allows the system to identify a
set of similar nodes of the HTML parse tree, for instance a
set of items internally represented as &lt;td&gt;.</p>
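      <p>The pattern/filter semantics just described (a pattern consists of one or more filters; an instance is extracted if it satisfies every condition of at least one filter) can be sketched in a few lines of Python for concreteness. This is a minimal sketch, not Lixto's actual implementation (the prototype is written in Java), and the node encoding and helper names are illustrative.</p>

```python
# Sketch of the pattern/filter semantics: a pattern consists of filters,
# each filter of conditions; an instance is extracted if it satisfies
# all conditions of at least one filter (filters are disjunctive).
# The node dictionaries and condition helpers are illustrative only.

def matches_pattern(instance, filters):
    """filters: list of filters, each a list of condition functions."""
    return any(all(cond(instance) for cond in flt) for flt in filters)

# Example: extract <td> nodes whose text mentions a currency.
is_td = lambda node: node["tag"] == "td"
has_eur = lambda node: "EUR" in node["text"]
has_gbp = lambda node: "GBP" in node["text"]

filters = [
    [is_td, has_eur],   # filter 1: a td containing "EUR"
    [is_td, has_gbp],   # filter 2: a td containing "GBP"
]

nodes = [
    {"tag": "td", "text": "EUR 1500"},
    {"tag": "td", "text": "3 bids"},
    {"tag": "tr", "text": "EUR 9"},
    {"tag": "td", "text": "GBP 2000"},
]
extracted = [n for n in nodes if matches_pattern(n, filters)]
```

Adding a filter can only widen the extracted set, while adding a condition to a filter can only narrow it, which mirrors the narrowing/broadening interaction loop described above.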
      <p>A filter is created as follows: First, the user highlights
with the mouse a representative instance of the desired
target pattern directly on the example page. Internally, the
system associates to this instance a generalised tree path in
the HTML parse tree identifying similar instances and
incorporates this as the main goal of an Elog rule representing the filter
(see Section 4). Second, the user adds restrictive conditions
to the filter. These are reflected by the system as additional
goals in the rule body describing this filter. The possible
conditions, which will be explained in more detail, include:
(a) before/after conditions that express that the target
pattern instance must appear before or after some specific
element. (b) notbefore/notafter conditions that express
that some specific element must not be close to the target
pattern. (c) internal conditions that express that some
specific element must (not) appear inside the target pattern.
(d) range conditions which, in case of multiple matchings,
restrict the set of matched instances to a subinterval.</p>
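      <p>Condition (d) is the simplest to make concrete: a range condition restricts an already-computed match list to a subinterval. A minimal sketch, assuming 1-based inclusive bounds as in the paper's later "[3,7]" example:</p>

```python
# Minimal sketch of a range condition: keep only the a-th through b-th
# matched instance (1-based, inclusive) within each parent instance.
# Illustrative only; not Lixto's internal representation.

def apply_range(matches, a, b):
    return matches[a - 1:b]

rows = ["row1", "row2", "row3", "row4", "row5"]
restricted = apply_range(rows, 3, 5)   # third up to fifth match
```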
      <p>
        Adding a filter to a pattern extends the set of extracted
targets, whereas imposing a condition on a filter restricts the
set of targets. Alternately imposing conditions and adding
new filters can perfectly characterise the desired
information. The system creates Elog rules based on user-defined
filters. The user is never concerned with the internal
language Elog. The user interface is extremely simple and the
entire wrapper construction process can be learned by an
average user in very short time. The user is guided through
a supervised pattern generation, and by simply marking
relevant information items on-screen and visually setting
constraints, filters and patterns are created. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we describe
an example program construction.
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Pattern Creation Algorithm</title>
      <p>The generation of a pattern is described in Fig. 3. The
user can hierarchically define and refine patterns. She
enters a pattern name, specifies the parent pattern S, selects
by mouse clicks one example instance s of the parent
pattern and marks (with the mouse) an element (e.g. one price)
inside this instance on the sample page.</p>
      <p>At the beginning, i.e. when facing a new HTML
document (which is loaded into an internal browser; see Fig.
2) and having created a new program, the only pattern is
&lt;document&gt; with a unique instance, the current example
document. Fig. 3 distinguishes interactive (I) and automatic
(A) steps and gives the logical pattern structure in its
top-left corner. A pattern may consist of multiple filters. Each
filter contains a number of conditions. An extracted instance
must satisfy all conditions of at least one filter. Two
consecutive mouse clicks on different parts of the current parent
instance are interpreted in the best possible way to mark an
HTML element of the document parse tree (cf. Fig. 7) or, if
that is not possible, a list of elements.</p>
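      <p>The click-interpretation step can be sketched as follows: given the character offsets of two clicks, pick the smallest parse-tree element whose span covers both. This is a sketch under an assumed (name, start, end) node encoding; the paper does not specify the actual algorithm beyond "best possible way".</p>

```python
# Sketch: two consecutive mouse clicks at character offsets are mapped
# to the smallest parse-tree element whose [start, end) span covers
# both clicks, i.e. the deepest enclosing element. Illustrative only.

def smallest_enclosing(nodes, click1, click2):
    lo, hi = min(click1, click2), max(click1, click2)
    covering = [n for n in nodes if n["start"] <= lo and hi < n["end"]]
    # the smallest span is the deepest enclosing element
    return min(covering, key=lambda n: n["end"] - n["start"])

nodes = [
    {"name": "body",  "start": 0,  "end": 300},
    {"name": "table", "start": 50, "end": 200},
    {"name": "tr",    "start": 60, "end": 120},
    {"name": "td",    "start": 60, "end": 90},
]
```

For instance, clicks at offsets 65 and 110 fall outside the td but inside the tr, so the tr is selected.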
      <p>The system generates a basic filter without conditions,
but the user can already state some attribute requirements
(the system constructs a suitable element path definition,
see Section 4). Then it highlights all objects on the current
example page that match these initial filter criteria (not only
in the current pattern instance, but in all pattern instances).
Sometimes a user wants a single match within one source,
sometimes multiple matches; this makes no difference in
the algorithm, it just depends on the definition of filters
and conditions. E.g., if the user marks a table row, the
system recognises the entity &lt;tr&gt; and highlights all table
rows occurring at a comparable level in the document. At
the same time the system constructs a general Elog rule for
extracting table rows.</p>
      <p>If the user is satisfied with the elements identified by the
system, she can confirm the pattern definition.
Satisfaction, in this context, means that only desired targets are
matched. Otherwise, if the concept is too general, then she
can add restricting conditions (which are reflected by Elog
condition predicates); cf. Fig. 4. For each such restriction,
the system adds the corresponding condition atom to the
Elog rule defining the filter at hand. Each filter is intended
to extract a subset of the desired target set. If the current
pattern is less general than intended by the user, another
filter can be added, internally reflected by an additional Elog
rule for the same pattern (several rules for the same pattern
are interpreted disjunctively, as usual in Datalog). Different
filters may be created based on labelling in different
example parent pattern instances. By iterating restricting and
generalising steps, it is usually possible to describe a desired
pattern perfectly. Once a pattern is defined, the user may
use this pattern as parent for a new pattern. A detailed
creation of a simple example wrapper is given in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.3 The Visual Interface</title>
      <p>The current implementation includes visual tree pattern
construction and the use of string patterns. All filter
conditions discussed in this paper are supported. Moreover,
the visual interface is assisted by an XML visualisation tool
which at each instant shows the user the XML extracted so
far. A concept atom generator to create predefined
concepts (such as "isCity", "isDate") based both on regular
expressions and on reading some database tables is currently
being added. Such concepts are especially useful to allow
users to create string patterns without knowledge of regular
expressions. Fig. 5 shows the main menu of Lixto (left-hand
side). There, a new program can be created or an existing
one loaded, new patterns can be added, the document for
labelling can be chosen, etc. The same figure shows on its
right-hand side the source selection dialogue which enables
the user to select at which node to create a new pattern. Fig.
2 shows the internal Lixto browser when selecting an after
element. For each condition, a dedicated interface is provided
which uses the user-labelled information.</p>
    </sec>
    <sec id="sec-7">
      <title>3.4 Translation into XML</title>
      <p>The output of the extractor is well-suited for translation
into XML. The interactive XML generator exploits the
hierarchical structure of the pattern instance base and uses
pattern names as default XML element names. The user
can interactively choose the HTML attributes that appear
in the XML output. Even more important is the possibility
to decide which patterns are written to XML, possibly using
auxiliary patterns. Fig. 9 displays the result of applying a
(not illustrated) wrapper program onto the web page of Fig.
6.</p>
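      <p>The default mapping (pattern names become element names, the instance hierarchy becomes element nesting) can be sketched as follows. The data layout and item/price names are illustrative assumptions, not the system's actual internal format.</p>

```python
# Sketch of the XML Generator's default mapping: the hierarchically
# ordered pattern instance base becomes nested XML elements, with
# pattern names as element names. Encoding is illustrative only.

def to_xml(name, instance):
    """instance: {'text': str} leaf, or {'children': [(name, inst), ...]}."""
    if "text" in instance:
        return "<%s>%s</%s>" % (name, instance["text"], name)
    inner = "".join(to_xml(n, i) for n, i in instance["children"])
    return "<%s>%s</%s>" % (name, inner, name)

base = ("item", {"children": [
    ("itemdes", {"text": "56K Modem"}),
    ("price", {"text": "EUR 20.00"}),
]})
```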
    </sec>
    <sec id="sec-8">
      <title>DATA EXTRACTION</title>
    </sec>
    <sec id="sec-9">
      <title>4.1 A First Glance at Elog</title>
      <p>Elog is the system-internal datalog-like rule-based
language specifically designed for hierarchical and modular data
extraction. A user of Lixto does not have to learn Elog and
never sees the Elog program. Elog rules are the
implementations of the visually defined filters and define elements to
be extracted from web pages. Before we discuss the features
of the language in detail, have a look at Fig. 8, in particular
at the rule with head predicate record(S, X). Observe that,
as in Prolog, we use the same variables across rules, and
denote with "_" a variable in whose instantiations we are
not interested. This predicate identifies records on an eBay
page (each one is a separate table). The first atom in the rule
body specifies that the context S of the extraction, i.e. the
so-called parent pattern, is an instance of &lt;tableseq&gt;. The
second atom in the rule body looks for subelements that
qualify as tables inside the unique &lt;tableseq&gt; instance and
instantiates X with them. Given that the same Elog
program can be applied to different web pages, the actual
elements that an Elog program defines and extracts depend on
the current web page. For this reason, we refer to the head
predicates defined by an Elog program as patterns.
Moreover, we denote a set of rules with the same head as a pattern,
too. The syntax and semantics of Elog and its predicates are
explained below (only informally due to space constraints).</p>
    </sec>
    <sec id="sec-10">
      <title>4.2 Document Model</title>
      <p>Consider the example web page lixto.html
(http://www.dbai.tuwien.ac.at/lixto.html) of Fig. 6
and its parse tree as displayed in Fig. 7, based on the Java
Swing parser. The values in brackets are the start and
end-offsets (in characters) of the corresponding elements in
the actual document. Additionally, we number nodes in a
depth-first left-to-right fashion. Nodes of the HTML tree
refer to elements which are represented as sets. The set
contains pairs describing the association between attribute
names and corresponding attribute values. E.g., the &lt;body&gt;
element node of Fig. 7 is associated with {(name, body),
(bgcolor, FFFFFF), (elementtext, Items for ... 137)} (the whole
document text). Fig. 7 highlights two other such attribute
sets.</p>
      <p>Observe that in our chosen document object model,
several leaf elements are &lt;content&gt; elements: this parser treats
tags such as &lt;b&gt; (bold-face) as attributes of an imaginary
&lt;content&gt; element. We introduced a special attribute called
"elementtext" for each element. This attribute reflects the
contents of the element, which is, in case of an internal node,
the left-to-right concatenation of the leaf elements below the
internal node. In the following, we distinguish tree regions
and subtrees of the HTML tree. A tree region is a region
rooted at an internal node of the HTML tree where only the
i-th up to the j-th child and their descendants are
considered. Observe that a tree region is contiguous. A subtree
is the tree rooted at one node of the HTML tree, i.e. all
descendants are considered.</p>
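      <p>The elementtext attribute can be sketched recursively: a content leaf contributes its own text, an internal node the left-to-right concatenation of the texts of all content leaves below it. The tree encoding below is an illustrative assumption, not the Swing parser's representation.</p>

```python
# Sketch of the "elementtext" attribute: for a content leaf it is the
# leaf's own text; for an internal node it is the left-to-right
# concatenation of all content leaves below it. Encoding illustrative.

def elementtext(node):
    if "text" in node:                       # content leaf
        return node["text"]
    return "".join(elementtext(c) for c in node.get("children", []))

tree = {"name": "body", "children": [
    {"name": "h4", "children": [{"text": "Items for sale"}]},
    {"name": "p", "children": [{"text": "56K Modem"}, {"text": " EUR 20"}]},
]}
```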
    </sec>
    <sec id="sec-11">
      <title>4.3 Extraction Mechanisms</title>
      <p>Lixto offers two basic mechanisms of data extraction:
tree and string extraction. For tree extraction, we
identify elements with their corresponding tree paths and
possibly some properties of the elements themselves. This does
not necessarily identify a single element. As an example,
*.table.*.tr is a valid tree path. In the sample page of
Fig. 6, three elements are matched. The star acts as
wildcard. The expression *.x matches all paths to x which
contain x as last element only. A plain tree path is a sequence
of consecutive nodes in a subtree of an HTML tree. In an
incompletely specified tree path, stars may be used instead
of element names. For simplicity, incompletely specified tree
paths are referred to as tree paths. The semantics of a tree
path applied to a tree region of an HTML page is defined as
the set of matched elements.</p>
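      <p>Tree path matching can be sketched as glob-style matching of the pattern against the root-to-node sequence of element names, with "*" standing for an arbitrary (possibly empty) run of elements. This is a minimal sketch of the semantics just defined, not Lixto's implementation:</p>

```python
# Sketch of incompletely specified tree paths: "*" may stand for any
# (possibly empty) run of elements on the path from the root to the
# node. Matching semantics as defined in the text; code illustrative.

def match_tree_path(pattern, names):
    """pattern, names: lists of element names; '*' is the wildcard."""
    if not pattern:
        return not names
    if pattern[0] == "*":
        # '*' absorbs any (possibly empty) run of elements
        return any(match_tree_path(pattern[1:], names[i:])
                   for i in range(len(names) + 1))
    return bool(names) and names[0] == pattern[0] \
        and match_tree_path(pattern[1:], names[1:])

path = "*.table.*.tr".split(".")
```

For example, the path matches html.body.table.tr but not html.body.div.tr, and *.x matches exactly the paths ending in x.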
      <p>Attribute conditions are constraints reducing the
number of matched elements. They pose requirements on
occurring attributes and their values. An attribute condition
is a triple specifying a required name, a required value (a
string, or, in case the third parameter is regvar, a
regular expression possibly containing some variables indicated
by \var), and a special parameter exact, substr or regvar,
indicating that the attribute value is exactly the required
string, is a superstring of it, or matches the given regular
expression, respectively. Instead of giving a formal
definition, we illustrate this with an example: (*.hr, [(size, [34],
regvar), (width, %, substr)]) identifies horizontal rules of
size 3 or 4 with a width specified in percent. Each output
variable which is included in the second parameter must be
used as input for a concept of the same rule (cf. Section 4.4).</p>
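      <p>The three modes of an attribute condition can be sketched directly; the element encoding as an attribute dictionary is an illustrative assumption:</p>

```python
import re

# Sketch of an attribute condition (name, value, mode): mode "exact"
# requires equality, "substr" a substring, "regvar" a regular-expression
# match. An element is modelled as a dict of attribute name -> value.

def attr_condition_holds(element, name, value, mode):
    actual = element.get(name)
    if actual is None:
        return False
    if mode == "exact":
        return actual == value
    if mode == "substr":
        return value in actual
    if mode == "regvar":
        return re.fullmatch(value, actual) is not None
    raise ValueError("unknown mode: " + mode)

# the horizontal-rule example from the text: size 3 or 4, width in percent
hr = {"name": "hr", "size": "3", "width": "75%"}
```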
      <p>An element path definition epd consists of a tree path and
a set of attribute conditions. It is called simple if it consists
of one element name only. The semantics of applying an
element path definition to a tree region of an HTML tree is
given as the set of matched elements of the corresponding
tree path which moreover satisfy all of the attribute
conditions. Instead of element path definitions, equivalently,
XPath expressions can be used (with some extensions, such
as the possibility to express that an attribute value is a
concept). To simplify presentation, however, we stick to our
introduced notation.</p>
      <p>
        The second extraction method relies on strings. In the
HTML parse tree, strings are represented by the text of
content leaves. However, we associate a string to every node
of the parse tree available as the value of the attribute
elementtext. For instance when extracting access codes of the
phone-numbers of lixto.html, string extraction has to be
used. A substring of the elementtext of an HTML tree is
denoted as string source. One can express that a string source
must match a given regular expression. A string path de
nition spd is a regular expression possibly containing some
variables (variable Y indicated by nvar[Y ]) which appear in
some concept predicate of the corresponding rule. Regular
expressions are powerful tools for text processing and
matching. Refer to [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] for a Java regular expression library.
Extraction generates minimal non-overlapping substrings. The
nal two patterns of Fig. 8 give an example of string
extraction. An attribute path de nition apd helps to extract values
of attributes. It is simply a string (expressing the attribute
name).
4.4
      </p>
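      <p>String extraction can be sketched with a standard regular-expression scan: scanning yields non-overlapping matches, and non-greedy quantifiers keep individual matches minimal. The phone-number text below is made up for illustration; this is not the Java/OroMatcher code the prototype uses.</p>

```python
import re

# Sketch of string extraction with a string path definition (a regular
# expression): scanning the string source yields non-overlapping
# substrings. Sample text and patterns are illustrative only.

def extract_strings(source, spd):
    return [m.group(0) for m in re.finditer(spd, source)]

text = "Tel: 01/58801 as well as 0463/2700"
codes = extract_strings(text, r"\d+(?=/)")   # access codes before the "/"
```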
    </sec>
    <sec id="sec-12">
      <title>4.4 Language Definition</title>
      <p>Elog atoms correspond to special predicates with a
well-defined semantics. They operate on source objects (tree
regions and string sources), path definition objects and
numerical arguments, and obey binding conventions. In a
datalog-like language, the function mapping a given source S to
a set of elements matching an epd is treated as a relation
subelem(S, epd, X). subelem(s, epd, x) evaluates to true iff
s is a tree region, epd is an element path definition and x
is a tree region contained in s where the root of x matches
epd. Note that the tree path specified in a tree extraction
definition predicate is always relative to the parent-pattern
instance.</p>
      <p>Extraction definition predicates specify a set of
extraction instances. One of these is subelem. As far as string
extraction is concerned, the predicate subtext(S, spd, X) is
used. There, S is either a tree region or a string source, and
X a string source. Two more extraction definition predicates
are built-in. (1) subsq(S, epd, fpd, lpd, X): If s and x are tree
regions, epd is an element path definition, and fpd and lpd
are simple element path definitions, subsq(s, epd, fpd, lpd, x)
evaluates to true iff the root of x satisfies epd, its first child
satisfies fpd and its last one lpd. (2) subatt(S, apd, X): If
s is a tree region, x a string source and apd is an attribute
path definition of the root element of s, then subatt(s, apd, x)
evaluates to true iff x is the value of apd. subatt gives the
possibility to extract the values of attributes.</p>
      <p>Context condition predicates specify that some other
subtree or text must (not) appear before or after the
desired extraction target. For example, on a page with several
tables, the final table could be identified by an external
condition stating that no table appears after the desired table.</p>
      <p>Before predicates are explained here; after predicates work
analogously. (1) before(S, X, epd, b, e, Y, P): If s and x are
tree regions, then before(s, x, epd, b, e, y, p) evaluates to true
iff y is a subtree whose root node is matched by epd and the
end offset of y precedes the start offset of x within relative
distance p, where b ≤ p ≤ e. (2) notbefore(S, X, epd, d): If s
and x are tree regions, then notbefore(s, x, epd, d) evaluates
to true iff no element satisfying epd precedes x within
relative distance d. The same predicates are defined for string
extraction: There, S is an arbitrary source, X is required to
be a string source, spd is used instead of epd, and instead of
the root node simply the string itself is used. The percentage
distance values b and e define the tolerance interval where
the element is allowed to occur inside the current
parent-pattern instance. Additionally, a condition predicate may
contain new variables Y and P, which can be referred to by
other conditions. To express that an element occurs
anywhere within the parent instance and before the target (or a
condition output), the distance values are set to 0 and 100,
respectively.</p>
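      <p>The distance check of the before predicate can be sketched with character offsets: a candidate qualifies if it ends before the target starts, and the gap, expressed as a percentage of the parent instance's size, lies within [b, e]. The offset encoding is an illustrative assumption:</p>

```python
# Sketch of the before(S, X, epd, b, e, Y, P) distance semantics: an
# epd-matched element y qualifies if it ends before the target x starts,
# at a relative distance p (percent of the parent instance s) with
# b <= p <= e. Offsets are (start, end) pairs; illustrative only.

def before_holds(s, x, candidates, b, e):
    size = s[1] - s[0]
    for y in candidates:
        if y[1] <= x[0]:                          # y ends before x starts
            p = 100.0 * (x[0] - y[1]) / size      # gap as percent of parent
            if b <= p <= e:
                return True
    return False

parent = (0, 200)
target = (120, 160)
```

With b = 0 and e = 100 the candidate may occur anywhere before the target inside the parent instance, matching the convention stated above.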
      <p>Internal conditions predicates impose conditions on
the internal structure. Imagine, for instance, one wants to
extract all tables containing somewhere a word typeset in
tablesq(S; X)
record(S; X)
itemnum(S; X)
itemdes(S; X)
price(S; X)
bids(S; X)
currency(S; X)
pricewc(S; X)
document(\www:ebay:com="; S); subsq(S; (:body; []); (:table; []); (:table; []); X);
before(S; X; (:table; [(elementtext; item; substr]); 0; 0; ; ); after(S; X; :hr; 0; 0; ; )
tableseq( ; S); subelem(S; :table; X)
record( ; S); subelem(S; ?:td; X); notbefore(S; X; :td; 100)
record( ; S); subelem(S; (?:td: ? :content; [(a; ; substr)]; X)
record( ; S); subelem(S; (?:td; [(elementtext; nvar[Y]: ; regvar)]); X); isCurrency(Y)
record( ; S); subelem(S; ?:td; X); before(S; X; :td; 0; 30; Y; ); price( ; Y)
price( ; S); subtext(S; nvar[Y]; X); isCurrency(Y)
price( ; S); subtext(S; [0 9]+n:[0 9]+; X)
italics. This can be obtained by adding a contains
condition. contains(X; epd; Y ): contains(x; epd; y) evaluates to
true i x is a tree region (string source) containing a
subtree (string source) y where the root element of y matches
epd (where y matches spd). The rstsubtree condition is
a kind of \startswith" condition that states that the rst
subtree of a tree region should contain a particular element.</p>
      <p>firstsubtree(X, Y): firstsubtree(x, y) evaluates to true iff y
is the subtree rooted at the first child of the tree region x.
lastsubtree is defined analogously.</p>
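A minimal sketch of contains and firstsubtree over a toy tree (the `Node` class and tag-only matching are simplifying assumptions; real element path definitions also carry attribute conditions):

```python
class Node:
    """A toy HTML parse-tree node; an illustrative simplification."""
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def contains(x, tag):
    """True iff tree region x contains a subtree whose root matches
    tag (standing in for an element path definition epd)."""
    if x.tag == tag:
        return True
    return any(contains(c, tag) for c in x.children)

def firstsubtree(x, tag):
    """True iff the subtree rooted at the first child of x contains tag."""
    return bool(x.children) and contains(x.children[0], tag)

# a table whose first row holds an italics element in its first cell
table = Node("table", [Node("tr", [Node("td", [Node("i")]), Node("td")])])
assert contains(table, "i")      # the table contains an italics element
assert firstsubtree(table, "tr")
assert not contains(table, "b")
```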
      <p>Concept condition predicates are semantic concepts
like isCountry(X) or isCurrency(X) (see Fig. 8) or
syntactic ones like isDate(X) (or isDate(X, Y), where the output
Y returns a standard date format), stating that a string
X represents a country, a currency, or a date, respectively.</p>
      <p>
        Some predicates are built in to enrich the system; however,
further concepts can be interactively added with the assistance
of the Lixto concept editors. Syntactic predicates are
created as regular expressions, whereas semantic ones refer to
a database of ontologies (e.g. using ThoughtTreasure [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] or
Starlab Dogma [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]). Moreover, comparison conditions
such as &lt;(X, Y) allow comparing concepts such as two
standard-format dates.
      </p>
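A sketch of syntactic concept predicates as regular expressions, together with a comparison condition over standardised dates (the patterns `CURRENCY_RE` and `DATE_RE` are illustrative assumptions, not Lixto's concept database):

```python
import re
from datetime import date

# Illustrative syntactic concept predicates built from regular
# expressions; the Lixto concept editors and ontology lookups differ.
CURRENCY_RE = re.compile(r"^(?:\$|DM|Euro|USD)$")
DATE_RE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")

def isCurrency(s):
    return bool(CURRENCY_RE.match(s))

def isDate(s):
    """Like isDate(X, Y): returns a standard date for X, or None."""
    m = DATE_RE.match(s)
    return date(int(m.group(3)), int(m.group(2)), int(m.group(1))) if m else None

assert isCurrency("$") and isCurrency("Euro")
assert not isCurrency("Notebook")
# a comparison condition such as <(X, Y) over standardised dates
assert isDate("24/12/2001") < isDate("1/1/2002")
```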
      <p>Pattern predicates indicate that a source belongs to a
particular pattern and refers to a particular parent
pattern instance. They are used in the head of a rule, and in the rule body
for referring to a parent pattern and for further pattern
references. As an example, the &lt;price&gt; pattern can be
constructed by using the element path definition .*.td and
imposing the constraint that immediately before the target, an
instance of pattern &lt;item&gt; needs to occur:</p>
      <p>before(S, X, .*.td, 0, 1, Y, _), item(_, Y).</p>
      <p>
        Range conditions restrict the matched targets
depending on their order of appearance. To any rule, a range
condition such as "[3,7]" can be added, indicating that only the
third up to the seventh matched instance within each parent
instance are matched.
      </p>
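The effect of a range condition such as [3,7] can be sketched as a 1-based slice over the matched instances of a parent (an illustrative simplification):

```python
def apply_range(instances, a, b):
    """Keep only the a-th up to the b-th matched instance (1-based,
    inclusive) within a parent instance -- a sketch of Elog range
    conditions, not the actual implementation."""
    return instances[a - 1:b]

matched = ["i1", "i2", "i3", "i4", "i5", "i6", "i7", "i8"]
# the range condition [3,7] keeps the third up to the seventh instance
assert apply_range(matched, 3, 7) == ["i3", "i4", "i5", "i6", "i7"]
```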
    </sec>
    <sec id="sec-13">
      <title>Elog Extraction Programs</title>
      <p>A standard extraction rule looks as follows: New(S, X) ←
Par(_, S), Ex(S, X), Co(S, X, ...)[a, b], where S is the
parent instance variable, X is the pattern instance variable,
Ex(S, X) is an extraction definition atom, and the optional
Co(S, X) are further imposed conditions. A tree (string)
extraction rule uses a tree (string) extraction definition atom
and possibly some tree (string) conditions and general
conditions. The numbers a and b are optional and serve as
range parameters. New and Par are pattern predicates
defining the new pattern and referring to the parent pattern,
respectively. The above standard rule reflects the
principle of aggregation. In an extended environment, we
moreover allow specialisation rules such as: greentable(S, X) ←
table(S, X), contains(X, (.td, [color, green, exact]), _).
Additionally, an extended environment contains document filters,
using a getDocument(S, X) atom, where S is a string source
representing a URL and X is the web page the URL points
to. With such filters, one can crawl to further documents. If
document filters are used, each program has an initial filter
using the getDocument atom with user-specified input.</p>
      <p>The semantics of a rule is given as the set of matched
targets x: a substitution s, x for S and X evaluates New(s, x)
to true if all atoms of the body are true for this
substitution. Only those targets are extracted for which the head
of the rule evaluates to true. Moreover, if the extraction
definition predicate is a subsequence predicate, only minimal
rule outputs are matched (i.e. instances that do not contain
any other instances). Observe that range criteria are applied
only after non-minimal targets have been sorted out.</p>
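The minimisation of rule outputs can be sketched as follows (instances modelled as (start, end) offset pairs, an assumption made for illustration):

```python
def minimal(instances):
    """Keep only minimal rule outputs: instances that do not contain
    any other matched instance. Instances are modelled as (start, end)
    offset pairs -- an illustrative simplification of tree regions."""
    def contains_other(x):
        return any(y != x and x[0] <= y[0] and y[1] <= x[1]
                   for y in instances)
    return [x for x in instances if not contains_other(x)]

# (0, 10) contains (2, 5), so only the contained and the disjoint
# instances survive the minimality check
out = minimal([(0, 10), (2, 5), (12, 20)])
assert out == [(2, 5), (12, 20)]
# range criteria would be applied only to this minimised output
```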
      <p>A pattern is a set of extraction rules defining the same
head and referring to the same parent pattern. In the visual
pattern generation, the user first enters a pattern name and
the parent pattern to which it belongs. All rules
created inside the pattern use this information. We distinguish
tree and string patterns. To the first, only tree extraction
rules can be asserted; to the second, only string
extraction rules. The root pattern &lt;document&gt; is a special pattern
without filters. If document filters are used to crawl to further
web pages, document patterns are used as a third pattern type
(and an initial document filter is used). Parents of tree
patterns are either tree or document patterns, parents of string
patterns are tree or string patterns, and parents of document
patterns are string patterns. A pattern acts like a
disjunction of rule bodies: to be an extracted instance of a pattern,
a target needs to be in the solution set of at least one rule.
The pattern output additionally obeys a minimality
criterion. In patterns, even in those consisting of a single rule,
overlapping targets may occur.</p>
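The disjunctive pattern semantics can be sketched by taking the union of the rule outputs (rules modelled as plain functions returning sets of targets, an assumption for illustration):

```python
def pattern_instances(rules, parent_instance):
    """A pattern acts like a disjunction of rule bodies: a target is
    extracted iff at least one rule of the pattern matches it. Rules
    are modelled here as callables -- an illustrative sketch only."""
    targets = set()
    for rule in rules:
        targets |= rule(parent_instance)
    return targets

rule1 = lambda s: {"t1", "t2"}   # stands in for one extraction rule
rule2 = lambda s: {"t2", "t3"}   # a second rule with the same head
assert pattern_instances([rule1, rule2], "parent") == {"t1", "t2", "t3"}
```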
      <p>An extraction program P is a set of patterns. Elog
program evaluation differs from Datalog evaluation in the
following three aspects: built-in predicates, various kinds of
minimisation, and the use of range conditions. Moreover, the
atoms are not evaluated over an extensional database of
facts representing a web page, but directly over the parse
tree of the web page. Applying a program to an HTML
page creates a set of hierarchically ordered tree regions and
string sources (called a pattern instance base) by applying
all patterns of the program in their hierarchical order to
this HTML document (and possibly to further HTML
documents if document filters are used). Each pattern produces
a set of instances. Each pattern instance contains a reference
to its parent instance. As patterns are ordered in a strictly
hierarchical way, the program is hierarchically stratified. In
the final section we will relax the definition of patterns to
create recursive programs.</p>
      <p>[Sample XML output of the eBay wrapper:]
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;document&gt;
&lt;heading&gt;Items for Sale&lt;/heading&gt;
&lt;description&gt;3 items found for "Notebooks".
Showing Item 1 to 3.&lt;/description&gt;
&lt;entry&gt;
&lt;article&gt;56 K Modem PCMCIA Card for
Notebooks&lt;/article&gt;
&lt;price&gt;$ 20&lt;/price&gt;
&lt;person href="mailto:itsme@bestseller.org"&gt;
Angie&lt;/person&gt;
&lt;phone&gt;(01)-314 159&lt;/phone&gt;
&lt;picture/&gt;
&lt;/entry&gt;
[...]</p>
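The hierarchical program evaluation can be sketched as follows (the dictionary-based modelling of patterns, pages, and the pattern instance base is our own simplification, not the Lixto data structures):

```python
def evaluate_program(patterns, root):
    """Apply patterns in their hierarchical order; each extracted
    instance keeps a reference to its parent instance, forming the
    pattern instance base. `patterns` maps a pattern name to a pair
    (parent_name, extract_fn) -- an illustrative modelling only."""
    base = {"document": [(root, None)]}
    for name, (parent_name, extract) in patterns.items():
        base[name] = [(child, parent)
                      for parent, _ in base[parent_name]
                      for child in extract(parent)]
    return base

# toy patterns: <record> extracts the item tables of a page,
# <price> extracts the price inside each record
patterns = {
    "record": ("document", lambda page: page["tables"]),
    "price": ("record", lambda table: [table["price"]]),
}
page = {"tables": [{"price": "$ 20"}, {"price": "$ 35"}]}
base = evaluate_program(patterns, page)
assert [i for i, parent in base["price"]] == ["$ 20", "$ 35"]
```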
      <p>As an example program, consider a wrapper for eBay pages
(Fig. 8). On eBay pages, every offered item is stored in
its own table, extracted by &lt;record&gt;; further patterns are
all defined within such a record. The pattern &lt;price&gt; uses
a concept attribute, namely isCurrency, which matches
strings like $, DM, Euro, etc. The &lt;bids&gt; pattern uses a
reference to the &lt;price&gt; pattern. The final two patterns
are string patterns.</p>
    </sec>
    <sec id="sec-14">
      <title>TESTING THE LIXTO TOOL</title>
      <p>We chose twelve example sites (Table 1), some of which
were already used for testing purposes by other wrapper
generators. Several users, not all of whom are familiar with the
details of HTML, contributed to our test results. Initially, we
asked them to create a wrapper based on a single example
page. Table 2 summarises answers to the following
questions: (1) Is it possible to wrap this page with Lixto? (2)
How "complex" is the constructed program for this site
(ratio of required predicates to used output patterns)? (3) What
is the percentage of correctly wrapped pattern instances on
a number of randomly chosen, similarly structured test pages,
with a wrapper written on one example page only? (4) How
many example pages are necessary (due to structural
deviations) to get 100 percent of correctly matched pattern
instances? (5) Moreover, we specify the time needed for
constructing the initial wrapper based on one example page.
Additionally, the time for constructing one output pattern
is computed to gain a measure of how much "thinking time"
was required for each output pattern. (6) In the last row
the depth of the pattern tree is specified.</p>
      <p>Let us describe some more details: on eBay, the initial
wrapper worked well on almost all test pages, such as queries on
cars, football, etc. However, one filter rule of &lt;date&gt; required
that dates must contain a colon and a dash. This matched
one item description, too, which used both. Hence, the
pattern had to be refined based upon the knowledge of this
second page to match 100% of the patterns of all example
pages. For the CIA Factbook, the user chose a bad example
page with only one bordering country. Even after
improving the wrapper to deal with comma-separated countries,
Albania had to be treated in a special way. The wrapper for
DBLP relies on a number of intermediate auxiliary patterns,
indicated by the high nesting depth of the document. For
the CNN pages of the US election results per state, a
wrapper just extracting names of presidential candidates and the
received votes was written in a few minutes; due to a very
homogeneous structure, one example page was sufficient to
extract these data for all states. The Jobs Jobs Jobs site is
the only example where the number of needed sample pages
depends on the number of test pages, due to a wide variety
of structures for job offers. For the Perl Module List we are
merely interested in writing a wrapper for a single web page.
This list uses mainly preformatted text; hence the program
heavily relies on string extraction. In the current
implementation, some auxiliary patterns and some clever
constructions are needed to obtain a 100% match for the five chosen
patterns (module group, leaf patterns name, DSLI,
description, info). We conclude that almost all web pages can be
visually wrapped with Lixto. Observe that, although we
chose a rather structured file for illustrating Lixto
throughout the paper, our approach also works on pages with less
structure, such as the CIA Factbook. For none of the test
pages did the user have to modify the Elog program manually.
Wrapper construction is usually very fast. The program
length, measured in used predicates, is never unreasonably
large compared to the number of output patterns (ranging from 1.78
to 4.4). The user never had to consider more than three
example pages to get a 100% match for all test pages.</p>
    </sec>
    <sec id="sec-15">
      <title>RELATED WORK</title>
      <p>
        First, we give an overview of approaches less related to
Lixto because they do not provide visual support.
Stand-alone wrapper programming languages include Florid [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
(using a logic-programming formalism), Pillow [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (an HTML
and XML programming library for logic programming
systems), Jedi [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] (using attributed grammars), Tsimmis, and
Araneus. In Tsimmis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], the extraction process is based on
a procedural program which skips to the required
information; it allows temporary storage, split and case statements,
and following links. However, the wrapper output has to
obey the document structure. In Araneus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a user can
create relational views of web pages using computationally fast
and advanced text extraction and restructuring formalisms,
in particular procedural "Cut and Paste" exception
handling inside regular grammars. In general, all manual
wrapper generation languages are difficult for laypersons
to use.
      </p>
      <p>
        Machine learning approaches rely on learning from
examples and counterexamples on a large number of web pages.
Stalker [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] specialises general SkipTo sequence patterns
based on labelled HTML pages. An approach to maximise
specific patterns is introduced by Davulcu et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Other
examples include Softmealy [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] (using finite-state
transducers) and MIA [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] (Prolog-based wrappers using
anti-unification; neural networks to generalise and learn texts).
NoDoSe [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] extracts information from plain string sources
and provides a user interface for example labelling. It has
restricted capabilities to deal with HTML. Kushmerick et al.
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] create robust wrappers based on predefined extractors;
their visual support tool WIEN receives a set of training
pages, where the user can label relevant information and
the system tries to learn a wrapper. Their approach does
not use HTML parse trees. Kushmerick also contributed
to the wrapper verification problem [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], an issue worth
exploring w.r.t. Elog, too. In general, the drawbacks of
machine-learning approaches are limited expressive power and the
large number of required example pages.
      </p>
      </p>
      <p>
        Supervised interactive wrapper generation tools include
W4F [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] and XWrap [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. W4F uses an SQL-like query
language called HEL. Parts of the query can be generated
using a visual extraction wizard, which is limited to
returning the full DOM tree path of an element. However, the full
query must be programmed by the user manually. Hence,
W4F requires expertise with both HEL and HTML. HEL
requires tricky use of index variables and fork constructs
to correctly describe a complex pattern structure. XWrap
uses a procedural rule system and provides limited
expressive power for pattern definition. The user cannot label
regions in documents as flexibly as in Lixto. XWrap lacks
visual facilities for imposing external or internal conditions
on a pattern, and is instead rather template-based. The
division into two description levels and the automatic
hierarchical structure extractor severely limit the ways to define
extraction patterns (e.g. it is impossible to describe pattern
disjunctions). Hence, in general, other supervised wrapper
generation tools require manual postprocessing and do not
offer the browser-displayed document for labelling.
      </p>
    </sec>
    <sec id="sec-16">
      <title>CURRENT/FUTURE WORK</title>
      <p>
        It is already possible to write and execute Elog
programs that crawl to other pages, i.e. follow links
during extraction, and can recursively wrap linked sequences of
web pages. For such applications, the pattern structure no
longer forms a tree, because filters of one pattern definition
may refer to different parent patterns (in a similar fashion
to recursive data types). For example, recursive Elog
programs may follow a "next" button and navigate to further
pages during extraction, while extracting instances of the
same patterns. See Fig. 10 for an extension of the eBay
example of Fig. 8 that follows a "next" button and extracts
the same kind of information from each page. In this example,
the pattern &lt;document&gt; has an initial filter which uses the
user-provided page ($1), and an additional filter which uses
&lt;nexturl&gt; as parent pattern (whose instances are strings
representing URLs). Web crawling and recursion in Lixto
are described in more detail in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Currently we are
extending the interactive pattern builder to cover these aspects.
      </p>
      <p>Furthermore, a server-based Lixto version is currently being
implemented; it uses simple web interfaces and works in the
user's favourite browser. Future work focuses on automation
heuristics for optional use, including working on multiple
example targets at once. Additionally, Lixto wrappers will be
embedded into a personalisable information channel system.</p>
      <p>[Fig. 10: additional filters extending the eBay wrapper to follow the "Next" button:]
next(S, X) ← document(_, S), subelem(S, (.*.content, [(a, _, substr), (elementtext, Next, exact)]), X)
nexturl(S, X) ← next(_, S), subatt(S, href, X)
document(S, X) ← getDocument($1, X)
document(S, X) ← nexturl(_, S), getDocument(S, X)</p>
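The recursive crawling behaviour can be sketched as an iteration that keeps applying the same extraction to each fetched page (the three callables stand in for the Elog rules; all names and the page model are illustrative assumptions):

```python
def crawl(get_document, extract_records, extract_next, start_url):
    """Follow "next" links while extracting the same patterns from
    each page -- a sketch of the recursive document/nexturl filters."""
    url, records = start_url, []
    while url is not None:
        page = get_document(url)          # getDocument(S, X)
        records += extract_records(page)  # <record> instances of this page
        url = extract_next(page)          # <nexturl> instance, if any
    return records

# two toy pages linked by a "next" reference
pages = {
    "p1": {"records": ["a", "b"], "next": "p2"},
    "p2": {"records": ["c"], "next": None},
}
out = crawl(lambda u: pages[u],
            lambda p: p["records"],
            lambda p: p["next"], "p1")
assert out == ["a", "b", "c"]
```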
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Data on the Web - From Relations to Semistructured Data and XML</article-title>
          . Morgan Kaufmann,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Adelberg</surname>
          </string-name>
          .
          <article-title>NoDoSE - a tool for semi-automatically extracting semi-structured data from text documents</article-title>
          .
          <source>In Proc. SIGMOD</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Atzeni</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          .
          <article-title>Cut and paste</article-title>
          .
          <source>In Proc. PODS</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baumgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Flesca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          .
          <article-title>Declarative information extraction, web crawling and recursive wrapping with Lixto</article-title>
          .
          <source>Proc. LPNMR</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baumgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Flesca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          .
          <article-title>Supervised wrapper generation with Lixto</article-title>
          .
          <source>Proc. VLDB Demo</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Baumgartner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Flesca</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          .
          <article-title>Visual web information extraction with Lixto</article-title>
          .
          <source>In Proc. VLDB</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cabeza</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Hermenegildo</surname>
          </string-name>
          .
          <article-title>Distributed WWW programming using (Ciao-)Prolog and the PiLLoW library</article-title>
          .
          <source>TPLP</source>
          ,
          <volume>1</volume>
          (
          <issue>3</issue>
          ),
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ceri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Comai</surname>
          </string-name>
          , E. Damiani,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fraternali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paraboschi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          .
          <article-title>XML-GL: a graphical query language for querying and restructuring XML documents</article-title>
          .
          <source>In Proc. WWW Conf</source>
          .,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chamberlin</surname>
          </string-name>
          et al. (Eds.).
          <article-title>XQuery: A query language for XML</article-title>
          . http://www.w3.org,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Davulcu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kifer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          .
          <article-title>Computational aspects of resilient data extraction from semistructured sources</article-title>
          .
          <source>In Proc. PODS</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Florescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          .
          <article-title>A query language for XML</article-title>
          .
          <source>In Proc. 8th Intern. WWW Conference</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aranha</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Crespo</surname>
          </string-name>
          .
          <article-title>Extracting semistructured information from the web</article-title>
          .
          <source>In Proc. Workshop on Mang. of Semistructured Data</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.-N.</given-names>
            <surname>Hsu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Dung</surname>
          </string-name>
          .
          <article-title>Generating finite-state transducers for semistructured data extraction from the web</article-title>
          .
          <source>Information Syst., 23/8</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fankhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Aberer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.J.</given-names>
            <surname>Neuhold</surname>
          </string-name>
          . JEDI:
          <article-title>Extracting and synthesizing information from the web</article-title>
          .
          <source>In Proc. COOPIS</source>
          , IEEE CS Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kushmerick</surname>
          </string-name>
          .
          <article-title>Wrapper verification</article-title>
          .
          <source>World Wide Web Journal</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kushmerick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weld</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Doorenbos</surname>
          </string-name>
          .
          <article-title>Wrapper induction for information extraction</article-title>
          .
          <source>In Proc. IJCAI</source>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.S.</given-names>
            <surname>Weld</surname>
          </string-name>
          .
          <article-title>Intelligent internet systems</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>118</volume>
          (
          <issue>1-2</issue>
          ),
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pu</surname>
          </string-name>
          , and W. Han.
          <article-title>XWrap: An extensible wrapper construction system for internet information</article-title>
          .
          <source>In Proc. ICDE</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>May</surname>
          </string-name>
          , R. Himmeroder, G. Lausen, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Lud</surname>
          </string-name>
          <article-title>ascher. A uni ed framework for wrapping, mediating and restructuring information from the web</article-title>
          .
          <source>In WWWCM. Sprg. LNCS 1727</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>McLaughlin</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hunter</surname>
          </string-name>
          . jdom.org Package. http://www.jdom.org/.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Mueller</surname>
          </string-name>
          .
          <article-title>Natural language processing with ThoughtTreasure</article-title>
          .
          <source>Signiform</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>I.</given-names>
            <surname>Muslea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          .
          <article-title>A hierarchical approach to wrapper induction</article-title>
          .
          <source>In Proc. 3rd Intern. Conf. on Autonomous Agents</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sahuguet</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Azavant</surname>
          </string-name>
          .
          <article-title>Building light-weight wrappers for legacy web data-sources using W4F</article-title>
          .
          <source>In Proc. VLDB</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Savarese</surname>
          </string-name>
          .
          <article-title>OROMatcher - Regular Expressions for Java</article-title>
          . http://www.savarese.org/oro/.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Starlab</surname>
          </string-name>
          . http://www.starlab.vub.ac.be/research/ dogma/ontologyserver.htm.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          .
          <article-title>Content integration for e-business</article-title>
          .
          <source>In Proc. Sigmod</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomas</surname>
          </string-name>
          .
          <article-title>Anti-unification based learning of T-wrappers for information extraction</article-title>
          .
          <source>In Workshop on Machine Learning for IE</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>