=Paper= {{Paper |id=Vol-99/paper-6 |storemode=property |title=Visual Web Information Extraction with Lixto |pdfUrl=https://ceur-ws.org/Vol-99/Robert_Baumgartner-et-al.pdf |volume=Vol-99 |dblpUrl=https://dblp.org/rec/conf/kcap/BaumgartnerFG01 }} ==Visual Web Information Extraction with Lixto== https://ceur-ws.org/Vol-99/Robert_Baumgartner-et-al.pdf
               Visual Web Information Extraction with Lixto                                                                             ∗




                    Robert Baumgartner                  Sergio Flesca                    Georg Gottlob
                        DBAI, TU Wien             DEIS, Università della Calabria          DBAI, TU Wien
                         Favoritenstr. 9           Via Pietro Bucci, 41C-42C                 Favoritenstr. 9
                          1040 Vienna                   87030 Rende (CS)                      1040 Vienna
                            Austria                           Italy                             Austria
                    baumgart@dbai.tuwien.ac.at         flesca@deis.unical.it           gottlob@dbai.tuwien.ac.at



ABSTRACT                                                                                                       web
                                                                                                                         set of structural
                                                                                                                         similar pages
We present new techniques for supervised wrapper genera-
tion and automated web information extraction, and a sys-                                                               example set
                                                                                                                        (usually a single page)
tem called Lixto implementing these techniques [6]. Our sys-
                                                                      user
tem can generate wrappers which translate relevant pieces of
HTML pages into XML. Lixto, of which a working prototype                                                                              XML Generator /
                                                                                                                                       Simple Query
has been implemented, assists the user to semi-automatically           Interactive     Extractor               events
                                                                                                                         Controller       System




                                                                                                     control
                                                                                                     logic
create wrapper programs by providing a fully visual and in-          Pattern Builder
                                                                                       (continual)
teractive user interface. In this convenient user-interface                                                                      Transformer
very expressive extraction programs can be created. Inter-
nally, this functionality is reflected by the new logic-based                                         Pattern
                                                                                                     Instance
declarative language Elog. Users never have to deal with                           ELOG                Base
Elog and even familiarity with HTML is not required. Lixto                        Program
                                                                                                   (hierarchically                    XML
can be used to create an “XML-Companion” for an HTML                                                  ordered)

web page with changing content, containing the continually
                                                                                                                        work on XML
updated XML translation of the relevant information.

                                                                           Figure 1: Overview of the Lixto System
1.   INTRODUCTION AND MOTIVATION
   Nowadays web content is mainly formatted in HTML.
This is not expected to change soon, even if more flexible
                                                                   query possibilities and leave you with a large number of re-
languages such as XML are attracting a lot of attention.
                                                                   sult records organised in a huge table split over many web
While both HTML and XML are languages for represent-
                                                                   pages. You have to wade through all these records manu-
ing semistructured data, the first is mainly presentation-
                                                                   ally, because of no possibility to further restrict the result.
oriented and is not really suited for database applications.
                                                                   Another drawback is that you cannot directly collect infor-
XML, on the other hand, separates data structure from lay-
                                                                   mation of different auction sites (e.g. onetwosold and ebay
out and provides a much more suitable data representation
                                                                   items together) into a single structured file, a difficult task
(cf. e.g. [1, 17]). A set of XML documents can be regarded
                                                                   of web information integration due to very different presen-
as a database and can be directly processed by a database
                                                                   tation on each site.
application or queried via one of the new query languages
                                                                      The solution is thus to use wrapper technology to extract
for XML, such as XML-GL [8], XML-QL [11] and XQuery
                                                                   the relevant information from HTML documents and trans-
[9]. As the following example shows, the lack of accessibil-
                                                                   late it into XML which can be easily queried or further pro-
ity of HTML data for querying has dramatic consequences
                                                                   cessed. Based on a new method of identifying and extracting
on the time and cost spent to retrieve relevant information
                                                                   relevant parts of HTML documents and translating them
from web pages.
                                                                   to XML format, we designed and implemented the efficient
   Imagine you would like to monitor interesting eBay offers
                                                                   wrapper generation tool Lixto, which is particularly well-
(www.ebay.com) of notebooks, where an interesting offer is,
                                                                   suited for building HTML/XML wrappers and introduces
for example, defined by an auction item which contains the
                                                                   new ideas and programming language concepts for wrapper
word “notebook”, has current value between gbp 1500 and
                                                                   generation. Once a wrapper is built, it can be applied auto-
3000 and which has received at least three bids so far. The
                                                                   matically to continually extract relevant information from a
eBay site does not offer the possibility to formulate such
                                                                   permanently changing web page.
complex queries. Similar sites do not even give restricted
                                                                      The Lixto method and system fulfills the requirements
∗                                                                  specified in a very recent paper on e-commerce tools [26]:
  All new methods and algorithms presented in this paper
are covered by a pending patent. Future developments of            “These tools must be targeted at typical, non-technical con-
Lixto will be reported at www.lixto.com. This paper also           tent managers. In order to be usable, the tools must be
appeared in the proceedings of VLDB 2001.                          graphical and interactive, so that content managers see data
                                                                   as it is being mapped.” Lixto’s distinctive features are sum-
                                                                   marised in the following. Lixto is easy to learn and use be-
cause a fully visual and interactive user interface is provided.
Neither manual fine-tuning nor knowledge of the internal
language is necessary. Lixto uses straightforward region                                         current
                                                                                                 example
marking and selection procedures that allow even those users                                     source            selected
not familiar with HTML to work with the wrapper genera-                                                            after element
tor. Lixto lets a wrapper designer work directly and solely
on browser-displayed example pages, unlike other tools (see
Section 6), that force the designer to work with other doc-                                            current example target
ument views such as, e.g., table-views of the document or
displayed HTML parse trees, or even HTML sources. Af-
ter selecting example targets in the browser display, Lixto
responds with highlighted targets in the same display (see                       Figure 2: The Lixto Browser
Section 3). With Lixto, very expressive visual wrapper gen-
eration is possible: It allows for extraction of target patterns
based on surrounding landmarks, on the contents itself, on         HTML pages of similar structure.
HTML attributes, on the order of appearance and on se-               With the controller of the XML Generator, the user chooses
mantic and syntactic concepts. Extraction is not limited to        how to map extracted information to XML. Its transformer
tokens of some document object model, but also possible            module performs the actual translation from the extracted
from flat strings. Multiple and single targets are treated in      pattern instance base to XML.
a uniform way. Lixto even allows for more advanced features
such as disjunctive pattern definitions, crawling to other         3.    WRAPPER GENERATION
pages during extraction, recursive wrapping. Moreover, the
extracted data structures do not have to strictly obey the         3.1    Creating Wrappers
input HTML structure. Preliminary results on representa-
                                                                       A Lixto wrapper is created interactively by creating pat-
tive web pages with using the current Lixto prototype show
                                                                   terns in a hierarchical order. For example, one can first de-
a good performance (see Section 5).
                                                                   fine a pattern  and then define a subpattern .
   The above mentioned features are internally reflected by
                                                                   The subpattern relationship in this case expresses that each
a declarative extraction language called Elog (see Section 4),
                                                                   extracted instance of  must occur within one in-
which uses a datalog-like logical syntax and semantics. Elog
                                                                   stance of . Pattern names act as default XML ele-
is invisible to the user. It is ideally suited for representing
                                                                   ment names. Each pattern characterises one kind of infor-
and successively incrementing the knowledge about patterns
                                                                   mation. The set of extracted instances of a pattern, which
described by users. This knowledge is generated in an in-
                                                                   are either HTML elements, list of elements, or strings, de-
teractive process consisting of successive narrowing (logical
                                                                   pends on the current page. Each pattern is defined by one
and ) and broadening (logical or ) steps. An Elog program
                                                                   or more filters. A filter e.g. allows the system to identify a
is a collection of datalog-like rules containing special extrac-
                                                                   set of similar nodes of the HTML parse tree, for instance a
tion conditions in their bodies. Elog is flexible, intuitive and
                                                                   set of items internally represented as .
easily extensible.
                                                                       A filter is created as follows: First, the user highlights
   This paper is structured as follows. In the next section
                                                                   with the mouse a representative instance of the desired tar-
the system architecture is described, in Section 3 we give an
                                                                   get pattern directly on the example page. Internally, the
overview of the the interactive pattern generation and visual
                                                                   system associates to this instance a generalised tree path in
UI, whereas Section 4 is devoted to the theory of the Elog
                                                                   the HTML parse tree identifying similar instances and incor-
extraction language. Section 5 presents empirical results
                                                                   porates this as main goal of a Elog rule representing the filter
of using the Lixto wrapper generator, Section 6 discusses
                                                                   (see Section 4). Second, the user adds restrictive conditions
related approaches and Section 7 highlights future research
                                                                   to the filter. These are reflected by the system as additional
directions.
                                                                   goals in the rule body describing this filter. The possible
                                                                   conditions, which will be explained in more detail, include:
2.   ARCHITECTURE / IMPLEMENTATION                                 (a) before/after conditions that express that the target
   A working prototype of Lixto already has been imple-            pattern instance must appear before or after some specific
mented with Java using Swing, OroMatcher [24] and JDOM             element. (b) notbefore/notafter conditions that express
[20]. The Lixto Toolkit (Figure 1) consists of the following       that some specific element must not be close to the target
modules:                                                           pattern. (c) internal conditions that express that some
   The Interactive Pattern Builder provides the visual UI          specific element must (not) appear inside the target pattern.
that allows a user to specify the desired extraction patterns      (d) range conditions which, in case of multiple matchings,
and the basic algorithm for creating a corresponding Elog          restrict the set of matched instances to a subinterval.
wrapper as output.                                                     Adding a filter to a pattern extends the set of extracted
   The Extractor is the Elog program interpreter that per-         targets, whereas imposing a condition to a filter restricts the
forms the actual extraction based on a given Elog program.         set of targets. Alternately imposing conditions and adding
The extractor, provided with an HTML document and a pre-           new filters can perfectly characterise the desired informa-
viously constructed program, generates as its output a pat-        tion. The system creates Elog rules based on user-defined
tern instance base, a data structure encoding the extracted        filters. The user is never concerned with the internal lan-
instances as hierarchically ordered trees and strings. One         guage Elog. The user interface is extremely simple and the
program as input of the extractor can be used for contin-          entire wrapper construction process can be learned by an
ual extraction on changing pages, or to extract from several       average user in very short time. The user is guided through
            filter ∧
                       condition   User is shown currently
                                   matched instances of T       no    Extend filter by adding
                                                                      a constraint condition
                                                                                                          Condition Builder                      adds one condition to a filter
                                   within all instances of S
                       condition                                       (I/A). See Condition
pattern ∨                           and asked if satisfied
                                   (= no unwanted target
                                                                         Builder Figure for                                                                    Characterise
                                                                               details.                     Select a range by indicating the first            element that is
            filter ∧
                                         matched).(A)
                       condition                                                                           and last relevant target instance to be            not allowed to
                                                                                                                        extracted (I)                         appear before/
                                                                yes
                                                                                                                                                             after target. (I/A)
    Select a suitable in-          Highlight an instance t            Let d be the set of filters
   stance s of the parent           of T within s. (I) Select          so far constructed for                          range         notbf           notaf
  pattern S containing an          characteristic attributes           pattern T. Add current
  instance of the desired           of t. (I/A) System cre-              filter to d. System                                                     Select an element                     Set the
    target pattern T. (I)           ates main rule goal of             adds a corresponding                                         before    before/after the example                distance
                                     the desired filter. (A)            rule to program. (A)              Select type of condition.           instance and within the                tolerance
                                                                                                                     (I)            after     parent instance; user is             in percent to
                            no
  New Pattern                                                                                                                                  guided by wizards. (I)               left/right. (I)
  Generation                                                            Add d to the program
                                                                         and remember all
                                                                                                                       internal
  Input:                           Test whether d extracts
                                   exactly the desired set               instances of T for                  Select an element
  Parent pattern S                  of instances of T. (I/A)    yes        future pattern                                                     Choose attributes of selected element
  Output:                                                                                                 inside the instance and
                                                                        generation steps. (A)                                                to be considered (e.g. content, font-type
                                                                                                              within the parent
  Child pattern T defined                                                                                     instance; user is               etc.). A wizard automatically proposes
  as a set of rules d                                                                                      guided by wizards. (I)              particularly relevant attributes. (I/A).



            Figure 3: Generation of a new Pattern
                                                                                                                      Figure 4: Adding Conditions

a supervised pattern generation, and by simply marking rel-
evant information items on-screen and visually setting con-                                         Elog rule defining the filter at hand. Each filter is intended
straints, filters and patterns are created. In [5], we describe                                     to extract a subset of the desired target set. If the current
an example program construction.                                                                    pattern is less general than intended by the user, another fil-
                                                                                                    ter can be added, internally reflected by an additional Elog
3.2      Pattern Creation Algorithm                                                                 rule for the same pattern (several rules for the same pattern
    The generation of a pattern is described in Fig. 3. The                                         are interpreted disjunctively, as usual in Datalog). Different
user can hierarchically define and refine patterns. She en-                                         filters may be created based on labelling in different exam-
ters a pattern name, specifies the parent pattern S, selects                                        ple parent pattern instances. By iterating restricting and
by mouse clicks one example instance s of the parent pat-                                           generalising steps, it is usually possible to describe a desired
tern and marks (with the mouse) an element (e.g. one price)                                         pattern perfectly. Once a pattern is defined, the user may
inside this instance on the sample page.                                                            use this pattern as parent for a new pattern. A detailed
    At the beginning, i.e. when facing a new HTML docu-                                             creation of a simple example wrapper is given in [5].
ment (which is loaded into an internal browser; see Fig.
2) and having created a new program, the only pattern is                                            3.3     The Visual Interface
 with a unique instance, the current example                                                 The current implementation includes visual tree pattern
document. Fig. 3 distinguishes interactive (I) and automatic                                        construction and the use of string patterns. All filter con-
(A) steps and gives the logical pattern structure in its top-                                       ditions discussed in this paper are supported. Moreover,
left corner. A pattern may consist of multiple filters. Each                                        the visual interface is assisted by an XML visualisation tool
filter contains a number of conditions. An extracted instance                                       which at each instant shows the user the so far extracted
must satisfy all conditions of at least one filter. Two consec-                                     XML code. A concept atom generator to create predefined
utive mouse clicks on different parts of the current parent                                         concepts (such as “isCity”, “isDate”) based both on regular
instance are interpreted in the best possible way to mark an                                        expressions and on reading some database tables is currently
HTML element of the document parse tree (cf. Fig. 7) or if                                          being added. Such concepts are especially useful to allow
not possible a list of elements.                                                                    users to create string patterns without knowledge of regular
    The system generates a basic filter without conditions,                                         expressions. Fig. 5 shows the main menu of Lixto (left-hand
but the user can already state some attribute requirements                                          side). There, a new program can be created or an existing
(the system constructs a suitable element path definition,                                          one loaded, new patterns can be added, the document for
see Section 4). Then it highlights all objects on the current                                       labelling can be chosen, etc. The same figure shows on its
example page that match these initial filter criteria (not only                                     right hand side the source selection dialogue which enables
in the current pattern instance, but in all pattern instances).                                     the user to select at which node to create a new pattern. Fig.
Sometimes a user wants a single match within one source,                                            2 shows the internal Lixto browser when selecting an after
sometimes multiple matches – this makes no difference in                                            element. For each condition, an own interface is provided
the algorithm – it just depends on the definition of filters                                        which uses the user-labelled information.
and conditions. E.g., if the user marks a table row, the
system recognises the entity  and highlights all table                                          3.4     Translation into XML
rows occurring at a comparable level in the document. At                                               The output by the extractor is well-suited for translation
the same time the system constructs a general Elog rule for                                         into XML. The interactive XML generator exploits the hi-
extracting table rows.                                                                              erarchical structure of the pattern instance base and uses
    If the user is satisfied with the elements identified by the                                    pattern names as default XML element names. The user
system, she can confirm the pattern definition. Satisfac-                                           can interactively choose the HTML attributes that appear
tion, in this context, means that only desired targets are                                          in the XML output. Even more important is the possibility
matched. Otherwise, if the concept is too general, then she                                         to decide which patterns are written to XML, possibly using
can add restricting conditions (which are reflected by Elog                                         auxiliary patterns. Fig. 9 displays the result of applying a
condition predicates); cf. Fig. 4. For each such restriction,                                       (not illustrated) wrapper program onto the web page of Fig.
the system adds the corresponding condition atom to the                                             6.
         open an url                                      test and save the current                       Currently constructed            Return to
         used for creating create a new pattern (naming; patterns (all its filters)                       hierarchy of patterns             menu
         patterns              enables adding of filters)      Highlighting
                                                               Option
 currently
 constructed
 pattern

 pattern and
 filter deletion
 option menu

     current
     status
           current program name or                                  Select
          “new program” if program current document which is parent
            is not yet saved (hidden) used for program creation     pattern
                                              Create the new
                                              pattern and select                Choose another                    A leaf node      navigate through
                                              highlighted instance as           pattern instead                   of the pattern   pattern instances
                                              example target of first filter                                      tree             to select example
                                                                                                                                   parent instance

                     Figure 5: Main Menu and New Pattern Generation Menu of current prototype


4.     DATA EXTRACTION                                                               http://www.dbai.tuwien.ac.at/lixto.html



4.1      A first glance at Elog
   Elog is the system-internal datalog-like rule based lan-
guage specifically designed for hierarchical and modular data
extraction. A user of Lixto does not have to learn Elog and
never sees the Elog program. Elog rules are the implemen-
tations of the visually defined filters and define elements to
be extracted from web pages. Before we discuss the features
                                                                                                Figure 6: HTML Example Page
of the language in detail, have a look at Fig. 8, in particular
at the rule with head predicate record(S, X). Observe that
we use as in Prolog the same variables for each rule, and
denote with “ ” a variable in whose instantiations we are                      sets.
not interested. This predicate identifies records on an eBay                      Observe that in our chosen document object model, sev-
page (each one is an own table). The first atom in the rule                    eral leaf elements are  elements – this parser treats
body specifies that the context S of the extraction, i.e. the                  tags such as  (bold-face) as attributes of an imaginary
so-called parent pattern, is an instance of . The                     element. We introduced a special attribute called
second atom in the rule body looks for subelements that                        “elementtext” for each element. This attribute reflects the
qualify as tables inside the unique  instance and                    contents of the element, which is in case of an internal node
instantiates X with them. Given that the same Elog pro-                        the left-to-right concatenation of the leaf elements below the
gram can be applied to different web pages, the actual ele-                    internal node. In the following, we distinguish tree regions
ments that an Elog program defines and extracts depend on                      and subtrees of the HTML tree. A tree region is a region
the current web page. For this reason, we refer to the head                    rooted at an internal node of the HTML tree where only the
predicates defined by an Elog program as patterns. More-                       i-th up to the j-th child and their descendants are consid-
over, we denote a set of rules with the same head as pattern,                  ered. Observe that a tree region is contiguous. A subtree
too. The syntax and semantics of Elog and its predicates is                    is the tree rooted at one node of the HTML tree, i.e. all
explained below (only informally due to space constraints).                    descendants are considered.

4.2      Document Model                                                        4.3      Extraction Mechanisms
   Consider the example web page lixto.html of Fig. 6                             Lixto offers two basic mechanisms of data extraction –
and its parse tree as displayed in Fig. 7 based on the Java                    tree and string extraction. For tree extraction, we iden-
Swing parser. The values in brackets are the start and                         tify elements with their corresponding tree paths and possi-
end-offsets (in characters) of the corresponding elements in                   bly some properties of the elements themselves. This does
the actual document. Additionally, we number nodes in a                        not necessarily identify a single element. As an example,
depth-first left-to-right fashion. Nodes of the HTML tree                      ?.table. ? .tr is a valid tree path. In the sample page page of
refer to elements which are represented as sets. The set                       Fig. 6, three elements are matched. The star acts as wild-
contains pairs describing the association between attribute                    card. The expression . ? .x matches all paths to x which con-
names and corresponding attribute values. E.g., the                      tain x as last element only. A plain tree path is a sequence
element node of Fig. 7 is associated with {(name,body), (bg-                   of consecutive nodes in a subtree of an HTML tree. In an
color,FFFFFF), (elementtext,Items for . . . . . . 137)} (whole                 incompletely specified tree path stars may be used instead
document text). Fig. 7 highlights two other such attribute                     of element names. For simplicity, incompletely specified tree
                                                                                                 (75,276)              expressions are powerful tools for text processing and match-
                                     1                                                    (name, table) (border,1)     ing. Refer to [24] for a Java regular expression library. Ex-
                                         body                                           (elementtext,56 K Modem....)
                                         (6,277)                                               (width,75%)
                                                                                                                       traction generates minimal non-overlapping substrings. The
                                                                                                                       final two patterns of Fig. 8 give an example of string extrac-
    2                    5                                                                                             tion. An attribute path definition apd helps to extract values
          h4             p-implied                 center                           p                                  of attributes. It is simply a string (expressing the attribute
                                                    (23,276)                                                           name).
3          4         6
content    content       hr     content             p               table           content
                                                                    (75,276)
                                                                                                                       4.4    Language Definition
                                                                                                                           Elog atoms correspond to special predicates with a well-
           (257,263)
                                content       content          tr              tr             tr                       defined semantics. They operate on source objects (tree re-
                                                                                                                       gions and string sources), path definition objects and numer-
        (name, content)
(a, href = "mailto:steven@...")
                                                                                                                       ical arguments and obey binding conventions. In a datalog-
                                                   td          td              td             td                       like language, the function mapping a given source S to
  (href, "mailto:steven@...")
                                                                            p-implied
                                                                                                                       a set of elements matching an epd is treated as relation
    (elementtext, Steven)
                                                                                                                       subelem(S, epd, X). subelem(s, epd, x) evaluates to true iff
                                                                     content        content                            s is a tree region, epd is an element path definition and x
                                                                                                                       is a tree region contained in s where the root of x matches
                                                                                                                       epd. Note that the tree path specified in a tree extraction
Figure 7: HTML Parse Tree of Example in Figure 6                                                                       definition predicate is always relative to the parent-pattern
                                                                                                                       instance.
                                                                                                                           Extraction definition predicates specify a set of ex-
paths are referred to as tree paths. The semantics of a tree                                                           traction instances. One of these is subelem. As far as string
path applied to a tree region of an HTML page is defined as                                                            extraction is concerned, the predicate subtext(S, spd, X) is
the set of matched elements.                                                                                           used. There, S is either a tree region or a string source, and
   Attribute Conditions are constraints reducing the num-                                                              X a string source. Two more extraction definition predicates
ber of matched elements. They pose requirements on oc-                                                                 are built-in. (1) subsq(S, epd, fpd , lpd , X): If s and x are tree
curring attributes and their values. An attribute condition                                                            regions, epd is an element path definition, and fpd and lpd
is a triple specifying a required name, a required value (a                                                            are simple element path definitions, subsq(s, epd, fpd , lpd , x)
string, or in case the third parameter is regvar, a regu-                                                              evaluates to true iff the root of x satisfies epd, its first child
lar expression possibly containing some variables indicated                                                            satisfies fpd and its last one lpd . (2) subatt(S, apd, X): If
by \var), and a special parameter exact, substr or regvar,                                                             s is a tree region, x a string source and att is an attribute
indicating that the attribute value is exactly the required                                                            path definition of the root element of s, then subatt(s, apd, x)
string, is a superstring of it, or matches the given regular                                                           evaluates to true iff x is the value of apd. subatt gives the
expression, respectively. Instead of giving a formal defini-                                                           possibility to extract the values of attributes.
tion, we illustrate this with an example: (?.hr, [(size, [3 −                                                              Context condition predicates specify that some other
4]∗, regvar ), (width, %, substr )]) identifies horizontal rules of                                                    subtree or text must (not) appear before or after the de-
size 3 or 4 with a width specified in percent. Each output                                                             sired extraction target. For example, on a page with several
variable, which is included in the second parameter must be                                                            tables, the final table could be identified by an external con-
used as input for a concept of the same rule (cf. Section 4.4).                                                        dition stating that no table appears after the desired table.
   An element path definition epd consists of a tree path and                                                          Before predicates are explained here, after predicates work
a set of attribute conditions. It is called simple if it consists                                                      analogously. (1) before(S, X, epd, b, e, Y, P ): If s and x are
of one element name only. The semantics of applying an                                                                 tree regions, then before(s, x, epd, b, e, y, p) evaluates to true
element path definition to a tree region of an HTML tree is                                                            iff y is a subtree whose root node is matched by epd and the
given as the set of matched elements of the corresponding                                                              end offset of y precedes the start offset of x within relative
tree path which moreover satisfy all of the attribute con-                                                             distance p where b ≤ p ≤ e. (2) notbefore(S, X, epd, d): If s
ditions. Instead of element path definitions, equivalently,                                                            and x are tree regions, then notbefore(s, x, epd, d) evaluates
XPath expressions can be used (with some extensions, such                                                              to true iff no element satisfying epd precedes x within rela-
as the possibility to express that an attribute value is a con-                                                        tive distance d. The same predicates are defined for string
cept). To simplify presentation, however, we stick to our                                                              extraction: There, S is an arbitrary source, X is required to
introduced notation.                                                                                                   be a string source, spd is used instead of epd and instead of
   The second extraction method relies on strings. In the                                                              the root node simply the string itself is used. The percentual
HTML parse tree, strings are represented by the text of                                                                distance values b and e define the tolerance interval where
content leaves. However, we associate a string to every node                                                           the element is allowed to occur inside the current parent-
of the parse tree available as the value of the attribute ele-                                                         pattern instance. Additionally, a condition predicate may
menttext. For instance when extracting access codes of the                                                             contain new variables Y and P , which can be referred by
phone-numbers of lixto.html, string extraction has to be                                                               other conditions. To express that an element occurs any-
used. A substring of the elementtext of an HTML tree is de-                                                            where within the parent instance and before the target (or a
noted as string source. One can express that a string source                                                           condition output), the distance values are set to 0 and 100,
must match a given regular expression. A string path defi-                                                             respectively.
nition spd is a regular expression possibly containing some                                                                Internal conditions predicates impose conditions on
variables (variable Y indicated by \var[Y ]) which appear in                                                           the internal structure. Imagine, for instance, one wants to
some concept predicate of the corresponding rule. Regular                                                              extract all tables containing somewhere a word typeset in
           tablesq(S, X) ← document(“www.ebay.com/”, S), subsq(S, (.body, []), (.table, []), (.table, []), X),
                           before(S, X, (.table, [(elementtext, item, substr]), 0, 0, , ), after(S, X, .hr, 0, 0, , )
            record(S, X) ← tableseq( , S), subelem(S, .table, X)
           itemnum(S, X) ← record( , S), subelem(S, ?.td, X), notbefore(S, X, .td, 100)
           itemdes(S, X) ← record( , S), subelem(S, (?.td. ? .content, [(a, , substr)], X)
             price(S, X) ← record( , S), subelem(S, (?.td, [(elementtext, \var[Y].∗, regvar)]), X), isCurrency(Y)
              bids(S, X) ← record( , S), subelem(S, ?.td, X), before(S, X, .td, 0, 30, Y, ), price( , Y)
          currency(S, X) ← price( , S), subtext(S, \var[Y], X), isCurrency(Y)
           pricewc(S, X) ← price( , S), subtext(S, [0 − 9]+ \.[0 − 9]+ , X)


                               Figure 8: Elog Extraction Program for Information on eBay



italics. This can be obtained by adding a contains condi-           range parameters. New and Par are pattern predicates re-
tion. contains(X, epd, Y ): contains(x, epd, y) evaluates to        ferring to the parent pattern and defining the new pattern,
true iff x is a tree region (string source) containing a sub-       respectively. The above standard rule reflects the princi-
tree (string source) y where the root element of y matches          ple of aggregation. In an extended environment, we more-
epd (where y matches spd). The firstsubtree condition is            over allow specialisation rules such as: greentable(S, X) ←
a kind of “startswith” condition that states that the first         table(S, X), contains(X, (.td, [color, green, exact]), ). Addi-
subtree of a tree region should contain a particular element.       tionally, an extended environment contains document filters,
firstsubtree(X, Y ): firstsubtree(x, y) evaluates to true iff y     using a getDocument(S,X) atom, where S is a string source
is the subtree rooted at the first child of the tree region x.      representing an URL, and X the web page the URL points
lastsubtree is defined analogously.                                 to. With such filters, one can crawl to further documents. If
   Concept condition predicates are semantic concepts               document filters are used, each program has an initial filter
like isCountry(X) or isCurrency(X) (see Fig. 8) or syntac-          using the getDocument atom with user-specified input.
tic ones like isDate(X) (or isDate(X, Y ) where the output             The semantics of a rule is given as the set of matched tar-
Y returns a standard date format), stating that a string            gets x: A substitution s, x for S and X evaluates New (s, x)
X represents a date, a country, or a currency, respectively.        to true if all atoms of the body are true for this substitu-
Some predicates are built-in to enrich the system, however          tion. Only those targets are extracted for which the head
more concepts can be interactively added with assistance            of the rule resolves to true. Moreover, if the extraction def-
of the Lixto concept editors. Syntactic predicates are cre-         inition predicate is a subsequence predicate, only minimal
ated as regular expressions, whereas semantic ones refer to         rule outputs are matched (i.e. instances that do not contain
a database of ontologies (e.g. using ThoughtTreasure [21] or        any other instances). Observe that range criteria are applied
Starlab Dogma [25]). Moreover, Comparison conditions                after non-minimal targets have been sorted out.
such as < (X, Y ) allow comparison of concepts such as two             A pattern is a set of extraction rules defining the same
standard format dates.                                              head and referring to the same parent pattern. In the visual
   Pattern predicates indicate that a source belongs to a           pattern generation the user first enters a pattern name and
particular pattern and refers to a particular parent pattern-       to which parent pattern the pattern belongs. All rules cre-
instance. They are used in the head, and in the rule body           ated inside the pattern use this information. We distinguish
for referring to a parent pattern and for further pattern ref-      tree and string patterns. To the first, only tree extraction
erences. As an example, the  pattern can be con-             rules can be asserted, to the second one only string extrac-
structed by using the element path definition . ? .td, and          tion rules. The root pattern  is a special pattern
imposing the constraint that immediately before, a target           without filters. If using document filters to crawl to further
of pattern  needs to occur:                                   web pages, document patterns are used as third pattern type
                                                                    (and an initial document filter is used). Parents of tree pat-
           before(S, X, . ? .td, 0, 1, Y, ), item( , Y ).
                                                                    terns are either tree or document patterns, parent of string
   Range conditions restrict the matched targets depend-            patterns are tree or string patterns, and parent of document
ing on their order of appearance. To any rule, a range con-         patterns are string patterns. A pattern acts like a disjunc-
dition such as “[3,7]” can be added, indicating that only the       tion of rule bodies: To be an extracted instance of a pattern,
third up to the seventh matched instance within each parent         a target needs to be in the solution set of at least one rule.
instance are matched.                                               The pattern output additionally obeys a minimality crite-
                                                                    rion. In patterns, even in those consisting of a single rule,
4.5    Elog Extraction Programs                                     overlapping targets may occur.
   A standard extraction rule looks as follows: N ew(S, X) ←           An extraction program P is a set of patterns. Elog pro-
P ar( , S), Ex(S, X), Co(S, X, . . . )[a, b], where S is the par-   gram evaluation differs from Datalog evaluation in the fol-
ent instance variable, X is the pattern instance variable,          lowing three aspects: built-in predicates, various kinds of
Ex (S, X) is an extraction definition atom, and the optional        minimisation, and use of range conditions. Moreover, the
Co(S, X) are further imposed conditions. A tree (string) ex-        atoms are not evaluated over an extensional database of
traction rule uses a tree (string) extraction definition atom       facts representing a web page, but directly over the parse
and possibly some tree (string) conditions and general con-         tree of the web page. Applying a program to an HTML
ditions. The numbers a and b are optional and serve as

                                                                   one item description, too, which used both. Hence, the pat-

                                                                   tern had to be refined based upon the knowledge of this
 Items for Sale
 3 items found for "Notebooks".                       second page to match 100% of the patterns of all example
          Showing Item 1 to 3.                       pages. For the CIA Factbook, the user chose a bad example
                                                            page with only one bordering country. Even after improv-
    
56 K Modem PCMCIA Card for ing the wrapper to deal with comma-separated countries, Notebooks
Albania had to be treated in a special way. The wrapper for $ 20 DBLP relies on a number of intermediate auxiliary patterns, indicated by the high nesting depth of the document. For Angie the CNN pages of the US election results per state, a wrap- (01)-314 159 per just extracting names of president candidates and the received votes was written in a few minutes; due to a very
homogeneous structure, one example page was sufficient to [...] extract these data for all states. The Jobs Jobs Jobs site is the only example where the number of needed sample pages depends on the number of testpages due to a wide variety Figure 9: XML translation of lixto.html of structures for job offers. For the Perl Module List we are merely interested in writing a wrapper for a single web page. This list uses mainly preformatted text, hence the program page creates a set of hierarchically ordered tree regions and heavily relies on string extraction. In the current implemen- string sources (called a pattern instance base) by applying tation some auxiliary patterns are needed, and some clever all patterns of the program in their hierarchical order to constructions to obtain a 100% match for the five chosen this HTML document (and possibly to further HTML doc- patterns (module group, leaf patterns name, DSLI, descrip- uments if document filters are used). Each pattern produces tion, info). We conclude that almost all web pages can be a set of instances. Each pattern instance contains a reference visually wrapped with Lixto. Observe, that although we to its parent instance. As patterns are ordered in a strictly chose a rather structured file for illustrating Lixto through- hierarchical way, the program is hierarchically stratified. In out the paper, our approach also works on pages with less the final section we will relax the definition of patterns to structure such as e.g. the CIA Factbook. For none of the test create recursive programs. pages the user had to modify the Elog program manually. As example program consider a wrapper for eBay pages Wrapper construction is usually very fast. The program (Fig. 8). On eBay pages, every offered item is stored in length measured in used predicates is never unreasonably its own table extracted by ; further patterns are large compared to the output patterns (ranging from 1.78 all defined within such a record. The pattern uses to 4.4). The user never had to consider more than three a concept attribute, namely isCurrency – which matches example pages to get a 100% match for all testpages. strings like $, DM, Euro, etc. The pattern uses a reference to the pattern. The final two patterns are string patterns. 6. RELATED WORK First, we give an overview of approaches less related to Lixto because they do not provide visual support. Stand- 5. TESTING THE LIXTO TOOL alone wrapper programming languages include Florid [19] We chose twelve example sites (Table 1), some of which (using a logic-programming formalism), Pillow [7] (an HTML were already used for testing purposes by other wrapper and XML programming library for logic programming sys- generators. Several users of whom not all are familiar with tems), Jedi [14] (using attributed grammars), Tsimmis and details of HTML contributed to our test results. Initially, we Araneus. In Tsimmis [12], the extraction process is based on asked them to create a wrapper based on a single example a procedural program which skips to the required informa- page. Table 2 summarises answers to the following ques- tion, allows temporary storages, split and case statements, tions: (1) Is it possible to wrap this page with Lixto? (2) and to follow links. However, the wrapper output has to How “complex” is the constructed program for this site? (ra- obey the document structure. In Araneus [3], a user can cre- tio of required predicates to used output patterns) (3) What ate relational views from web pages by computationally fast is the percentage of correctly wrapped pattern instances of and advanced text extracting and restructuring formalisms, a number of randomly chosen similarly structured testpages in particular using procedural “Cut and Paste” exception with a wrapper written on one example page only. (4) How handling inside regular grammars. In general, all manual many example pages are necessary (due to structural de- wrapper generation languages are difficult to use by layper- viations) to get 100 percent of correctly matched pattern sons. instances? (5) Moreover, we specify the time needed for Machine learning approaches rely on learning from exam- constructing the initial wrapper based on one example page. ples and counterexamples of a large number of web pages. Additionally, the time for constructing one output pattern Stalker [22] specialises general SkipTo sequence patterns is computed to gain a measure how much “thinking time” based on labelled HTML pages. An approach to maximise was required for each output pattern. (6) In the last row specific patterns is introduced by Davulcu et al. [10]. Other the depth of the pattern tree is specified. examples include Softmealy [13] (using finite-state trans- Let us describe some more details: On eBay, the initial ducers) and MIA [27] (prolog-based wrappers using anti- wrapper worked well on almost all test pages like queries on unification; neural networks to generalise and learn texts). cars, football, etc. However, one filter rule of required NoDoSe ([2]) extracts information from plain string sources that dates must contain a colon and a dash. This matched and provides a user interface for example labelling. It has Name Website Used Example Page Testpages Amazon http://www.amazon.com/ Lord of the Rings 10 CIA Factbook www.odci.gov/cia/publications/factbook/ United Kingdom 12 Cinemachine www.cinemachine.com/ The World is not enough 15 DBLP www.informatik.uni-trier.de/~ley/db/ Michael Ley 10 Election Res. / State www.cnn.com/ELECTION/2000/results/ Alabama 50 eBay www.ebay.com/ query on Notebooks 20 Excite Weather www.excite.com/weather/forecast London (UK) 12 Jobs-Jobs-Jobs www.jobsjobsjobs.com/ 23370 10 Perl Module List www.cpan.org/modules/00modlist.long.html single huge page ex.pg. Travelnotes www.travelnotes.org/ query on Istanbul 10 Yahoo People Email people.yahoo.com/ query on Mayer 15 Yahoo Weather weather.yahoo.com/ Paris 15 Table 1: Some of the test-sites used for Lixto Name wrapable? Complexity Correct for 100% Time/Pattern (mins) Depth Amazon yes 16/9 = 1.78 95% 3 22/9 = 2.44 4 CIA Factbook yes 17/5 = 3.4 80% 3 18/5 = 3.6 3 Cinemachine yes 6/4 = 1.5 100% 1 16/4 = 4 2 DBLP yes 27/9 = 3 90% 2 54/9 = 6 8 Election Results / State yes 4/2 = 2 100% 1 6/2 = 3 2 eBay yes 19/8 = 2.38 99.9% 2 21/8 = 2.63 4 Excite Weather yes 22/7 = 3.14 100% 1 30/7 = 4.3 3 Jobs Jobs Jobs yes 21/12 = 1.75 90% 3 40/12 = 3.3 2 Perl Module List yes 22/5 = 4.4 (100 % ) (1) 60/5 = 14 6 Travelnotes yes 11/4 = 2.75 95% 2 20/4 = 5 2 Yahoo People Email yes 10/3 = 3.3 100% 1 24/3 = 8 3 Yahoo Weather yes 22/10 = 2.2 100% 1 12/10 = 1.2 3 Table 2: Evaluation of wrapper generation restricted capabilities to deal with HTML. Kushmerick et al. generation tools require manual postprocessing and do not [16] create robust wrappers based on predefined extractors; offer the browser-displayed document for labelling. their visual support tool WIEN receives a set of training pages, where the user can label relevant information and 7. CURRENT/FUTURE WORK the system tries to learn a wrapper. Their approach does not use HTML parse trees. Kushmerick also contributed It is currently already possible to write and execute Elog to the wrapper verification problem [15], an issue worth to programs that can crawl to other pages, i.e. follow links dur- explore w.r.t. Elog, too. In general, drawbacks of machine- ing extraction, and can recursively wrap linked sequences of learning approaches are limited expressive power and the web pages. For such applications, the pattern structure does large number of required example pages. no longer form a tree because filters of one pattern definition Supervised interactive wrapper generation tools include may refer to different parent patterns (in a similar fashion W4F [23] and XWrap [18]. W4F uses an SQL-like query as recursive data types). For example, recursive Elog pro- language called HEL. Parts of the query can be generated grams may follow a “next” button and navigate to further using a visual extraction wizard which is limited to return- pages during extracting, while extracting instances of the ing the full DOM tree path of an element. However, the full same patterns. See Fig. 10 for extending the eBay exam- query must be programmed by the user manually. Hence, ple of Fig. 8 to follow a “next” button, and extract for W4F requires expertise with both HEL and HTML. HEL each page the same kind of information. In this example, requires tricky use of index variables and fork constructs the pattern has an initial filter which uses the to correctly describe a complex pattern structure. XWrap user-provided page ($1), and an additional filter, which uses uses a procedural rule system and provides limited expres- as parent pattern (whose instances are strings sive power for pattern definition. The user cannot label representing URLs). Web crawling and recursion in Lixto regions in documents as flexible as in Lixto. XWrap lacks is described in more detail in [4]. Currently we are extend- visual facilities for imposing external or internal conditions ing the interactive pattern builder to cover these aspects. to a pattern, but instead is rather template-based. The Furthermore, a server-based Lixto version is currently being division into two description levels and the automatic hier- implemented – it uses simple web interfaces and works in the archical structure extractor severely limit the ways to define user’s favourite browser. Future work focuses on automation extraction patterns (e.g. it is impossible to describe pattern heuristics for optional use, including to work on multiple ex- disjunctions). Hence, in general, other supervised wrapper ample targets at once. Additionally, Lixto wrappers will be embedded into a personalisable information channel system. next(S, X) ← document( , S), subelem(S, (?.content, [(a, , substr), (elementtext, Next, exact)]), X) nexturl(S, X) ← next( , S), subatt(S, href, X) document(S, X) ← getDocument($1, X) document(S, X) ← nexturl( , S), getDocument(S, X) Figure 10: Recursive Extension of the Elog program of Figure 8 8. REFERENCES wrapper construction system for internet information. [1] S. Abiteboul, P. Buneman, and D. Suciu. Data on the In Proc. ICDE, 2000. Web - From Relations to Semistructured Data and [19] W. May, R. Himmeröder, G. Lausen, and XML. Morgan Kaufmann, 2000. B. Ludäscher. A unified framework for wrapping, [2] B. Adelberg. NoDoSE - a tool for semi-automatically mediating and restructuring information from the extracting semi-structured data from text documents. web. In WWWCM. Sprg. LNCS 1727, 1999. In Proc. SIGMOD, 1998. [20] B. McLaughlin and J. Hunter. jdom.org Package. [3] P. Atzeni and G. Mecca. Cut and paste. In Proc. http://www.jdom.org/. PODS, 1997. [21] E. T. Mueller. Natural language processing with [4] R. Baumgartner, S. Flesca, and G. Gottlob. ThoughtTreasure. Signiform, 1998. Declarative information extraction, web crawling and [22] I. Muslea, S. Minton, and C. Knoblock. A hierarchical recursive wrapping with Lixto. Proc. LPNMR, 2001. approach to wrapper induction. In Proc. 3rd Intern. [5] R. Baumgartner, S. Flesca, and G. Gottlob. Conf. on Autonomous Agents, 1999. Supervised wrapper generation with Lixto. Proc. [23] A. Sahuguet and F. Azavant. Building light-weight VLDB Demo, 2001. wrappers for legacy web data-sources using W4F. In [6] R. Baumgartner, S. Flesca, and G. Gottlob. Visual Proc. VLDB, 1999. web information extraction with Lixto. In Proc. [24] D.F. Savarese. OROmatcher - Regular Expressions for VLDB, 2001. Java. http://www.savarese.org/oro/. [7] D. Cabeza and M. Hermenegildo. Distributed WWW [25] Starlab. http://www.starlab.vub.ac.be/research/ programming using (Ciao-)Prolog and the PiLLoW dogma/ontologyserver.htm. library. TPLP, 1(3), 2001. [26] M. Stonebraker and J. Hellerstein. Content integration [8] S. Ceri, S. Comai, E. Damiani, P. Fraternali, for e-business. In Proc. Sigmod, 2001. S. Paraboschi, and L. Tanca. XML-GL: a graphical [27] B. Thomas. Anti-unification based learning of query language for querying and restructuring XML T-wrappers for information extraction. In Workshop documents. In Proc. WWW Conf., 1999. on Machine Learning for IE, 1999. [9] D. Chamberlin and al. (Eds.). XQuery: A query language for XML. http://www.w3.org, 2001. [10] H. Davulcu, G. Yang, M. Kifer, and I.V. Ramakrishnan. Computat. aspects of resilient data extract. from semistr. sources. In Proc. PODS, 2000. [11] D. Florescu, A. Deutsch, A. Levy, D. Suciu, and M. Fernández. A query language for XML. In Proc. 8th Intern. WWW Conference, 1999. [12] J. Hammer, H. Garcia-Molina, J. Cho, R. Aranha, and A. Crespo. Extracting semistructured information from the web. In Proc. Workshop on Mang. of Semistructured Data, 1997. [13] C-N. Hsu and M.T. Dung. Generating finite-state transducers for semistructured data extraction from the web. Information Syst., 23/8, 1998. [14] G. Huck, P. Fankhauser, K. Aberer, and E.J. Neuhold. JEDI: Extracting and synthesizing information from the web. In Proc. COOPIS, IEEE CS Press, 1998. [15] N. Kushmerick. Wrapper verification. World Wide Web Journal, 2000. [16] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proc. IJCAI, 1997. [17] A.Y. Levy and D.S. Weld. Intelligent internet systems. Artificial Intelligence, 118(1-2), 2000. [18] L. Liu, C. Pu, and W. Han. XWrap: An extensible