Large scale acquisition and maintenance from the web without source access

Thomas Leonard (tal00r@ecs.soton.ac.uk)
Hugh Glaser (hg@ecs.soton.ac.uk)
University of Southampton
Southampton SO17 1BJ UK


ABSTRACT
Although different web sites structure their pages differently, the pages
within a single site are often generated from a database and have a regular
layout from which it is possible to extract information automatically.

Dome is a visual tool for manipulating tree-structured documents. It can
import and export in XML or HTML formats, making it ideal for harvesting
information from web pages. Editing is performed using a direct manipulation
interface and the operations are recorded for later playback.

The knowledge extracted from a web page may be updated by replaying the
recorded sequence when the source page changes. The same sequence can be
applied to other pages with a similar format, and facilities are provided to
batch process a large collection of pages in one operation.

In this paper we describe how Dome may be used to extract knowledge from web
sites in such a way that the extraction process may be reliably replayed.

Keywords
Knowledge acquisition tools, Programming by demonstration systems,
Programming by example, Visual languages, XML editors

INTRODUCTION
Recent interest in the semantic web, Tim Berners-Lee and others’ vision to
make web pages’ inherent knowledge directly accessible to machines, has
produced a desire for knowledge extraction systems which work on existing web
pages.

While, in the longer term, Natural Language Processing (NLP) tools of great
complexity are needed, these tools do not yet exist. In the medium term, or
when a high level of confidence in the accuracy of the results is required, a
more ‘programmed’ approach can be used.

There are a number of tools available for specifying the automatic extraction
of knowledge from web pages, but they usually require the user to enter
complex query commands.

For example, Web-OEM allows HyperText Markup Language[7] (HTML) documents to
be queried like a relational database, using Structured Query Language (SQL)
syntax. It also provides a mechanism to create Extensible Markup Language[6]
(XML) files from the results by specifying a template, as in this example
(taken from [2]):

    CONSTRUCT<EMAIL>x1.text</EMAIL><TEL>x2.text</TEL>
    FROM Page:p, Table:t, Text:x1, Text:x2
    WHERE p.title="My home page"    AND
          t IN p.structures.*     AND
          x1=t.row[0].elements[1] AND
          x2=t.row[2].elements[1]

In this paper, we describe a visual tool which can perform such tasks easily
using direct manipulation, while still allowing the operation to be replayed
later.

In particular, we will show how knowledge may be extracted from an entire
site, and how that knowledge can be kept up-to-date.
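
For comparison, the Web-OEM query quoted above can be approximated by a short
hand-written script. The following is a minimal Python sketch, not part of
Web-OEM or Dome: the sample page, the RECORD wrapper element and the function
name extract_contact are assumptions made for illustration, and the page is
assumed to be well-formed so that the standard library XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real HTML would first need tidying.
PAGE = """<html>
  <head><title>My home page</title></head>
  <body>
    <table>
      <tr><td>Email</td><td>someone@example.org</td></tr>
      <tr><td>Room</td><td>42</td></tr>
      <tr><td>Tel</td><td>+44 0000 000000</td></tr>
    </table>
  </body>
</html>"""


def extract_contact(page_source):
    """Return a RECORD element holding EMAIL and TEL, or None if no match."""
    root = ET.fromstring(page_source)
    # Corresponds to: WHERE p.title="My home page"
    if root.findtext("head/title") != "My home page":
        return None
    # Corresponds to: t IN p.structures.* (here: any table in the page)
    for table in root.iter("table"):
        rows = table.findall("tr")
        if len(rows) < 3:
            continue
        first, third = rows[0].findall("td"), rows[2].findall("td")
        if len(first) < 2 or len(third) < 2:
            continue
        # Corresponds to: x1=t.row[0].elements[1], x2=t.row[2].elements[1]
        record = ET.Element("RECORD")
        ET.SubElement(record, "EMAIL").text = first[1].text
        ET.SubElement(record, "TEL").text = third[1].text
        return record
    return None


record = extract_contact(PAGE)
print(ET.tostring(record, encoding="unicode"))
```

Like the query, this script addresses cells purely by position, so a table
that gains an extra first row would silently yield the wrong values.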

DOME
Dome is a visual language which focuses on manipulation of tree-structured
data. This makes it ideal for processing XML and HTML documents.

The program may be used simply as an editor, and supports the familiar editor
operations such as cut, copy, and paste. Once the editing of documents using
these direct manipulation operations is mastered, the user may easily string
operations together to form programs.

The main window is divided into three parts (see figure 1):
Figure 1: Dome’s main window, showing a page from our department’s site. The darker nodes in the
document (HEAD, TABLE and P) are collapsed nodes — this feature allows unimportant elements to be hidden,
reducing screen clutter.


The Document: The main area, on the right, shows the data that the user is
    editing. In our case, this is the HTML of the web page, showing its tree
    structure.
    The layout should be readable to anyone who knows HTML. A vertical line
    represents a sibling relationship between nodes, while a diagonal line
    indicates a parent–child relationship.
    The single exception to this rule is the ‘TR’ element, which is used to
    create a row of cells in HTML. Dome lays out the child nodes of a TR
    element horizontally to save space and to make it look closer to the way
    it appears in a browser.

The Programs List: Each sequence of operations that the user has recorded is
    displayed in the top-left corner of the screen. The programs can be
    organised into a hierarchy if there are a large number of them.
    The tree of collapsible nodes behaves like the directory list in
    Microsoft’s Explorer program.

The Program Display: The operations of the currently selected program are
    displayed below the program list.
    This is a control flow diagram: control normally passes downwards along
    the dark lines. The fainter diagonal lines are used when execution of an
    operation fails for some reason. A dotted line (as seen in figure 2)
    indicates a breakpoint, where execution stops to allow the user to
    examine the state of the system.
    The user can also use this area to correct mistakes in recordings and to
    record alternative cases.

The most important operation for our purposes is that of selecting a piece of
information. There are three common ways of selecting a node in the document:

  1. A structured relative move is performed for any node clicked on. Dome
     records the operation as an XPath[5] which will select that node
     relative to the current node. For example, “Move to the first cell of
     the next row.”

  2. A non-structured text search, for example “Find the word ‘Name:’
     anywhere in the page.”

  3. A structured search which also requires a literal match of the text of
     the node clicked, for example “Move to the first cell of the next row,
     which must contain the text ‘Name:’.” This is done using a vendor
     extension of the XPath syntax.

Although all three methods may be used to select the same node, choosing the
correct method is crucial to making the operation replayable.

The first is the easiest and is quite tolerant of changes to document
structure. It is sufficient for many purposes, especially if the document’s
structure is unlikely to change.

Either of the other two may be performed first to make the search more
reliable or more strict. Consider a table row containing two cells: the
literal string “Name:” and the name itself. By using method 3 to select the
literal string and then using method 1 to select the name itself, the
recorded sequence will not be fooled by a table with a new first row; it will
fail with an error instead of selecting the wrong node.

By contrast, using the second method to search for the string “Name:” and
then using method 1 to select the actual name will still work correctly even
if a new row is added. However, it is also more susceptible to selecting the
wrong node altogether if “Name:” appears somewhere else in the document.

Figure 2: Dome in the middle of processing a research group’s site. A page
containing links to all members of the group has been automatically fetched
and the links extracted from it. Each link in turn is replaced by the
knowledge extracted from the linked page. The program displayed is the
‘enter–fetch–process–leave’ program; it was stopped for the screenshot by
setting a breakpoint (the dotted line) while the program was running.

PROCESSING ONE PAGE
In a typical editing session, the user will load a web page from the site of
interest into Dome. Then, for each piece of information that needs to be
extracted, they will record a program to extract that information.

For example, if the aim is to collect product details, the user may create a
program called ‘Name’ by performing whatever actions are required to extract
the product’s name. This is often as simple as scrolling down to find the
name, selecting it, and then using copy and paste to bring it to the top of
the document, perhaps placing it under a new element node called ‘Name’.

The process will then be repeated to create programs called ‘Price’, ‘Order
code’ and so on. Once all the data have been collected, the rest of the
document is deleted, leaving a neat XML record to be saved out.

Although it is possible to record all the actions in a single program, we
find that it is easier to cope with errors (such as a product with no order
code) if each piece of information is extracted separately.

To extract information from a similar page, the user may load the page and
click on each program in turn to run it. Once confident with the function of
each program, the user will normally start recording a new program and then
click on each of the previous programs in turn to create a master program
that processes a whole page in one go.

PROCESSING A WHOLE SITE
When processing a whole site, two extra features of Dome are useful:

   • Dome includes facilities to fetch a page referenced by a Uniform
     Resource Identifier (URI) in a document. It does this by replacing the
     anchor element node (A, for example) with the contents of the page
     fetched.

   • Dome allows a subnode in the document to be treated, temporarily, as the
     document root (called ‘entering’ the node). ‘Leaving’ the node returns
     to the previous root node.

To process an entire site, the following steps are typically used:
  1. Load a page which references all the subpages to be harvested.

  2. Record a program which extracts all the relevant ‘anchor’ nodes from
     that document.

  3. Select the first node and record a program which enters the node,
     fetches the HTML document, runs the program which processes one page,
     and then leaves the node. This has the effect of replacing the reference
     to the page with the information extracted from the page.

  4. Select the remaining nodes and ‘map’ the previous program (Dome will run
     the enter–fetch–process–leave program on each of the selected nodes).

This generates an XML document which is a list of pages and their extracted
information.

When each subpage is fetched, Dome records the URI it used by adding a ‘uri’
attribute to the new element. This is done mainly to allow relative URIs
within the fetched document to be resolved, but for our purposes it means
that each record in the XML file can be used just like the original anchor;
that is, we can rerun the ‘map’ operation, without any modifications, to
update every record.

This is useful if extracting the anchor nodes had to be done manually. If
processing the index document is trivial then it is, of course, better to run
the whole thing again from the start to cope with newly added or removed
pages.

ROBUSTNESS
It may be that, while processing a site, Dome hits a page which has a
structure different from that expected. For example, a product may have no
order code (perhaps because it is out of stock).

In making the system more robust to changes in the structure of the document
there are two points to consider:

   • Making sure that any significant change is detected and reported to the
     user. The system should not simply generate incorrect output. This is
     best achieved using a structured-literal search, as discussed
     previously.

   • Handling structural changes when they are detected. In this case, the
     program fails and execution stops at the point of failure. Dome displays
     the steps of the program that failed and asks the user if they would
     like to record a ‘failure case’. The user agrees and proceeds to take
     the required actions (perhaps by selecting the ‘out-of-stock’ text,
     instead of the missing order code element, and bringing that to the
     top).

In this way, the user builds up a list of exceptions which allows Dome to
process the entire site.

EXPORTING THE RESULTS
Dome can be used to export the results in a variety of formats. If some
format other than plain XML is required, another program may be used to
convert to that format (still using Dome). More usefully, several programs
may be employed to export the same knowledge in a variety of formats.

For example, it is very easy to convert a list of XML records into an HTML
table. Add the required HTML elements (HTML, HEAD, BODY, etc.) and then use
Dome’s save-as-HTML feature to create a document ready to publish on the web.

For use in knowledge systems, records may be converted to Resource
Description Framework[4] (RDF) format, as shown in figure 3, perhaps using a
semantic vocabulary such as Dublin Core[1].

Figure 3: The extracted knowledge, converted to RDF.
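
The conversion of XML records into an HTML table, described above as a Dome
program, can also be illustrated outside Dome. The following is a minimal
Python sketch using only the standard library. It is not Dome’s save-as-HTML
feature: the record layout (a ‘products’ list whose ‘product’ elements carry
Name and Price children), the example URIs and the function name
records_to_html are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record list of the kind an extraction run might save out.
RECORDS = """<products>
  <product uri="http://example.org/p1.html">
    <Name>Widget</Name><Price>9.99</Price>
  </product>
  <product uri="http://example.org/p2.html">
    <Name>Gadget</Name><Price>19.99</Price>
  </product>
</products>"""


def records_to_html(records_xml, fields):
    """Build an HTML element tree with one table row per record."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = "Extracted records"
    table = ET.SubElement(ET.SubElement(html, "body"), "table")

    # Header row: one cell per requested field name.
    header = ET.SubElement(table, "tr")
    for field in fields:
        ET.SubElement(header, "th").text = field

    # One row per record; each direct child of the root is a record.
    for record in ET.fromstring(records_xml):
        row = ET.SubElement(table, "tr")
        for field in fields:
            # A missing field becomes an empty cell rather than an error.
            ET.SubElement(row, "td").text = record.findtext(field, default="")
    return html


page = records_to_html(RECORDS, ["Name", "Price"])
print(ET.tostring(page, encoding="unicode", method="html"))
```

Passing a different field list to the same function yields a different view
of the same records, which mirrors the idea of using several export programs
over one set of extracted knowledge.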
CURRENT STATUS
Dome is a research prototype, currently implemented in the Python[3]
programming language, on Linux. It uses the GTK[8] toolkit for the user
interface, and the 4Suite XML tools[9].

It has already been successfully used to extract information about
researchers from a number of UK sites. The examples in this document are
taken from the web site of one of our department’s groups. As a rough speed
guide, extracting personal details from the 122 individual web pages linked
from the group’s ‘Complete List of People’ page takes around 20 minutes on a
typical desktop system.

Much of this time is spent in network communication and in importing the
HTML, which is done in two stages. The HTML is first piped through the World
Wide Web Consortium’s ‘Tidy’ program to correct broken HTML, then the result
is parsed using the 4Suite tools.

CONCLUSIONS AND FUTURE WORK
In this paper we have shown how Dome may be used to extract information from
web pages into appropriately formatted XML documents. We have seen how to
process many pages automatically and we have looked at ways of making the
extraction process robust to changes in document structure.

There are several other areas where parsing structured web pages is useful.
Metasearchers search the web by querying many other search engines and
combining the results, but since they may have to perform millions of
searches a day, speed requirements dictate the use of hand-coded parsers.
However, Dome is well-suited to tasks such as creating a news roundup by
taking headlines from a number of other sites, as this only needs to be done
every few minutes.

Some aspects of Dome may be improved; for example, there is potential for a
considerable speed increase if web pages could be retrieved in parallel with
processing operations.

Object-oriented features may be added at some point, so that ‘programs’
become ‘methods’ that work on a class hierarchy of element tags. While this
is not immediately useful for HTML, it will improve Dome’s ability to handle
the structured XML records produced from the HTML.

Even in its current state, we feel that Dome is already a useful tool for
anyone wishing to process web pages in a structured and repeatable way.

REFERENCES
[1] The Dublin Core Metadata Initiative. Available at
    http://dublincore.org/.
[2] Iocchi, Luca. The Web-OEM approach to Web information extraction. Journal
    of Network and Computer Applications (1999) 22, 259–269.
[3] The Python programming language. Available at http://www.python.org/.
[4] The World Wide Web Consortium. Resource Description Framework. Available
    at http://www.w3.org/RDF/.
[5] The World Wide Web Consortium. XML Path Language (XPath). Available at
    http://www.w3c.org/TR/xpath.
[6] The World Wide Web Consortium. Extensible Markup Language (XML). Available
    at http://www.w3.org/XML/.
[7] The World Wide Web Consortium. HyperText Markup Language. Available at
    http://www.w3.org/MarkUp/.
[8] The GIMP Toolkit. Available at http://www.gtk.org/.
[9] Fourthought, Inc. Open source XML processing tools. Available at
    http://4Suite.org/.