=Paper=
{{Paper
|id=Vol-99/paper-11
|storemode=property
|title=Large Scale Acquisition and Maintenance from the Web without Source Access
|pdfUrl=https://ceur-ws.org/Vol-99/Thomas_Leonard-et-al.pdf
|volume=Vol-99
|dblpUrl=https://dblp.org/rec/conf/kcap/LeonardG01
}}
==Large Scale Acquisition and Maintenance from the Web without Source Access==
Large scale acquisition and maintenance from the web without source access
Thomas Leonard (tal00r@ecs.soton.ac.uk), Hugh Glaser (hg@ecs.soton.ac.uk)
University of Southampton, Southampton SO17 1BJ, UK
ABSTRACT
Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically.

Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback.

The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation.

In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.

Keywords
Knowledge acquisition tools, Programming by demonstration systems, Programming by example, Visual languages, XML editors

INTRODUCTION
Recent interest in the semantic web, Tim Berners-Lee and others’ vision to make web pages’ inherent knowledge directly accessible to machines, has produced a desire for knowledge extraction systems which work on existing web pages.

While, in the longer term, Natural Language Processing (NLP) tools of great complexity are needed, these tools do not yet exist. In the medium term, or when a high level of confidence in the accuracy of the results is required, a more ‘programmed’ approach can be used.

There are a number of tools available for specifying the automatic extraction of knowledge from web pages, but they usually require the user to enter complex query commands.

For example, Web-OEM allows HyperText Markup Language[7] (HTML) documents to be queried like a relational database, using Structured Query Language (SQL) syntax. It also provides a mechanism to create Extensible Markup Language[6] (XML) files from the results by specifying a template, as in this example (taken from [2]):

    CONSTRUCTx1.text x2.text
    FROM Page:p, Table:t, Text:x1, Text:x2
    WHERE p.title="My home page" AND
    t IN p.structures.* AND
    x1=t.row[0].elements[1] AND
    x2=t.row[2].elements[1]
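
To make the query concrete, the following is a rough Python sketch of what it appears to express: from a page titled "My home page", take the second cell of the first and third rows of each table. It is not Web-OEM itself. The sample page is invented, the input is assumed to be well-formed XHTML so that the standard library parser can read it, and reading `row[i].elements[j]` as "cell j of row i" is an assumption.

```python
import xml.etree.ElementTree as ET

# Invented sample page (well-formed XHTML, an assumption of this sketch).
PAGE = """
<html>
  <head><title>My home page</title></head>
  <body>
    <table>
      <tr><td>Name:</td><td>Alice</td></tr>
      <tr><td>Room:</td><td>101</td></tr>
      <tr><td>Phone:</td><td>5550100</td></tr>
    </table>
  </body>
</html>
"""


def query(page_root):
    """Return (x1, x2) text pairs for every table in a matching page."""
    # WHERE p.title="My home page"
    if page_root.findtext("head/title") != "My home page":
        return []
    results = []
    # t IN p.structures.* (interpreted here as: any table in the page)
    for table in page_root.iter("table"):
        rows = table.findall("tr")
        if len(rows) < 3:
            continue  # the query needs row[0] and row[2]
        first_cells = rows[0].findall("td")
        third_cells = rows[2].findall("td")
        if len(first_cells) > 1 and len(third_cells) > 1:
            # x1=t.row[0].elements[1], x2=t.row[2].elements[1]
            results.append((first_cells[1].text, third_cells[1].text))
    return results


pairs = query(ET.fromstring(PAGE))
print(pairs)
```

Either way, the user has to know the page structure in advance and express it as indices into rows and cells.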
In this paper, we describe a visual tool which can perform such tasks easily using direct manipulation, while still allowing the operation to be replayed later.

In particular, we will show how knowledge may be extracted from an entire site, and how that knowledge can be kept up-to-date.
DOME
Dome is a visual language which focuses on manipulation of tree-structured data. This makes it ideal for processing XML and HTML documents.

The program may be used simply as an editor, and supports the familiar editor operations such as cut, copy, and paste. Once the editing of documents using these direct manipulation operations is mastered, the user may easily string operations together to form programs.
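
As a rough illustration of stringing recorded operations together, the sketch below models a "program" as a list of tree-editing steps that are performed once and can then be replayed on another document. It uses Python's standard `xml.etree.ElementTree`. The `Recorder` class, the two edit functions, and the sample pages are hypothetical and do not reflect Dome's actual implementation.

```python
import xml.etree.ElementTree as ET


class Recorder:
    """Hypothetical recorder: performs each edit and remembers it for replay."""

    def __init__(self):
        self.steps = []  # list of (description, function) pairs

    def do(self, description, func, root):
        """Apply one edit to the tree now and record it for later playback."""
        func(root)
        self.steps.append((description, func))

    def replay(self, root):
        """Re-run every recorded edit, in order, on another document."""
        for _description, func in self.steps:
            func(root)
        return root


def copy_heading_into_name(root):
    # Copy the text of the first <h1> into a new <Name> element at the top.
    name = ET.Element("Name")
    name.text = root.find(".//h1").text
    root.insert(0, name)


def delete_the_rest(root):
    # Remove everything except the extracted <Name> element.
    for child in list(root):
        if child.tag != "Name":
            root.remove(child)


first = ET.fromstring("<html><body><h1>Widget</h1><p>blurb</p></body></html>")
second = ET.fromstring("<html><body><h1>Gadget</h1><p>other</p></body></html>")

program = Recorder()
program.do("copy heading into Name", copy_heading_into_name, first)
program.do("delete the rest", delete_the_rest, first)

# The same recorded sequence is replayed, unchanged, on a similar page.
program.replay(second)
first_xml = ET.tostring(first, encoding="unicode")
second_xml = ET.tostring(second, encoding="unicode")
print(first_xml)
print(second_xml)
```

Storing the steps as data rather than as a fixed script is what allows a sequence to be inspected, corrected, and applied to other pages with a similar layout.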
The main window is divided into three parts (see figure 1):
The Document: The main area, on the right, shows the data that the user is editing. In our case, this is the HTML of the web page, showing its tree structure. The layout should be readable to anyone who knows HTML. A vertical line represents a sibling relationship between nodes, while a diagonal line indicates a parent–child relationship.

The single exception to this rule is the ‘TR’ element, which is used to create a row of cells in HTML. Dome lays out the child nodes of a TR element horizontally to save space and to make it look closer to the way it appears in a browser.

The Programs List: Each sequence of operations that the user has recorded is displayed in the top-left corner of the screen. The programs can be organised into a hierarchy if there are a large number of them.

The tree of collapsible nodes behaves like the directory list in Microsoft’s Explorer program.

The Program Display: The operations of the currently selected program are displayed below the program list. This is a control flow diagram: control normally passes downwards along the dark lines. The fainter diagonal lines are used when execution of an operation fails for some reason. A dotted line (as seen in figure 2) indicates a breakpoint, where execution stops to allow the user to examine the state of the system.

The user can also use this area to correct mistakes in recordings and to record alternative cases.

Figure 1: Dome’s main window, showing a page from our department’s site. The darker nodes in the document (HEAD, TABLE and P) are collapsed nodes; this feature allows unimportant elements to be hidden, reducing screen clutter.

Figure 2: Dome in the middle of processing a research group’s site. A page containing links to all members of the group has been automatically fetched and the links extracted from it. Each link in turn is replaced by the knowledge extracted from the linked page. The program displayed is the ‘enter–fetch–process–leave’ program; it was stopped for the screenshot by setting a breakpoint (the dotted line) while the program was running.

The most important operation for our purposes is that of selecting a piece of information. There are three common ways of selecting a node in the document:

1. A structured relative move is performed for any node clicked on. Dome records the operation as an XPath[5] which will select that node relative to the current node. For example, “Move to the first cell of the next row.”
2. A non-structured text search, for example “Find the word ‘Name:’ anywhere in the page.”
3. A structured search which also requires a literal match of the text of the node clicked, for example “Move to the first cell of the next row, which must contain the text ‘Name:’.” This is done using a vendor extension of the XPath syntax.

Although all three methods may be used to select the same node, choosing the correct method is crucial to making the operation replayable.

The first is the easiest and is quite tolerant of changes to document structure. It is sufficient for many purposes, especially if the document’s structure is unlikely to change.

Either of the other two may be performed first to make the search more reliable or more strict. Consider a table row containing two cells: the literal string “Name:” and the name itself. By using method 3 to select the literal string and then using method 1 to select the name itself, the recorded sequence will not be fooled by a table with a new first row; it will fail with an error instead of selecting the wrong node.

By contrast, using the second method to search for the string “Name:” and then using method 1 to select the actual name will still work correctly even if a new row is added. However,
it is also more susceptible to selecting the wrong node altogether if “Name:” appears somewhere else in the document.

PROCESSING ONE PAGE
In a typical editing session, the user will load a web page from the site of interest into Dome. Then, for each piece of information that needs to be extracted, they will record a program to extract that information.

For example, if the aim is to collect product details, the user may create a program called ‘Name’ by performing whatever actions are required to extract the product’s name. This is often as simple as scrolling down to find the name, selecting it, and then using copy and paste to bring it to the top of the document, perhaps placing it under a new element node called ‘Name’.

The process will then be repeated to create programs called ‘Price’, ‘Order code’ and so on. Once all the data have been collected, the rest of the document is deleted, leaving a neat XML record to be saved out.

Although it is possible to record all the actions in a single program, we find that it is easier to cope with errors (such as a product with no order code) if each piece of information is extracted separately.

To extract information from a similar page, the user may load the page in and click on each program in turn to run it. Once confident with the function of each program, the user will normally start recording a new program and then click on each of the previous programs in turn to create a master program that processes a whole page in one go.

PROCESSING A WHOLE SITE
When processing a whole site, two extra features of Dome are useful:

• Dome includes facilities to fetch a page referenced by a Uniform Resource Identifier (URI) in a document. It does this by replacing the anchor element node (A, for example) with the contents of the page fetched.
• Dome allows a subnode in the document to be treated, temporarily, as the document root (called ‘entering’ the node). ‘Leaving’ the node returns to the previous root node.

To process an entire site, the following steps are typically used:
1. Load a page which references all the subpages to be harvested.
2. Record a program which extracts all the relevant ‘anchor’ nodes from that document.
3. Select the first node and record a program which enters the node, fetches the HTML document, runs the program which processes one page, and then leaves the node. This has the effect of replacing the reference to the page with the information extracted from the page.
4. Select the remaining nodes and ‘map’ the previous program (Dome will run the enter–fetch–process–leave program on each of the selected nodes).

This generates an XML document which is a list of pages and their extracted information.

When each subpage is fetched, Dome records the URI it used by adding a ‘uri’ attribute to the new element. This is done mainly to allow relative URIs within the fetched document to be resolved, but for our purposes it means that each record in the XML file can be used just like the original anchor: we can rerun the ‘map’ operation, without any modifications, to update every record.

This is useful if extracting the anchor nodes had to be done manually. If processing the index document is trivial then it is, of course, better to run the whole thing again from the start to cope with newly added or removed pages.

ROBUSTNESS
It may be that, while processing a site, Dome hits a page which has a structure different from that expected. For example, a product may have no order code (perhaps because it is out of stock).

In making the system more robust to changes in the structure of the document there are two points to consider:

• Making sure that any significant change is detected and reported to the user. The system should not simply generate incorrect output. This is best achieved using a structured-literal search, as discussed previously.
• Handling structural changes when they are detected. In this case, the program will fail and execution stops at the point of failure. Dome displays the steps of the program that failed and asks the user if they would like to record a ‘failure case’. The user agrees and proceeds to take the required actions (perhaps by selecting the ‘out-of-stock’ text, instead of the missing order code element, and bringing that to the top).

In this way, the user builds up a list of exceptions which allow Dome to process the entire site.

EXPORTING THE RESULTS
Dome can be used to export the results in a variety of formats. If some format other than plain XML is required, another program may be used to convert to that format (still using Dome). More usefully, several programs may be employed to export the same knowledge in a variety of formats.

For example, it is very easy to convert a list of XML records into an HTML table. Add the required HTML elements (HTML, HEAD, BODY, etc.) and then use Dome’s save-as-HTML feature to create a document ready to publish on the web.

For use in knowledge systems, records may be converted to Resource Description Framework[4] (RDF) format, as shown in figure 3, perhaps using a semantic vocabulary such as Dublin Core[1].

Figure 3: The extracted knowledge, converted to RDF.
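
The conversion of a list of XML records into an HTML table can also be sketched outside Dome. The example below is a minimal illustration using Python's standard `xml.etree.ElementTree`, not Dome's save-as-HTML feature. The record layout (a `records` root whose `record` children carry a `uri` attribute and `Name` and `Price` children), the sample URIs, and the function name are assumptions made for illustration, since the paper does not give the exact element names.

```python
import xml.etree.ElementTree as ET

# Assumed record layout; element names and URIs are invented for this sketch.
RECORDS_XML = """
<records>
  <record uri="http://example.org/p/1"><Name>Widget</Name><Price>4.99</Price></record>
  <record uri="http://example.org/p/2"><Name>Gadget</Name><Price>7.50</Price></record>
</records>
"""


def records_to_html_table(records_root, fields):
    """Build an HTML document containing one table row per record."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = "Extracted records"
    body = ET.SubElement(html, "body")
    table = ET.SubElement(body, "table")

    # Header row: one column per requested field.
    header = ET.SubElement(table, "tr")
    for field in fields:
        ET.SubElement(header, "th").text = field

    # One row per record; a missing field becomes an empty cell.
    for record in records_root.findall("record"):
        row = ET.SubElement(table, "tr")
        for field in fields:
            ET.SubElement(row, "td").text = record.findtext(field, default="")
    return html


root = ET.fromstring(RECORDS_XML)
page = records_to_html_table(root, ["Name", "Price"])
html_text = ET.tostring(page, encoding="unicode", method="html")
print(html_text)
```

Because the records are kept as plain XML, a second function of the same shape could emit another format from the same data without touching the extraction step.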
CURRENT STATUS
Dome is a research prototype, currently implemented in the Python[3] programming language, on Linux. It uses the GTK[8] toolkit for the user interface, and the 4Suite XML tools[9].

It has already been successfully used to extract information about researchers from a number of UK sites. The examples in this document are taken from the web site of one of our department’s groups. As a rough speed guide, extracting personal details from the 122 individual web pages linked from the group’s ‘Complete List of People’ page takes around 20 minutes on a typical desktop system.

Much of this time is spent in network communication and in importing the HTML, which is done in two stages. The HTML is first piped through the World Wide Web Consortium’s ‘Tidy’ program to correct broken HTML, then the result is parsed using the 4Suite tools.

CONCLUSIONS AND FUTURE WORK
In this paper we have shown how Dome may be used to extract information from web pages into appropriately formatted XML documents. We have seen how to process many pages automatically and we have looked at ways of making the extraction process robust to changes in document structure.

There are several other areas where parsing structured web pages is useful. Metasearchers search the web by querying many other search engines and combining the results, but since they may have to perform millions of searches a day, speed requirements dictate the use of hand-coded parsers. However, Dome is well-suited to tasks such as creating a news roundup by taking headlines from a number of other sites, as this only needs to be done every few minutes.

Some aspects of Dome may be improved. For example, there is potential for a considerable speed increase if web pages could be retrieved in parallel with processing operations.

Object-oriented features may be added at some point, so that ‘programs’ become ‘methods’ that work on a class hierarchy of element tags. While this is not immediately useful for HTML, it will improve Dome’s ability to handle the structured XML records produced from the HTML.

Even in its current state, we feel that Dome is already a useful tool for anyone wishing to process web pages in a structured and repeatable way.

REFERENCES
[1] The Dublin Core Metadata Initiative. Available at http://dublincore.org/.
[2] Iocchi, Luca. The Web-OEM approach to Web information extraction. Journal of Network and Computer Applications (1999) 22, 259–269.
[3] The Python programming language. Available at http://www.python.org/.
[4] The World Wide Web Consortium. Resource Description Framework. Available at http://www.w3.org/RDF/.
[5] The World Wide Web Consortium. XML Path Language (XPath). Available at http://www.w3c.org/TR/xpath.
[6] The World Wide Web Consortium. Extensible Markup Language (XML). Available at http://www.w3.org/XML/.
[7] The World Wide Web Consortium. HyperText Markup Language. Available at http://www.w3.org/MarkUp/.
[8] The GIMP Toolkit. Available at http://www.gtk.org/.
[9] Fourthought, Inc. Open source XML processing tools. Available at http://4Suite.org/.