Large scale acquisition and maintenance from the web without source access

Thomas Leonard (tal00r@ecs.soton.ac.uk)
Hugh Glaser (hg@ecs.soton.ac.uk)
University of Southampton
Southampton SO17 1BJ
UK

ABSTRACT

Although different web sites structure their pages differently, the pages within a single site are often generated from a database and have a regular layout from which it is possible to extract information automatically.

Dome is a visual tool for manipulating tree-structured documents. It can import and export in XML or HTML formats, making it ideal for harvesting information from web pages. Editing is performed using a direct manipulation interface and the operations are recorded for later playback.

The knowledge extracted from a web page may be updated by replaying the recorded sequence when the source page changes. The same sequence can be applied to other pages with a similar format, and facilities are provided to batch process a large collection of pages in one operation.

In this paper we describe how Dome may be used to extract knowledge from web sites in such a way that the extraction process may be reliably replayed.

Keywords

Knowledge acquisition tools, Programming by demonstration systems, Programming by example, Visual languages, XML editors

INTRODUCTION

Recent interest in the semantic web, Tim Berners-Lee and others' vision to make web pages' inherent knowledge directly accessible to machines, has produced a desire for knowledge extraction systems which work on existing web pages.

While, in the longer term, Natural Language Processing (NLP) tools of great complexity are needed, these tools do not yet exist. In the medium term, or when a high level of confidence in the accuracy of the results is required, a more 'programmed' approach can be used.

There are a number of tools available for specifying the automatic extraction of knowledge from web pages, but they usually require the user to enter complex query commands. For example, Web-OEM allows HyperText Markup Language[7] (HTML) documents to be queried like a relational database, using Structured Query Language (SQL) syntax. It also provides a mechanism to create Extensible Markup Language[6] (XML) files from the results by specifying a template, as in this example (taken from [2]):

    CONSTRUCTx1.textx2.text
    FROM Page:p, Table:t, Text:x1, Text:x2
    WHERE p.title="My home page"
      AND t IN p.structures.*
      AND x1=t.row[0].elements[1]
      AND x2=t.row[2].elements[1]

In this paper, we describe a visual tool which can perform such tasks easily using direct manipulation, while still allowing the operation to be replayed later. In particular, we will show how knowledge may be extracted from an entire site, and how that knowledge can be kept up-to-date.

DOME

Dome is a visual language which focuses on manipulation of tree-structured data. This makes it ideal for processing XML and HTML documents. The program may be used simply as an editor, and supports the familiar editor operations such as cut, copy, and paste.

Once the editing of documents using these direct manipulation operations is mastered, the user may easily string operations together to form programs.

The main window is divided into three parts (see figure 1):

Figure 1: Dome's main window, showing a page from our department's site. The darker nodes in the document (HEAD, TABLE and P) are collapsed nodes; this feature allows unimportant elements to be hidden, reducing screen clutter.

The Document: The main area, on the right, shows the data that the user is editing. In our case, this is the HTML of the web page, showing its tree structure. The layout should be readable to anyone who knows HTML. A vertical line represents a sibling relationship between nodes, while a diagonal line indicates a parent–child relationship.

The single exception to this rule is the 'TR' element, which is used to create a row of cells in HTML. Dome lays out the child nodes of a TR element horizontally to save space and to make it look closer to the way it appears in a browser.

The Programs List: Each sequence of operations that the user has recorded is displayed in the top-left corner of the screen. The programs can be organised into a hierarchy if there are a large number of them. The tree of collapsible nodes behaves like the directory list in Microsoft's Explorer program.

The Program Display: The operations of the currently selected program are displayed below the program list. This is a control flow diagram: control normally passes downwards along the dark lines. The fainter diagonal lines are used when execution of an operation fails for some reason. A dotted line (as seen in figure 2) indicates a breakpoint, where execution stops to allow the user to examine the state of the system.

The user can also use this area to correct mistakes in recordings and to record alternative cases.

The most important operation for our purposes is that of selecting a piece of information. There are three common ways of selecting a node in the document:

1. A structured relative move is performed for any node clicked on. Dome records the operation as an XPath[5] which will select that node relative to the current node. For example, "Move to the first cell of the next row."

2. A non-structured text search, for example "Find the word 'Name:' anywhere in the page."

3. A structured search which also requires a literal match of the text of the node clicked, for example "Move to the first cell of the next row, which must contain the text 'Name:'." This is done using a vendor extension of the XPath syntax.

Although all three methods may be used to select the same node, choosing the correct method is crucial to making the operation replayable.

The first is the easiest and is quite tolerant of changes to document structure. It is sufficient for many purposes, especially if the document's structure is unlikely to change.

Either of the other two may be performed first to make the search more reliable or more strict. Consider a table row containing two cells: the literal string "Name:" and the name itself. By using method 3 to select the literal string and then using method 1 to select the name itself, the recorded sequence will not be fooled by a table with a new first row; it will fail with an error instead of selecting the wrong node.

By contrast, using the second method to search for the string "Name:" and then using method 1 to select the actual name will still work correctly even if a new row is added. However, it is also more susceptible to selecting the wrong node altogether if "Name:" appears somewhere else in the document.

Figure 2: Dome in the middle of processing a research group's site. A page containing links to all members of the group has been automatically fetched and the links extracted from it. Each link in turn is replaced by the knowledge extracted from the linked page. The program displayed is the 'enter–fetch–process–leave' program; it was stopped for the screenshot by setting a breakpoint (the dotted line) while the program was running.

PROCESSING ONE PAGE

In a typical editing session, the user will load a web page from the site of interest into Dome. Then, for each piece of information that needs to be extracted, they will record a program to extract that information.

For example, if the aim is to collect product details, the user may create a program called 'Name' by performing whatever actions are required to extract the product's name. This is often as simple as scrolling down to find the name, selecting it, and then using copy and paste to bring it to the top of the document, perhaps placing it under a new element node called 'Name'.

The process will then be repeated to create programs called 'Price', 'Order code' and so on. Once all the data have been collected, the rest of the document is deleted, leaving a neat XML record to be saved out.

Although it is possible to record all the actions in a single program, we find that it is easier to cope with errors (such as a product with no order code) if each piece of information is extracted separately.

To extract information from a similar page, the user may load the page in and click on each program in turn to run it. Once confident with the function of each program, the user will normally start recording a new program and then click on each of the previous programs in turn to create a master program that processes a whole page in one go.

PROCESSING A WHOLE SITE

When processing a whole site, two extra features of Dome are useful:

• Dome includes facilities to fetch a page referenced by a Uniform Resource Identifier (URI) in a document. It does this by replacing the anchor element node (A, for example) with the contents of the page fetched.

• Dome allows a subnode in the document to be treated, temporarily, as the document root (called 'entering' the node). 'Leaving' the node returns to the previous root node.

To process an entire site, the following steps are typically used:

1. Load a page which references all the subpages to be harvested.

2. Record a program which extracts all the relevant 'anchor' nodes from that document.

3. Select the first node and record a program which enters the node, fetches the HTML document, runs the program which processes one page, and then leaves the node. This has the effect of replacing the reference to the page with the information extracted from the page.

4. Select the remaining nodes and 'map' the previous program (Dome will run the enter–fetch–process–leave program on each of the selected nodes).

This generates an XML document which is a list of pages and their extracted information.

When each subpage is fetched, Dome records the URI it used by adding a 'uri' attribute to the new element. This is done mainly to allow relative URIs within the fetched document to be resolved, but for our purposes it means that each record in the XML file can be used just like the original anchor; that is, we can rerun the 'map' operation, without any modifications, to update every record.

This is useful if extracting the anchor nodes had to be done manually. If processing the index document is trivial then it is, of course, better to run the whole thing again from the start to cope with newly added or removed pages.

ROBUSTNESS

It may be that, while processing a site, Dome hits a page which has a structure different from that expected. For example, a product may have no order code (perhaps because it is out of stock).

In making the system more robust to changes in the structure of the document there are two points to consider:

• Making sure that any significant change is detected and reported to the user. The system should not simply generate incorrect output. This is best achieved using a structured-literal search, as discussed previously.

• Handling structural changes when they are detected. In this case, the program fails and execution stops at the point of failure. Dome displays the steps of the program that failed and asks the user if they would like to record a 'failure case'. The user agrees and proceeds to take the required actions (perhaps by selecting the 'out-of-stock' text, instead of the missing order code element, and bringing that to the top).

In this way, the user builds up a list of exceptions which allow Dome to process the entire site.

EXPORTING THE RESULTS

Dome can be used to export the results in a variety of formats. If some format other than plain XML is required, another program may be used to convert to that format (still using Dome). More usefully, several programs may be employed to export the same knowledge in a variety of formats.

For example, it is very easy to convert a list of XML records into an HTML table. Add the required HTML elements (HTML, HEAD, BODY, etc.) and then use Dome's save-as-HTML feature to create a document ready to publish on the web.

For use in knowledge systems, records may be converted to Resource Description Framework[4] (RDF) format, as shown in figure 3, perhaps using a semantic vocabulary such as Dublin Core[1].

Figure 3: The extracted knowledge, converted to RDF.

CURRENT STATUS

Dome is a research prototype, currently implemented in the Python[3] programming language, on Linux. It uses the GTK[8] toolkit for the user interface, and the 4Suite XML tools[9].

It has already been successfully used to extract information about researchers from a number of UK sites. The examples in this document are taken from the web site of one of our department's groups. As a rough speed guide, extracting personal details from the 122 individual web pages linked from the group's 'Complete List of People' page takes around 20 minutes on a typical desktop system.

Much of this time is spent in network communication and in importing the HTML, which is done in two stages. The HTML is first piped through the World Wide Web Consortium's 'Tidy' program to correct broken HTML, then the result is parsed using the 4Suite tools.

CONCLUSIONS AND FUTURE WORK

In this paper we have shown how Dome may be used to extract information from web pages into appropriately formatted XML documents. We have seen how to process many pages automatically and we have looked at ways of making the extraction process robust to changes in document structure.

There are several other areas where parsing structured web pages is useful. Metasearchers search the web by querying many other search engines and combining the results, but since they may have to perform millions of searches a day, speed requirements dictate the use of hand-coded parsers. However, Dome is well-suited to tasks such as creating a news roundup by taking headlines from a number of other sites, as this only needs to be done every few minutes.

Some aspects of Dome may be improved; for example, there is potential for a considerable speed increase if web pages could be retrieved in parallel with processing operations.

Object-oriented features may be added at some point, so that 'programs' become 'methods' that work on a class hierarchy of element tags. While this is not immediately useful for HTML, it will improve Dome's ability to handle the structured XML records produced from the HTML.

Even in its current state, we feel that Dome is already a useful tool for anyone wishing to process web pages in a structured and repeatable way.

REFERENCES

[1] The Dublin Core Metadata Initiative. Available at http://dublincore.org/.
[2] Iocchi, Luca. The Web-OEM approach to Web information extraction. Journal of Network and Computer Applications (1999) 22, 259–269.
[3] The Python programming language. Available at http://www.python.org/.
[4] The World Wide Web Consortium. Resource Description Framework. Available at http://www.w3.org/RDF/.
[5] The World Wide Web Consortium. XML Path Language (XPath). Available at http://www.w3c.org/TR/xpath.
[6] The World Wide Web Consortium. Extensible Markup Language (XML). Available at http://www.w3.org/XML/.
[7] The World Wide Web Consortium. HyperText Markup Language. Available at http://www.w3.org/MarkUp/.
[8] The GIMP Toolkit. Available at http://www.gtk.org/.
[9] Fourthought, Inc. Open source XML processing tools. Available at http://4Suite.org/.
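
As an illustration of the trade-off between the three node selection methods described in the DOME section, the following is a minimal sketch written with Python's standard xml.etree.ElementTree module. It is not Dome's implementation and does not use Dome's vendor extension to XPath; the sample tables, the function names and the helper cell_text are all hypothetical, and the text search is simplified to an exact match on a cell's text. It contrasts a purely positional selection, a text search followed by a relative move, and a structured search that also insists on the literal label.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: a details table, and the same table after the
# site has added a new first row.
PAGE = """
<table>
  <tr><td>Name:</td><td>Ada Lovelace</td></tr>
  <tr><td>Phone:</td><td>1234</td></tr>
</table>
"""

CHANGED_PAGE = """
<table>
  <tr><td>Title:</td><td>Dr</td></tr>
  <tr><td>Name:</td><td>Ada Lovelace</td></tr>
  <tr><td>Phone:</td><td>1234</td></tr>
</table>
"""


def cell_text(cell):
    """Return the stripped text of a cell (empty string if it has none)."""
    return (cell.text or "").strip()


def select_structured(table):
    """Method 1 alone: second cell of the first row, by position only."""
    return table.find("./tr[1]/td[2]")


def select_by_text_search(table, label):
    """Method 2 then 1: find the label in any row, then step to the next cell."""
    for row in table.iter("tr"):
        cells = row.findall("td")
        for index, cell in enumerate(cells[:-1]):
            if cell_text(cell) == label:
                return cells[index + 1]
    raise LookupError(f"label {label!r} not found")


def select_structured_literal(table, label):
    """Method 3 then 1: the first cell of the first row must hold the label."""
    cells = table.findall("./tr[1]/td")
    if len(cells) < 2 or cell_text(cells[0]) != label:
        raise LookupError(f"expected {label!r} in the first cell of the first row")
    return cells[1]


if __name__ == "__main__":
    for title, source in (("original", PAGE), ("changed", CHANGED_PAGE)):
        table = ET.fromstring(source)
        print(title, "positional:", cell_text(select_structured(table)))
        print(title, "text search:", cell_text(select_by_text_search(table, "Name:")))
        try:
            print(title, "structured-literal:",
                  cell_text(select_structured_literal(table, "Name:")))
        except LookupError as error:
            print(title, "structured-literal failed:", error)
```

On the changed table the positional selection silently returns the wrong cell, the text search still finds the name, and the structured-literal selection raises an error, which corresponds to the point at which a recorded program would stop so that a failure case can be recorded.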