Search Computing Meets Data Extraction∗

Tim Furche, Giorgio Orsi
Oxford University, Department of Computer Science, Wolfson Building, Parks Road, Oxford OX1 3QD
firstname.lastname@cs.ox.ac.uk

Alessandro Bozzon, Chiara Pasini, Luca Tettamanti, Salvatore Vadacca
Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy
firstname.lastname@elet.polimi.it

∗The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 246858 (DIADEM) and the 2008 Call for "IDEAS Advanced Grants" as part of the Search Computing (SeCo) project.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This article was presented at Very Large Data Search (VLDS) 2011. Copyright 2011.

ABSTRACT

Thanks to the Web, access to an increasing wealth and variety of information has become near instantaneous. To make informed decisions, however, we often need to access data from many different sources and integrate different types of information. Manually collecting data from scores of web sites and combining that data remains a daunting task. The ERC projects SeCo (Search Computing) and DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology) address two aspects of this problem: SeCo supports complex search processes drawing on data from multiple domains, with a user interface capable of refining and exploring the search results. DIADEM aims to automatically extract structured data from a domain's websites. In this paper, we outline a first approach for integrating SeCo and DIADEM. We discuss how to use the DIADEM methodology to automatically turn nearly any website from a given domain into a SeCo search service. We describe how such services can be registered and exploited by the SeCo framework in combination with services from other domains (and possibly developed with other methodologies).

1. INTRODUCTION

Recent years witnessed a paradigmatic shift in the way people deal with information. The Web provides cheap and ubiquitous access to an increasing wealth and variety of data. Yet, making informed decisions, which often requires complex and articulated information retrieval tasks involving access to information from many different sources, remains a daunting task. Queries such as "Retrieve jobs as Java Developer in the Silicon Valley, nearby affordable fully-furnished flats, and close to good schools" are, unfortunately, not addressed by current search engines. From a vast list of potential sources, it is left to the user to manually extract and integrate the relevant data.

The Search Computing (SeCo) project [1] aims at building concepts, algorithms, tools, and technologies to support complex Web queries, through a new paradigm based on combining data extraction from distinct sources and data integration by means of specialized integration engines. Web data is typically published in two ways: as structured (and possibly linked) data accessible through Web APIs (e.g., SPARQL, YQL), and as unstructured resources (i.e., Web pages), possibly accessible only through user interaction such as form filling or link navigation.

Unstructured data is typically accessed through general-purpose search engines, which exploit traditional information retrieval techniques. To enable the consumption of such data by automated processes, data accessible to humans through existing Web interfaces needs to be transformed into structured information: hence the need for data extraction tools (e.g., screen scrapers). Unfortunately, the interactive nature of modern Web interfaces, driven by client- and server-side scripting, makes it hard for automated processes to access this information.

The DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology, diadem-project.info) project aims at developing domain-specific data extraction systems that take as input the URL of a Web site in a particular application domain, automatically explore the Web site, and deliver as output a structured data set containing all the relevant information present on that site. It is based on a novel, knowledge-driven approach that combines low-level annotations with high-level domain knowledge and sophisticated analysis rules encoding common Web design patterns. The first prototype for the UK real-estate domain outperforms existing data extraction tools and validates the premise that, with a thin layer of domain-specific knowledge, nearly perfect automated data extraction is feasible.

Once a web site is analyzed, the DIADEM engine can provide a one-time copy of all the data of that site, structured according to the provided schema. Alternatively, an extraction expression, formulated in OXPath [2], can be returned that extracts all the data on demand at high speed.
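The difference between the two delivery modes can be sketched as follows. This is our illustration, not DIADEM code: `fake_site_extract` merely stands in for evaluating a generated OXPath expression against a live site, and all names are hypothetical.

```python
from typing import Callable, Iterable

Record = dict  # one extracted record, e.g. {"price": 950, "bedrooms": 2}

def materialize(extract: Callable[[], Iterable[Record]]) -> list:
    """One-time copy: run the extraction once and keep the snapshot."""
    return list(extract())

def on_demand(extract: Callable[[], Iterable[Record]]) -> Callable[[], list]:
    """On-demand mode: keep the extraction step and re-run it per query."""
    return lambda: list(extract())

# Stand-in for evaluating a generated OXPath expression on a live site.
def fake_site_extract() -> Iterable[Record]:
    yield {"price": 950, "bedrooms": 2}
    yield {"price": 1200, "bedrooms": 3}

snapshot = materialize(fake_site_extract)  # fixed copy; may go stale
live = on_demand(fake_site_extract)        # fresh extraction at every call
```

The snapshot trades freshness for query speed, while the on-demand variant re-extracts at each call; Section 4 returns to this trade-off.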
1.1 Motivations and Outline

As users get acquainted with on-line search and decision support systems, their information needs evolve, their queries become more and more specific and complex, and their demand for correct and updated data increases. Whilst data extraction approaches such as DIADEM can greatly improve the quality of available information, the need arises for systems and tools able to holistically tackle the problem of complex queries, while enabling users to select, explore and combine data sources in a customized way. A tight integration of DIADEM and SeCo can provide an answer to this need by combining high-precision data extraction, multi-domain service integration, and exploratory search interaction [3]. We demonstrate how the data extraction facilities provided by DIADEM enable the data integration performed in SeCo to achieve novel, multi-domain search services over a large number of Web sites.

The paper is organized as follows: Section 2 describes the Search Computing approach to information integration, Section 3 presents the DIADEM approach to data extraction, Section 4 discusses integration issues, and Section 5 concludes the paper.

2. WEB DATA INTEGRATION WITH SEARCH COMPUTING

Figure 1: The Search Computing architecture

Search Computing aims at building new communities of users: content providers, who want to organize their content (now in the form of data collections, databases, and Web pages) in order to make it available for search access by third parties, and expert users, who want to offer new services built by composing domain-specific content in order to go "beyond" general-purpose search engines such as Google.

To obtain a Search Computing application, the general-purpose architecture of Figure 1 is customized with the help of tools targeted to programmers, expert users, and end users.

• Service Publishers register Service Mart definitions within the service repository, and declare the connection patterns usable to join them. The registration process is realized through a Service Registration Tool that 1) helps the publisher in the specification of the Service Mart (SM), Access Pattern (AP), and Service Interface (SI) attributes and parameters, and 2) hides from the publisher the internal APIs that allow communication between the service and engine levels. The service publishers are in charge of implementing mediators, wrappers, or data materialization components, so as to make data sources compatible with the Service Mart standard interface and expected behavior.

• Expert Users configure Search Computing applications by selecting the Service Marts of interest, by choosing a data source supporting each Service Mart, and by connecting them through connection patterns. They also configure the complexity of the user interface, in terms of the controls and configuration choices to be left to the end user.

• End Users use Search Computing applications configured by expert users. They interact by submitting queries, inspecting results, and refining/evolving their information need according to an exploratory information seeking approach, which we call Liquid Query [3].

Figure 1 shows an overview of the Search Computing framework, which comprises several sub-frameworks. The service description framework (SDF) provides the scaffolding for wrapping and registering data sources in service marts, describing the information sources at different levels of abstraction. The user framework provides functionality and storage for registering users, with different roles and capabilities. The query framework supports the management and storage of queries as first-class citizens: a query can be executed, saved, modified, and published for other users to see. The service invocation framework masks the technical issues involved in the interaction with the service mart, e.g., the Web service protocol and data caching issues.

The core of the framework aims at executing multi-domain queries. The query manager takes care of splitting the query into sub-queries (e.g., "Which jobs as Java developer are available in the Silicon Valley?", "Where are affordable, nearby flats?", "Where are good schools?") and binding them to the respective relevant data sources in the service mart repository; starting from this mapping, the query planner produces an optimized query execution plan, which dictates the sequence of steps for executing the query. Finally, the execution engine actually executes the query plan, by submitting the service calls to designated services through the service invocation framework, building the query results by combining the outputs produced by service calls, computing the global ranking of query results, and producing the query result outputs in an order that reflects their global relevance.
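As an illustration, the running example decomposes into three sub-queries whose results are joined and globally ranked. The sketch below is ours, not SeCo code; the data, the join attribute, and the scoring function are invented:

```python
# Illustrative decomposition of "jobs in Silicon Valley, nearby affordable
# flats, close to good schools" into per-domain sub-query results, joined
# on a shared "area" attribute and ranked by a (made-up) global score.

jobs    = [{"area": "Palo Alto", "title": "Java Developer", "score": 0.9},
           {"area": "San Jose",  "title": "Java Developer", "score": 0.7}]
flats   = [{"area": "Palo Alto", "rent": 1800, "score": 0.6},
           {"area": "San Jose",  "rent": 1400, "score": 0.8}]
schools = [{"area": "Palo Alto", "rating": "outstanding", "score": 0.9}]

def join_and_rank(*sources):
    """Combine sub-query results that agree on 'area'; rank by summed scores."""
    combos = []
    first, *rest = sources
    for item in first:
        match = [item]
        for src in rest:
            hit = next((x for x in src if x["area"] == item["area"]), None)
            if hit is None:
                break  # this combination cannot be completed
            match.append(hit)
        else:
            combos.append({"area": item["area"],
                           "global_score": sum(x["score"] for x in match)})
    return sorted(combos, key=lambda c: c["global_score"], reverse=True)

results = join_and_rank(jobs, flats, schools)
# only "Palo Alto" satisfies all three sub-queries
```

In SeCo the join attributes and the ranking function come from the registered connection patterns and access patterns rather than being hard-coded as here.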
3. AUTOMATIC DATA EXTRACTION WITH DIADEM

A framework such as SeCo allows the user to search for objects with a given specification, rather than just for potentially relevant Web documents as keyword search engines do. To that end, structured data is required, where objects and their attributes are described in a well-understood schema. Unfortunately, most commercial Web sites do not provide their objects (such as job listings, properties, or products) as structured data. This is particularly true for businesses with little technical expertise.

Automatically turning existing Web sites into structured data has been mostly an unrealized dream in the past. Previous approaches to fully-automated data extraction addressed the problem by investigating general techniques that can be applied to any web site [4]. With respect to existing approaches, DIADEM is based on a fundamental observation: if we combine knowledge about a domain (e.g., that a four-figure price is more likely a rent price than a sales price in real estate) with knowledge about the appearance of objects and search facilities in that domain (phenomenology), we can automatically derive an extraction program for nearly any web page in the domain. The resulting program produces high-precision data, as we use domain knowledge to improve recognition and alignment and to verify the extraction program based on ontological constraints [5].

DIADEM operates in two modes: in the analysis mode, a web site is scrutinized to find relevant objects and search forms and to understand how to extract all data from that site. In the extraction mode, this knowledge is used to extract all data at high speed, assuming that the site has not changed fundamentally since the analysis.

In analysis mode, DIADEM answers primarily three questions: (1) How do we have to navigate the site (e.g., by clicking on links, following pagination links, etc.) to extract all the results? (2) Are there any forms to fill, and how do we fill them to find all results? (3) How are result records and their attributes structured and displayed? For each of these questions, DIADEM uses both domain-independent heuristics encoding typical web design patterns and domain-dependent clues and high-level knowledge to locate specific objects and their attributes and to verify and align the resulting structured data. Except for a thin browser interaction layer and some off-the-shelf machine learning tools, the whole process is encoded in logical rules, possibly involving probabilistic knowledge.

Finally, all the collected models are passed to the OXPath generator, which uses simple heuristics to create a generalized OXPath expression for use in extraction mode.

To illustrate how DIADEM analyses a Web site, we focus on result-page analysis (the third question), see Figure 2. First we extract the page model from a live rendering of the Web page. This model logically represents the DOM tree of the page along with information on the visual rendering (e.g., CSS boxes) and linguistic annotations. The information provided by the browser model is mainly domain-independent (e.g., DOM structure and CSS boxes), while some of the linguistic annotations are generated by domain-specific gazetteers and rules. In the next step, we locate mandatory attributes of the records that we expect to find on a web page of the given domain; then, we proceed to the segmentation of the page into records through domain-independent heuristics. The identified records are then validated using a result-page model, see Figure 3.

Figure 2: DIADEM's result-page analysis

Figure 3: DIADEM's result-page model

Not only HTML. In many domains, non-HTML data makes up a small but significant part of the description of objects, usually as PDF documents, but sometimes just as bitmap images. Sometimes, this information merely supports the structured data (e.g., the pictures of a car on an auto-trading website); in other cases, however, these web resources carry additional information that is not present in the structured data and therefore cannot be accessed by either traditional or object search engines.

For instance, on almost all UK real-estate Web sites, users cannot search for an apartment by energy efficiency or by size of the rooms, even though this information is clearly present on the websites. The reason is that the energy efficiency of a house is published as an EPC (Energy Performance Certificate) chart (see wikipedia.org/wiki/Energy_Performance_Certificate) and the sizes of the rooms are published in the floor-plan images.

The automated extraction of this data is non-trivial, since it might require computer vision and OCR techniques. DIADEM addresses this problem by exploiting the knowledge of the domain to improve existing image and PDF/PS analysis techniques. As an example, the structure of EPC charts is standardized by an EU directive, therefore it is easy to "reverse-engineer" their semantics. For PDF brochures, it is possible to adopt analysis techniques similar to those adopted for HTML, since the structure of such documents is also reducible to a few patterns that can be easily identified by an automatic analysis.

4. TOWARD MULTI-DOMAIN, AUTOMATED WEB DATA CONSUMPTION

Our approach for the integration of structured and unstructured Web data sources is based on a service-oriented vision of the resources. The source integration operates at three levels: wrapping, registration, and invocation.

Service wrapping consists in implementing appropriate wrapping components that take care of invoking the services and manipulating the input and output so as to be consistent with the formats expected by the integration platform. The SeCo platform natively supports generic Web services, relational databases, YQL services, SPARQL endpoints, etc. However, the system is open to support additional data source types.

We suggest two ways for integrating DIADEM data sources into SeCo. In both cases, we assume that the schema used in SeCo matches (a fragment of) the domain ontology used in DIADEM. The first, off-line approach extracts all the data of a site contextually with the analysis and stores it, e.g., in an RDF database together with the domain ontology. This database can be accessed as any other SPARQL endpoint. The advantage of this approach is that it provides very good query performance, but at the cost of storage and consistency. In domains with fast-changing data, the database will often be outdated compared to the data on the live web site.

This deficit is addressed by the on-line approach, where an OXPath expression is generated by the DIADEM analysis and that expression is executed to extract the data at query time. A slightly specialized OXPath invoker is needed for this approach, as it needs to store the OXPath expression together with possible parameters for form filling. OXPath returns the extracted data in XML or RDF format, structured according to the SeCo schema. The latter is ensured by the construction process in the analysis, where the SeCo schema, in the form of the high-level DIADEM ontology, is used to verify the extraction expression.

Figure 4: UML class diagram of the SeCo invokers
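Such an on-line invoker can be sketched as follows. This is our illustration, not SeCo's actual OXPathInvoker: the `engine` callable stands in for the OXPath evaluator, and the brace placeholders for form-filling parameters are our own convention.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class OnlineOXPathInvoker:
    """Sketch of an on-line DIADEM invoker: it stores the generated OXPath
    expression plus default form-filling parameters, and runs the
    extraction at query time."""
    expression: str                       # generalized extraction expression
    engine: Callable[[str], list]         # evaluates an instantiated expression
    defaults: dict = field(default_factory=dict)

    def invoke(self, **params) -> list:
        bound = {**self.defaults, **params}
        # instantiate form-filling placeholders such as {location}
        instantiated = self.expression.format(**bound)
        return self.engine(instantiated)

# Toy evaluator: pretend every call returns one record echoing the expression.
toy_engine = lambda expr: [{"source": expr}]

invoker = OnlineOXPathInvoker(
    expression="doc('example.org')//form[{location}]//div.result",
    engine=toy_engine,
    defaults={"location": "'Oxford'"},
)
```

At query time, `invoker.invoke(location="'London'")` re-binds the form parameter and re-runs the extraction, which is exactly what makes the on-line approach fresh but potentially slow.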
The disadvantage of this approach is that, for large or complex Web sites, extraction may take too long for on-line queries. This can be partially alleviated by the high-level caching provided in SeCo. In the future, we plan to investigate techniques for incremental data extraction, where only new data is extracted. This is also useful for the off-line approach if frequent updates are desired.

Service description in SeCo is based on the registration of services within the Service Description Framework model, which describes services at three levels of abstraction: Service Marts (abstractions of several Web services dealing with the same conceptual objects available on the Web, such as "flights", "hotels", "restaurants"), Access Patterns (a specific signature of the Service Mart with the characterization of each attribute as input, output, and/or ranking), and Service Interfaces (a description of the invocation interface of an actual source service), leading from the conceptual representation of Web objects to the implementation of search services.
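The three description levels can be written down as plain data structures. This is an illustrative sketch with hypothetical names, not the actual SDF model:

```python
from dataclasses import dataclass
from enum import Enum

class Role(Enum):  # characterization of an access-pattern attribute
    INPUT = "input"
    OUTPUT = "output"
    RANKING = "ranking"

@dataclass
class ServiceMart:        # conceptual level: one object type, e.g. "hotels"
    name: str
    attributes: tuple

@dataclass
class AccessPattern:      # signature level: assigns a role to each attribute
    mart: ServiceMart
    roles: dict           # attribute name -> Role

@dataclass
class ServiceInterface:   # implementation level: a concrete invocable service
    pattern: AccessPattern
    endpoint_url: str

hotels = ServiceMart("hotels", ("city", "price", "rating"))
by_city = AccessPattern(hotels, {"city": Role.INPUT,
                                 "price": Role.OUTPUT,
                                 "rating": Role.RANKING})
endpoint = ServiceInterface(by_city, "https://example.org/hotels")
```

In this reading, DIADEM's form analysis would populate the AccessPattern level automatically, while the generated OXPath expression (or the off-line SPARQL endpoint) plays the role of the ServiceInterface.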
If we combine SeCo with DIADEM, we can easily instantiate service descriptions for any website of a domain. Starting from a description of the conceptual objects of a domain, shared between the SeCo service marts and the DIADEM high-level ontology, DIADEM can automatically recognize existing access patterns (by form analysis) and translate them into SeCo service descriptions.

Service execution is performed by an engine, which exploits the Service Description Framework. The execution engine consists of a runtime (a Panta Rhei [6] interpreter able to translate an execution plan into a coordinated sequence of service invocations) and a set of service invokers. Low-level service invokers (one for each data source type, including the one for on-line DIADEM sources) are implemented following the chain of responsibility pattern (see Figure 4). There is no need for a special invoker for off-line DIADEM sources, as those reduce to SPARQL invokers where the data is the result of the off-line extraction. A high-level caching invoker wraps the sequence of low-level invokers to read results from the cache.

5. CONCLUSIONS

Rich object search is one of the major challenges in Web research. In this paper, we show how a combination of SeCo and DIADEM has the potential to address the major challenges involved in object search: (1) the integration of multi-domain data sources, including an easy interface for formulating and refining expressive, multi-domain queries, and (2) the automatic extraction of highly accurate, structured data from most existing web sites.

We plan to further investigate the integration of SeCo and DIADEM. In particular, a further alignment of the conceptual descriptions, access patterns, and service interfaces would be useful. We are currently investigating the automatic extraction of rich access patterns and integrity constraints from existing Web forms. We also plan to develop techniques for incremental data extraction to allow the wrapping of time-sensitive services.

6. REFERENCES

[1] Ceri, S., Brambilla, M., eds.: Search Computing: Trends and Developments. Volume 6585 of Lecture Notes in Computer Science. Springer (2011)
[2] Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: OXPath: A language for scalable, memory-efficient data extraction from web applications. In: VLDB (2011)
[3] Bozzon, A., Brambilla, M., Ceri, S., Fraternali, P.: Liquid query: multi-domain exploratory search on the web. In: Proceedings of the 19th International Conference on World Wide Web (WWW '10), ACM (2010) 161–170
[4] Chang, C.-H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10) (2006) 1411–1428
[5] Furche, T., Gottlob, G., et al.: Real understanding of real estate forms. In: WIMS '11, ACM (2011) 13:1–13:12
[6] Braga, D., Corcoglioniti, F., Grossniklaus, M., Vadacca, S.: Panta Rhei: Optimized and ranked data processing over heterogeneous sources. In: ICSOC 2010. Volume 6470 of Lecture Notes in Computer Science. Springer (2010) 715–716