=Paper=
{{Paper
|id=None
|storemode=property
|title=Search Computing Meets Data Extraction
|pdfUrl=https://ceur-ws.org/Vol-880/VLDS-p58-Furche.pdf
|volume=Vol-880
|dblpUrl=https://dblp.org/rec/conf/vlds/BozzonFOPTV11
}}
==Search Computing Meets Data Extraction==
Tim Furche, Giorgio Orsi
Oxford University, Department of Computer Science
Wolfson Building, Parks Road, Oxford OX1 3QD
firstname.lastname@cs.ox.ac.uk

Alessandro Bozzon, Chiara Pasini, Luca Tettamanti, Salvatore Vadacca
Politecnico di Milano
Via Ponzio 34/5, 20133 Milano, Italy
firstname.lastname@elet.polimi.it
ABSTRACT

Thanks to the Web, access to an increasing wealth and variety of information has become near instantaneous. To make informed decisions, however, we often need to access data from many different sources and integrate different types of information. Manually collecting data from scores of web sites and combining that data remains a daunting task.

The ERC projects SeCo (Search Computing) and DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology) address two aspects of this problem: SeCo supports complex search processes drawing on data from multiple domains, with a user interface capable of refining and exploring the search results; DIADEM aims to automatically extract structured data from a domain's websites.

In this paper, we outline a first approach for integrating SeCo and DIADEM. We discuss how to use the DIADEM methodology to automatically turn nearly any website from a given domain into a SeCo search service, and we describe how such services can be registered and exploited by the SeCo framework in combination with services from other domains (possibly developed with other methodologies).

∗The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 246858 (DIADEM) and the 2008 Call for "IDEAS Advanced Grants" as part of the Search Computing (SeCo) project.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This paper was presented at Very Large Data Search (VLDS) 2011. Copyright 2011.

1. INTRODUCTION

Recent years witnessed a paradigmatic shift in the way people deal with information. The Web provides cheap and ubiquitous access to an increasing wealth and variety of data. Yet, making informed decisions, which often requires complex and articulated information retrieval tasks involving access to information from many different sources, remains a daunting task. Queries such as "Retrieve jobs as Java Developer in the Silicon Valley, nearby affordable fully-furnished flats, and close to good schools" are, unfortunately, not addressed by current search engines. From a vast list of potential sources, it is left to the user to manually extract and integrate the relevant data.

The Search Computing (SeCo) project [1] aims at building concepts, algorithms, tools, and technologies to support complex Web queries, through a new paradigm based on combining data extraction from distinct sources and data integration by means of specialized integration engines. Web data is typically published in two ways: as structured (and possibly linked) data accessible through Web APIs (e.g., SPARQL, YQL), and as unstructured resources (i.e., Web pages), possibly accessible only through user interaction such as form filling or link navigation.

Unstructured data is typically accessible to general-purpose search engines, which exploit traditional information retrieval techniques. To enable the consumption of such data by automated processes, data accessible to humans through existing Web interfaces needs to be transformed into structured information; there is therefore a need for data extraction tools (e.g., screen scrapers). Unfortunately, the interactive nature of modern Web interfaces poses a big challenge: the dynamic behavior of these user interfaces, driven by client- and server-side scripting, makes it hard for automated processes to access this information.

The DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology) project (see diadem-project.info) aims at developing domain-specific data extraction systems that take as input a URL of a Web site in a particular application domain, automatically explore the Web site, and deliver as output a structured data set containing all the relevant information present on that site. It is based on a novel, knowledge-driven approach that combines low-level annotations with high-level domain knowledge and sophisticated analysis rules encoding common Web design patterns. The first prototype for the UK real-estate domain outperforms existing data extraction tools and validates the premise that, with a thin layer of domain-specific knowledge, nearly perfect automated data extraction is feasible.

Once a web site is analyzed, the DIADEM engine can provide a one-time copy of all the data of that site, structured according to the provided schema. Alternatively, an extraction expression, formulated in OXPath [2], can be returned that extracts all the data on demand at high speed.
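To give a flavor of such expressions, the following sketch fills a search form and marks result data for extraction. It is purely illustrative: the site, field positions, and CSS classes are invented, and the syntax only approximates that of [2]:

```
doc("http://www.example-estates.co.uk")
  /descendant::field()[1]/{"Oxford"}
  /following::field()[1]/{click /}
  //div[@class="listing"]:<property>
    [.//span[@class="price"]:<price=string(.)>]
```

In OXPath, field() matches form fields, {...} denotes a browser action (the trailing / in {click /} indicates that the action loads a new page), and :<property> and :<price=...> are extraction markers that assemble nested records from the visited pages.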
Figure 1: The Search Computing architecture
1.1 Motivations and Outline

As users get acquainted with on-line search and decision-support systems, their information needs evolve: their queries become more and more specific and complex, and their demand for correct and up-to-date data increases. Whilst data extraction approaches such as DIADEM can greatly improve the quality of available information, the need arises for systems and tools able to holistically tackle the problem of complex queries, while enabling users to select, explore, and combine data sources in a customized way. A tight integration of DIADEM and SeCo can provide an answer to this need by combining high-precision data extraction, multi-domain service integration, and exploratory search interaction [3]. We demonstrate how the data extraction facilities provided by DIADEM enable the data integration performed in SeCo, so as to easily achieve novel, multi-domain search services over a large number of Web sites.

The paper is organized as follows: Section 2 describes the Search Computing approach to information integration, Section 3 presents the DIADEM approach to data extraction, Section 4 discusses integration issues, and Section 5 concludes the paper.

2. WEB DATA INTEGRATION WITH SEARCH COMPUTING

Figure 1 shows an overview of the Search Computing framework, which comprises several sub-frameworks. The service description framework (SDF) provides the scaffolding for wrapping and registering data sources in service marts, describing the information sources at different levels of abstraction. The user framework provides functionality and storage for registering users, with different roles and capabilities. The query framework supports the management and storage of queries as first-class citizens: a query can be executed, saved, modified, and published for other users to see. The service invocation framework masks the technical issues involved in the interaction with the service marts, e.g., the Web service protocol and data caching issues.

The core of the framework aims at executing multi-domain queries. The query manager takes care of splitting a query into sub-queries (e.g., "Which jobs as Java developer are available in the Silicon Valley?", "Where are affordable, nearby flats?", "Where are good schools?") and binding them to the respective relevant data sources registered in the service mart repository. Starting from this mapping, the query planner produces an optimized query execution plan, which dictates the sequence of steps for executing the query. Finally, the execution engine actually executes the query plan, submitting the service calls to the designated services through the service invocation framework, building the query results by combining the outputs produced by the service calls, computing the global ranking of the query results, and producing the query result outputs in an order that reflects their global relevance.

To obtain a specific Search Computing application, the general-purpose architecture of Figure 1 is customized with the help of tools targeted to programmers, expert users, and end users:

• Service publishers register service mart definitions within the service repository and declare the connection patterns usable to join them. The registration process is realized through a Service Registration Tool that (1) helps the publisher in the specification of the attributes and parameters of service marts (SM), access patterns (AP), and service interfaces (SI), and (2) hides from the user the internal APIs that allow the communication between the service and engine levels. Service publishers are in charge of implementing the mediators, wrappers, or data materialization components needed to make data sources compatible with the service mart standard interface and expected behavior.

• Expert users configure Search Computing applications by selecting the service marts of interest, choosing a data source supporting each service mart, and connecting them through connection patterns. They also configure the complexity of the user interface, in terms of the controls and configuration choices to be left to the end user.

• End users use Search Computing applications configured by expert users. They interact by submitting queries, inspecting results, and refining and evolving their information need according to an exploratory information-seeking approach, which we call Liquid Query [3].

Search Computing thus aims at building new communities of users: content providers, who want to organize their content (now in the form of data collections, databases, and Web pages) in order to make it available for search access by third parties, and expert users, who want to offer new services, built by composing domain-specific content, that go "beyond" general-purpose search engines such as Google.
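The query-processing pipeline of this section (split the query, bind sub-queries to services, execute the plan, combine and rank the outputs) can be sketched as follows. This is a toy illustration only: the services, their data, and the score-combination rule are invented and do not reflect SeCo's actual API or ranking model:

```python
# Toy multi-domain query execution: each sub-query is answered by one
# (invented) registered service, partial results that agree on a join
# attribute are combined, and combinations are ranked globally.
from itertools import product

# Hypothetical services: sub-query label -> callable returning
# (item, local_score) pairs.
SERVICES = {
    "jobs":  lambda: [({"title": "Java developer", "city": "Palo Alto"}, 0.9)],
    "flats": lambda: [({"rent": 1800, "city": "Palo Alto"}, 0.8),
                      ({"rent": 2400, "city": "San Jose"}, 0.6)],
}

def execute(plan, join_attr):
    """Invoke every service in the plan, join tuples on `join_attr`, and
    order the combinations by a global score (here: product of local scores)."""
    partial = [SERVICES[name]() for name in plan]
    combos = []
    for pairs in product(*partial):
        items = [item for item, _ in pairs]
        if len({item[join_attr] for item in items}) == 1:  # join condition holds
            score = 1.0
            for _, local in pairs:
                score *= local
            combos.append((items, score))
    return sorted(combos, key=lambda c: -c[1])  # global relevance order

ranked = execute(["jobs", "flats"], join_attr="city")
```

In the real engine, the plan additionally fixes invocation order and data-flow between services; here the join and ranking are collapsed into a single in-memory step for illustration.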
3. AUTOMATIC DATA EXTRACTION WITH DIADEM

A framework such as SeCo allows the user to search for objects with a given specification, rather than just for potentially relevant Web documents as keyword search engines do. To that end, structured data is required, where objects and their attributes are described in a well-understood schema. Unfortunately, most commercial Web sites do not provide their objects (such as job listings, properties, or products) as structured data. This is particularly true for businesses with little technical expertise.

Automatically turning existing Web sites into structured data has been mostly an unrealized dream in the past. Previous approaches to fully automated data extraction addressed the problem by investigating general techniques that can be applied to any web site [4]. With respect to existing approaches, DIADEM is based on a fundamental observation: if we combine knowledge about a domain (e.g., that a four-figure price is more likely a rent than a sales price in real estate) with knowledge about the appearance of objects and search facilities in that domain (its phenomenology), we can automatically derive an extraction program for nearly any web page in the domain. The resulting program produces high-precision data, as we use domain knowledge to improve recognition and alignment and to verify the extraction program against ontological constraints [5].
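A domain rule of this kind (e.g., disambiguating rent versus sale prices) can be illustrated with a deliberately simplified sketch. This is our own toy example: DIADEM's actual rules are logical, may be probabilistic, and are far richer:

```python
# Toy domain rule from UK real estate: a four-figure price attached to a
# listing is far more likely a monthly rent than a sale price, so an
# ambiguous price annotation can be disambiguated by a domain-specific prior.
def classify_price(value_gbp: int) -> str:
    """Return the most plausible role of a price annotation."""
    if value_gbp < 10_000:   # four figures or fewer: plausibly a rent
        return "rent"
    return "sale"            # five figures and up: plausibly a sale price

roles = [classify_price(v) for v in (1_200, 250_000)]
```

The point is not the threshold itself, but that such priors let the analysis label otherwise indistinguishable annotations and then verify the labels against the domain ontology.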
Figure 2: DIADEM's result-page analysis

Figure 3: DIADEM's result-page model

DIADEM operates in two modes. In the analysis mode, a web site is scrutinized to find relevant objects and search forms and to understand how to extract all data from that site. In the extraction mode, this knowledge is used to extract all data at high speed, assuming that the site has not changed fundamentally since the analysis.

In analysis mode, DIADEM answers primarily three questions: (1) How do we have to navigate the site (e.g., by clicking on links, following pagination links, etc.) to extract all the results? (2) Are there any forms to fill, and how do we fill them to find all results? (3) How are result records and their attributes structured and displayed? For each of these questions, DIADEM uses both domain-independent heuristics encoding typical web design patterns and domain-dependent clues and high-level knowledge to locate specific objects and their attributes and to verify and align the resulting structured data. Except for a thin browser interaction layer and some off-the-shelf machine learning tools, the whole process is encoded in logical rules, possibly involving probabilistic knowledge. Finally, all the collected models are passed to the OXPath generator, which uses simple heuristics to create a generalized OXPath expression for use in extraction mode.

To illustrate how DIADEM analyses a Web site, we focus on result-page analysis (the third question); see Figure 2. First, we extract the page model from a live rendering of the Web page. This model logically represents the DOM tree of the page along with information on the visual rendering (e.g., CSS boxes) and linguistic annotations. The information provided by the browser model is mainly domain-independent (e.g., DOM structure and CSS boxes), while some of the linguistic annotations are generated by domain-specific gazetteers and rules. In the next step, we locate mandatory attributes of the records that we expect to find on a web page of the given domain; then we proceed to segment the page into records through domain-independent heuristics. The identified records are finally validated using a result-page model; see Figure 3.
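The segmentation step can be caricatured as grouping sibling nodes that share a structural signature. The following rough sketch is under our own simplifications; the actual analysis is rule-based and also draws on visual and linguistic evidence:

```python
# Simplified record segmentation: among the children of a candidate data
# area, the dominant repeated (tag, class) signature is taken to delimit
# the result records.
from collections import Counter

def segment_records(children):
    """children: list of (tag, css_class) pairs for the nodes of a candidate
    data area. Returns the dominant signature and the record count."""
    counts = Counter(children)
    signature, n = counts.most_common(1)[0]
    return signature, n

sig, n = segment_records([
    ("div", "listing"), ("div", "ad"), ("div", "listing"), ("div", "listing"),
])
```

Records identified this way (here: three "div.listing" nodes, with the "div.ad" node rejected as noise) would then be checked against the result-page model of Figure 3.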
Not only HTML. In many domains, non-HTML data makes up a small but significant part of the description of objects, usually as PDF documents, but sometimes just as bitmap images. Sometimes this information merely supports the structured data (e.g., the pictures of a car on an auto-trading website); in other cases, however, these web resources carry additional information that is not present in the structured data and therefore cannot be accessed by either traditional or object search engines.

For instance, on almost all UK real-estate Web sites, users cannot search for an apartment by energy efficiency or by the size of its rooms, despite this information being clearly present on the websites. The reason is that the energy efficiency of a house is published as an EPC (Energy Performance Certificate, see wikipedia.org/wiki/Energy_Performance_Certificate) chart, and the sizes of the rooms are published in floor-plan images. The automated extraction of this data is non-trivial, since it might require computer vision and OCR techniques. DIADEM addresses this problem by exploiting domain knowledge to improve existing image and PDF/PS analysis techniques. For example, the structure of EPC charts is standardized by an EU directive; it is therefore easy to "reverse-engineer" their semantics. For PDF brochures, it is possible to adopt analysis techniques similar to those used for HTML, since the structure of such documents is also reducible to a few patterns that can be easily identified by an automatic analysis.

4. TOWARD MULTI-DOMAIN, AUTOMATED WEB DATA CONSUMPTION

Our approach to the integration of structured and unstructured Web data sources is based on a service-oriented vision of the resources. The source integration operates at three levels: wrapping, registration, and invocation.

Service wrapping consists in implementing appropriate wrapping components that take care of invoking the services and manipulating their input and output so as to be consistent with the formats expected by the integration platform. The SeCo platform natively supports generic Web services, relational databases, YQL services, SPARQL endpoints, etc. However, the system is open to supporting additional data source types.

We suggest two ways of integrating DIADEM data sources into SeCo. In both cases, we assume that the schema used in SeCo matches (a fragment of) the domain ontology used in DIADEM. The first, off-line approach extracts all the data of a site contextually with the analysis and stores it, e.g., in an RDF database together with the domain ontology. This database can then be accessed as any other SPARQL endpoint. The advantage of this approach is that it provides very good query performance, but at the cost of storage and consistency: in domains with fast-changing data, the database will often be outdated compared to the data on the live web site.

This deficit is addressed by the on-line approach, where an OXPath expression is generated by the DIADEM analysis and that expression is executed to extract the data at query time. A slightly specialized OXPath invoker is needed for this approach, as it needs to store the OXPath expression together with possible parameters for form filling. OXPath returns the extracted data in XML or RDF format, structured according to the SeCo schema.
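Such a specialized invoker could be organized roughly as follows. This is a hypothetical sketch: the class and parameter names are ours, the placeholder-substitution scheme is an assumption, and the OXPath engine itself is stubbed out:

```python
# Hypothetical on-line invoker: stores a parameterized OXPath expression
# and instantiates the form-filling parameters at query time.
class OnlineOXPathInvoker:
    def __init__(self, template, engine):
        self.template = template  # expression with {param} placeholders
        self.engine = engine      # callable evaluating an OXPath expression

    def invoke(self, **params):
        expression = self.template.format(**params)
        return self.engine(expression)  # structured records (e.g., XML or RDF)

# Stub engine that merely echoes the instantiated expression.
invoker = OnlineOXPathInvoker(
    'doc("site")/descendant::field()[1]/{{"{location}"}}//div:<property>',
    engine=lambda expr: expr,
)
instantiated = invoker.invoke(location="Oxford")
```

At query time, the SeCo engine would pass the access-pattern input attributes (here, the hypothetical `location`) as the form-filling parameters.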
Figure 4: UML class diagram of the SeCo invokers
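Rendered as code, the invoker hierarchy of Figure 4 might look like this minimal chain-of-responsibility sketch (our own Python rendering for illustration, not SeCo's actual Java implementation):

```python
# Minimal chain-of-responsibility sketch: each low-level invoker either
# handles a service descriptor or passes it to its successor; a high-level
# caching invoker wraps the whole chain.
class Invoker:
    def __init__(self, successor=None):
        self.successor = successor

    def handles(self, service):  # overridden by concrete invokers
        return False

    def invoke(self, service, query):
        if self.handles(service):
            return self.call(service, query)
        if self.successor is None:
            raise ValueError("no invoker for type: " + service["type"])
        return self.successor.invoke(service, query)

class SPARQLInvoker(Invoker):
    def handles(self, service):
        return service["type"] == "sparql"
    def call(self, service, query):
        return "SPARQL results for " + query

class OXPathInvoker(Invoker):
    def handles(self, service):
        return service["type"] == "oxpath"
    def call(self, service, query):
        return "OXPath results for " + query

class CachingInvoker:
    """Wraps the low-level chain and serves repeated calls from a cache."""
    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.cache = {}
    def invoke(self, service, query):
        key = (service["type"], query)
        if key not in self.cache:
            self.cache[key] = self.wrapped.invoke(service, query)
        return self.cache[key]

chain = CachingInvoker(SPARQLInvoker(successor=OXPathInvoker()))
```

Adding a new source type then amounts to appending one more link to the chain, which is what makes the pattern attractive for an open set of invokers.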
That the output conforms to the SeCo schema is ensured by the construction process in the analysis, where the SeCo schema, in the form of the high-level DIADEM ontology, is used to verify the extraction expression.

The disadvantage of the on-line approach is that, for large or complex Web sites, extraction may take too long for on-line queries. This can be somewhat alleviated by the high-level caching provided in SeCo. In the future, we plan to investigate techniques for incremental data extraction, where only new data is extracted. This is also useful for the off-line approach if frequent updates are desired.

Service description in SeCo is based on the registration of services within the Service Description Framework model, which describes services at three levels of abstraction: service marts (abstractions of several Web services dealing with the same conceptual objects available on the Web, such as "flights", "hotels", or "restaurants"), access patterns (a specific signature of a service mart, with the characterization of each attribute as input, output, and/or ranking), and service interfaces (a description of the invocation interface of an actual source service), leading from the conceptual representation of Web objects to the implementation of search services. If we combine SeCo with DIADEM, we can easily instantiate service descriptions for any website of a domain: starting from a description of the conceptual objects of a domain, shared between the SeCo service marts and the DIADEM high-level ontology, DIADEM can automatically recognize existing access patterns (by form analysis) and translate them into SeCo service descriptions.

Service execution is performed by an engine that exploits the Service Description Framework. The execution engine consists of a runtime (a Panta Rhei [6] interpreter able to translate an execution plan into a coordinated sequence of service invocations) and a set of service invokers. Low-level service invokers (one for each data source type, including one for on-line DIADEM sources) are implemented following the chain-of-responsibility pattern (see Figure 4). There is no need for a special invoker for off-line DIADEM sources, as those reduce to SPARQL invokers where the data is the result of the off-line extraction. A high-level caching invoker wraps the sequence of low-level invokers to read results from the cache.

5. CONCLUSIONS

Rich object search is one of the major challenges in Web research. In this paper, we show how a combination of SeCo and DIADEM has the potential to address the major challenges involved in object search: (1) the integration of multi-domain data sources, including an easy interface for formulating and refining expressive, multi-domain queries, and (2) the automatic extraction of highly accurate, structured data from most existing web sites.

We plan to further investigate the integration of SeCo and DIADEM. In particular, a further alignment of the conceptual descriptions, access patterns, and service interfaces would be useful. We are currently investigating the automatic extraction of rich access patterns and integrity constraints from existing Web forms. We also plan to develop techniques for incremental data extraction to allow the wrapping of time-sensitive services.

6. REFERENCES

[1] Ceri, S., Brambilla, M., eds.: Search Computing: Trends and Developments. Volume 6585 of Lecture Notes in Computer Science. Springer (2011)
[2] Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: OXPath: A language for scalable, memory-efficient data extraction from web applications. In: VLDB (2011)
[3] Bozzon, A., Brambilla, M., Ceri, S., Fraternali, P.: Liquid query: Multi-domain exploratory search on the web. In: Proceedings of the 19th International Conference on World Wide Web (WWW '10), New York, NY, USA, ACM (2010) 161–170
[4] Chang, C.-H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10) (2006) 1411–1428
[5] Furche, T., Gottlob, G., et al.: Real understanding of real estate forms. In: WIMS '11, New York, NY, USA, ACM (2011) 13:1–13:12
[6] Braga, D., Corcoglioniti, F., Grossniklaus, M., Vadacca, S.: Panta Rhei: Optimized and ranked data processing over heterogeneous sources. In: ICSOC 2010. Volume 6470 of Lecture Notes in Computer Science. Springer (2010) 715–716