=Paper=
{{Paper
|id=None
|storemode=property
|title=Search Computing Meets Data Extraction
|pdfUrl=https://ceur-ws.org/Vol-880/VLDS-p58-Furche.pdf
|volume=Vol-880
|dblpUrl=https://dblp.org/rec/conf/vlds/BozzonFOPTV11
}}
==Search Computing Meets Data Extraction==
Tim Furche, Giorgio Orsi
Oxford University, Department of Computer Science
Wolfson Building, Parks Road, Oxford OX1 3QD
firstname.lastname@cs.ox.ac.uk

Alessandro Bozzon, Chiara Pasini, Luca Tettamanti, Salvatore Vadacca
Politecnico di Milano
Via Ponzio 34/5, 20133 Milano, Italy
firstname.lastname@elet.polimi.it
ABSTRACT

Thanks to the Web, access to an increasing wealth and variety of information has become near instantaneous. To make informed decisions, however, we often need to access data from many different sources and integrate different types of information. Manually collecting data from scores of web sites and combining that data remains a daunting task.

The ERC projects SeCo (Search Computing) and DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology) address two aspects of this problem: SeCo supports complex search processes drawing on data from multiple domains, with a user interface capable of refining and exploring the search results; DIADEM aims to automatically extract structured data from a domain's websites.

In this paper, we outline a first approach for integrating SeCo and DIADEM. We discuss how to use the DIADEM methodology to automatically turn nearly any website from a given domain into a SeCo search service, and we describe how such services can be registered and exploited by the SeCo framework in combination with services from other domains (possibly developed with other methodologies).

∗The research leading to these results has received funding from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007–2013) / ERC grant agreement no. 246858 (DIADEM) and the 2008 Call for "IDEAS Advanced Grants" as part of the Search Computing (SeCo) project.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This paper was presented at Very Large Data Search (VLDS) 2011. Copyright 2011.

1. INTRODUCTION

Recent years witnessed a paradigmatic shift in the way people deal with information. The Web provides cheap and ubiquitous access to an increasing wealth and variety of data. Yet, making informed decisions, which often requires complex and articulated information retrieval tasks involving access to information from many different sources, remains a daunting task. Queries such as "Retrieve jobs as Java Developer in the Silicon Valley, nearby affordable fully-furnished flats, and close to good schools" are, unfortunately, not addressed by current search engines. From a vast list of potential sources, it is left to the user to manually extract and integrate the relevant data.

The Search Computing (SeCo) project [1] aims at building concepts, algorithms, tools, and technologies to support complex Web queries, through a new paradigm based on combining data extraction from distinct sources and data integration by means of specialized integration engines. Web data is typically published in two ways: as structured (and possibly linked) data accessible through Web APIs (e.g., SPARQL, YQL), and as unstructured resources (i.e., Web pages), possibly accessible only through user interaction such as form filling or link navigation.

Unstructured data is typically accessible to general-purpose search engines, which exploit traditional information retrieval techniques. To enable the consumption of such data by automated processes, data accessible to humans through existing Web interfaces needs to be transformed into structured information; there is therefore a need for data extraction tools (e.g., screen scrapers). Unfortunately, the interactive nature of modern Web interfaces poses a big challenge: the dynamic behavior of these user interfaces, driven by client- and server-side scripting, makes it hard for automated processes to access this information.

The DIADEM (Domain-centric Intelligent Automated Data Extraction Methodology) project (see diadem-project.info) aims at developing domain-specific data extraction systems that take as input a URL of a Web site in a particular application domain, automatically explore the Web site, and deliver as output a structured data set containing all the relevant information present on that site. It is based on a novel, knowledge-driven approach that combines low-level annotations with high-level domain knowledge and sophisticated analysis rules encoding common Web design patterns. The first prototype for the UK real-estate domain outperforms existing data extraction tools and validates the premise that, with a thin layer of domain-specific knowledge, nearly perfect automated data extraction is feasible.

Once a web site is analyzed, the DIADEM engine can provide a one-time copy of all the data of that site, structured according to the provided schema. Alternatively, an extraction expression, formulated in OXPath [2], can be returned that extracts all the data on demand at high speed.
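To give a flavor of such expressions, the following sketch fills a search form and marks result data for extraction. It is purely illustrative: the site, field positions, and CSS classes are invented, and the syntax only approximates that of [2]:

```
doc("http://www.example-estates.co.uk")
  /descendant::field()[1]/{"Oxford"}
  /following::field()[1]/{click /}
  //div[@class="listing"]:<property>
    [.//span[@class="price"]:<price=string(.)>]
```

In OXPath, field() matches form fields, {...} denotes a browser action (the trailing / in {click /} indicates that the action loads a new page), and :<property> and :<price=...> are extraction markers that assemble nested records from the visited pages.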
Figure 1: The Search Computing architecture
1.1 Motivations and Outline

As users get acquainted with on-line search and decision-support systems, their information needs evolve: their queries become more and more specific and complex, and their demand for correct and up-to-date data increases. Whilst data extraction approaches such as DIADEM can greatly improve the quality of available information, the need arises for systems and tools able to holistically tackle the problem of complex queries, while enabling users to select, explore, and combine data sources in a customized way. A tight integration of DIADEM and SeCo can provide an answer to this need by combining high-precision data extraction, multi-domain service integration, and exploratory search interaction [3]. We demonstrate how the data extraction facilities provided by DIADEM enable the data integration performed in SeCo, so as to easily achieve novel, multi-domain search services over a large number of Web sites.

The paper is organized as follows: Section 2 describes the Search Computing approach to information integration, Section 3 presents the DIADEM approach to data extraction, Section 4 discusses integration issues, and Section 5 concludes the paper.

2. WEB DATA INTEGRATION WITH SEARCH COMPUTING

Figure 1 shows an overview of the Search Computing framework, which comprises several sub-frameworks. The service description framework (SDF) provides the scaffolding for wrapping and registering data sources in service marts, describing the information sources at different levels of abstraction. The user framework provides functionality and storage for registering users, with different roles and capabilities. The query framework supports the management and storage of queries as first-class citizens: a query can be executed, saved, modified, and published for other users to see. The service invocation framework masks the technical issues involved in the interaction with the service marts, e.g., the Web service protocol and data caching issues.

The core of the framework aims at executing multi-domain queries. The query manager takes care of splitting a query into sub-queries (e.g., "Which jobs as Java developer are available in the Silicon Valley?", "Where are affordable, nearby flats?", "Where are good schools?") and binding them to the respective relevant data sources registered in the service mart repository. Starting from this mapping, the query planner produces an optimized query execution plan, which dictates the sequence of steps for executing the query. Finally, the execution engine actually executes the query plan, submitting the service calls to the designated services through the service invocation framework, building the query results by combining the outputs produced by the service calls, computing the global ranking of the query results, and producing the query result outputs in an order that reflects their global relevance.

To obtain a specific Search Computing application, the general-purpose architecture of Figure 1 is customized with the help of tools targeted to programmers, expert users, and end users:

• Service publishers register service mart definitions within the service repository and declare the connection patterns usable to join them. The registration process is realized through a Service Registration Tool that (1) helps the publisher in the specification of the attributes and parameters of service marts (SM), access patterns (AP), and service interfaces (SI), and (2) hides from the user the internal APIs that allow the communication between the service and engine levels. Service publishers are in charge of implementing the mediators, wrappers, or data materialization components needed to make data sources compatible with the service mart standard interface and expected behavior.

• Expert users configure Search Computing applications by selecting the service marts of interest, choosing a data source supporting each service mart, and connecting them through connection patterns. They also configure the complexity of the user interface, in terms of the controls and configuration choices to be left to the end user.

• End users use Search Computing applications configured by expert users. They interact by submitting queries, inspecting results, and refining and evolving their information need according to an exploratory information-seeking approach, which we call Liquid Query [3].

Search Computing thus aims at building new communities of users: content providers, who want to organize their content (now in the form of data collections, databases, and Web pages) in order to make it available for search access by third parties, and expert users, who want to offer new services, built by composing domain-specific content, that go "beyond" general-purpose search engines such as Google.
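The query-processing pipeline of this section (split the query, bind sub-queries to services, execute the plan, combine and rank the outputs) can be sketched as follows. This is a toy illustration only: the services, their data, and the score-combination rule are invented and do not reflect SeCo's actual API or ranking model:

```python
# Toy multi-domain query execution: each sub-query is answered by one
# (invented) registered service, partial results that agree on a join
# attribute are combined, and combinations are ranked globally.
from itertools import product

# Hypothetical services: sub-query label -> callable returning
# (item, local_score) pairs.
SERVICES = {
    "jobs":  lambda: [({"title": "Java developer", "city": "Palo Alto"}, 0.9)],
    "flats": lambda: [({"rent": 1800, "city": "Palo Alto"}, 0.8),
                      ({"rent": 2400, "city": "San Jose"}, 0.6)],
}

def execute(plan, join_attr):
    """Invoke every service in the plan, join tuples on `join_attr`, and
    order the combinations by a global score (here: product of local scores)."""
    partial = [SERVICES[name]() for name in plan]
    combos = []
    for pairs in product(*partial):
        items = [item for item, _ in pairs]
        if len({item[join_attr] for item in items}) == 1:  # join condition holds
            score = 1.0
            for _, local in pairs:
                score *= local
            combos.append((items, score))
    return sorted(combos, key=lambda c: -c[1])  # global relevance order

ranked = execute(["jobs", "flats"], join_attr="city")
```

In the real engine, the plan additionally fixes invocation order and data-flow between services; here the join and ranking are collapsed into a single in-memory step for illustration.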
3. AUTOMATIC DATA EXTRACTION WITH DIADEM

A framework such as SeCo allows the user to search for objects with a given specification, rather than just for potentially relevant Web documents as keyword search engines do. To that end, structured data is required, where objects and their attributes are described in a well-understood schema. Unfortunately, most commercial Web sites do not provide their objects (such as job listings, properties, or products) as structured data. This is particularly true for businesses with little technical expertise.

Automatically turning existing Web sites into structured data has been mostly an unrealized dream in the past. Previous approaches to fully automated data extraction addressed the problem by investigating general techniques that can be applied to any web site [4]. With respect to existing approaches, DIADEM is based on a fundamental observation: if we combine knowledge about a domain (e.g., that a four-figure price is more likely a rent than a sales price in real estate) with knowledge about the appearance of objects and search facilities in that domain (its phenomenology), we can automatically derive an extraction program for nearly any web page in the domain. The resulting program produces high-precision data, as we use domain knowledge to improve recognition and alignment and to verify the extraction program against ontological constraints [5].
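A domain rule of this kind (e.g., disambiguating rent versus sale prices) can be illustrated with a deliberately simplified sketch. This is our own toy example: DIADEM's actual rules are logical, may be probabilistic, and are far richer:

```python
# Toy domain rule from UK real estate: a four-figure price attached to a
# listing is far more likely a monthly rent than a sale price, so an
# ambiguous price annotation can be disambiguated by a domain-specific prior.
def classify_price(value_gbp: int) -> str:
    """Return the most plausible role of a price annotation."""
    if value_gbp < 10_000:   # four figures or fewer: plausibly a rent
        return "rent"
    return "sale"            # five figures and up: plausibly a sale price

roles = [classify_price(v) for v in (1_200, 250_000)]
```

The point is not the threshold itself, but that such priors let the analysis label otherwise indistinguishable annotations and then verify the labels against the domain ontology.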
Figure 2: DIADEM's result-page analysis

Figure 3: DIADEM's result-page model

DIADEM operates in two modes. In the analysis mode, a web site is scrutinized to find relevant objects and search forms and to understand how to extract all data from that site. In the extraction mode, this knowledge is used to extract all data at high speed, assuming that the site has not changed fundamentally since the analysis.

In analysis mode, DIADEM answers primarily three questions: (1) How do we have to navigate the site (e.g., by clicking on links, following pagination links, etc.) to extract all the results? (2) Are there any forms to fill, and how do we fill them to find all results? (3) How are result records and their attributes structured and displayed? For each of these questions, DIADEM uses both domain-independent heuristics encoding typical web design patterns and domain-dependent clues and high-level knowledge to locate specific objects and their attributes and to verify and align the resulting structured data. Except for a thin browser interaction layer and some off-the-shelf machine learning tools, the whole process is encoded in logical rules, possibly involving probabilistic knowledge. Finally, all the collected models are passed to the OXPath generator, which uses simple heuristics to create a generalized OXPath expression for use in extraction mode.

To illustrate how DIADEM analyses a Web site, we focus on result-page analysis (the third question); see Figure 2. First, we extract the page model from a live rendering of the Web page. This model logically represents the DOM tree of the page along with information on the visual rendering (e.g., CSS boxes) and linguistic annotations. The information provided by the browser model is mainly domain-independent (e.g., DOM structure and CSS boxes), while some of the linguistic annotations are generated by domain-specific gazetteers and rules. In the next step, we locate mandatory attributes of the records that we expect to find on a web page of the given domain; then we proceed to segment the page into records through domain-independent heuristics. The identified records are finally validated using a result-page model; see Figure 3.
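The segmentation step can be caricatured as grouping sibling nodes that share a structural signature. The following rough sketch is under our own simplifications; the actual analysis is rule-based and also draws on visual and linguistic evidence:

```python
# Simplified record segmentation: among the children of a candidate data
# area, the dominant repeated (tag, class) signature is taken to delimit
# the result records.
from collections import Counter

def segment_records(children):
    """children: list of (tag, css_class) pairs for the nodes of a candidate
    data area. Returns the dominant signature and the record count."""
    counts = Counter(children)
    signature, n = counts.most_common(1)[0]
    return signature, n

sig, n = segment_records([
    ("div", "listing"), ("div", "ad"), ("div", "listing"), ("div", "listing"),
])
```

Records identified this way (here: three "div.listing" nodes, with the "div.ad" node rejected as noise) would then be checked against the result-page model of Figure 3.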
Not only HTML. In many domains, non-HTML data makes up a small but significant part of the description of objects, usually as PDF documents, but sometimes just as bitmap images. Sometimes this information merely supports the structured data (e.g., the pictures of a car on an auto-trading website); in other cases, however, these web resources carry additional information that is not present in the structured data and therefore cannot be accessed by either traditional or object search engines.

For instance, on almost all UK real-estate Web sites, users cannot search for an apartment by energy efficiency or by the size of its rooms, despite this information being clearly present on the websites. The reason is that the energy efficiency of a house is published as an EPC (Energy Performance Certificate, see wikipedia.org/wiki/Energy_Performance_Certificate) chart, and the sizes of the rooms are published in floor-plan images. The automated extraction of this data is non-trivial, since it might require computer vision and OCR techniques. DIADEM addresses this problem by exploiting domain knowledge to improve existing image and PDF/PS analysis techniques. For example, the structure of EPC charts is standardized by an EU directive; it is therefore easy to "reverse-engineer" their semantics. For PDF brochures, it is possible to adopt analysis techniques similar to those used for HTML, since the structure of such documents is also reducible to a few patterns that can be easily identified by an automatic analysis.

4. TOWARD MULTI-DOMAIN, AUTOMATED WEB DATA CONSUMPTION

Our approach to the integration of structured and unstructured Web data sources is based on a service-oriented vision of the resources. The source integration operates at three levels: wrapping, registration, and invocation.

Service wrapping consists in implementing appropriate wrapping components that take care of invoking the services and manipulating their input and output so as to be consistent with the formats expected by the integration platform. The SeCo platform natively supports generic Web services, relational databases, YQL services, SPARQL endpoints, etc. However, the system is open to supporting additional data source types.

We suggest two ways of integrating DIADEM data sources into SeCo. In both cases, we assume that the schema used in SeCo matches (a fragment of) the domain ontology used in DIADEM. The first, off-line approach extracts all the data of a site contextually with the analysis and stores it, e.g., in an RDF database together with the domain ontology. This database can then be accessed as any other SPARQL endpoint. The advantage of this approach is that it provides very good query performance, but at the cost of storage and consistency: in domains with fast-changing data, the database will often be outdated compared to the data on the live web site.

This deficit is addressed by the on-line approach, where an OXPath expression is generated by the DIADEM analysis and that expression is executed to extract the data at query time. A slightly specialized OXPath invoker is needed for this approach, as it needs to store the OXPath expression together with possible parameters for form filling. OXPath returns the extracted data in XML or RDF format, structured according to the SeCo schema.
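Such a specialized invoker could be organized roughly as follows. This is a hypothetical sketch: the class and parameter names are ours, the placeholder-substitution scheme is an assumption, and the OXPath engine itself is stubbed out:

```python
# Hypothetical on-line invoker: stores a parameterized OXPath expression
# and instantiates the form-filling parameters at query time.
class OnlineOXPathInvoker:
    def __init__(self, template, engine):
        self.template = template  # expression with {param} placeholders
        self.engine = engine      # callable evaluating an OXPath expression

    def invoke(self, **params):
        expression = self.template.format(**params)
        return self.engine(expression)  # structured records (e.g., XML or RDF)

# Stub engine that merely echoes the instantiated expression.
invoker = OnlineOXPathInvoker(
    'doc("site")/descendant::field()[1]/{{"{location}"}}//div:<property>',
    engine=lambda expr: expr,
)
instantiated = invoker.invoke(location="Oxford")
```

At query time, the SeCo engine would pass the access-pattern input attributes (here, the hypothetical `location`) as the form-filling parameters.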
Figure 4: UML class diagram of the SeCo invokers
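Rendered as code, the invoker hierarchy of Figure 4 might look like this minimal chain-of-responsibility sketch (our own Python rendering for illustration, not SeCo's actual Java implementation):

```python
# Minimal chain-of-responsibility sketch: each low-level invoker either
# handles a service descriptor or passes it to its successor; a high-level
# caching invoker wraps the whole chain.
class Invoker:
    def __init__(self, successor=None):
        self.successor = successor

    def handles(self, service):  # overridden by concrete invokers
        return False

    def invoke(self, service, query):
        if self.handles(service):
            return self.call(service, query)
        if self.successor is None:
            raise ValueError("no invoker for type: " + service["type"])
        return self.successor.invoke(service, query)

class SPARQLInvoker(Invoker):
    def handles(self, service):
        return service["type"] == "sparql"
    def call(self, service, query):
        return "SPARQL results for " + query

class OXPathInvoker(Invoker):
    def handles(self, service):
        return service["type"] == "oxpath"
    def call(self, service, query):
        return "OXPath results for " + query

class CachingInvoker:
    """Wraps the low-level chain and serves repeated calls from a cache."""
    def __init__(self, wrapped):
        self.wrapped = wrapped
        self.cache = {}
    def invoke(self, service, query):
        key = (service["type"], query)
        if key not in self.cache:
            self.cache[key] = self.wrapped.invoke(service, query)
        return self.cache[key]

chain = CachingInvoker(SPARQLInvoker(successor=OXPathInvoker()))
```

Adding a new source type then amounts to appending one more link to the chain, which is what makes the pattern attractive for an open set of invokers.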
That the output conforms to the SeCo schema is ensured by the construction process in the analysis, where the SeCo schema, in the form of the high-level DIADEM ontology, is used to verify the extraction expression.

The disadvantage of the on-line approach is that, for large or complex Web sites, extraction may take too long for on-line queries. This can be somewhat alleviated by the high-level caching provided in SeCo. In the future, we plan to investigate techniques for incremental data extraction, where only new data is extracted. This is also useful for the off-line approach if frequent updates are desired.

Service description in SeCo is based on the registration of services within the Service Description Framework model, which describes services at three levels of abstraction: service marts (abstractions of several Web services dealing with the same conceptual objects available on the Web, such as "flights", "hotels", or "restaurants"), access patterns (a specific signature of a service mart, with the characterization of each attribute as input, output, and/or ranking), and service interfaces (a description of the invocation interface of an actual source service), leading from the conceptual representation of Web objects to the implementation of search services. If we combine SeCo with DIADEM, we can easily instantiate service descriptions for any website of a domain: starting from a description of the conceptual objects of a domain, shared between the SeCo service marts and the DIADEM high-level ontology, DIADEM can automatically recognize existing access patterns (by form analysis) and translate them into SeCo service descriptions.

Service execution is performed by an engine that exploits the Service Description Framework. The execution engine consists of a runtime (a Panta Rhei [6] interpreter able to translate an execution plan into a coordinated sequence of service invocations) and a set of service invokers. Low-level service invokers (one for each data source type, including one for on-line DIADEM sources) are implemented following the chain-of-responsibility pattern (see Figure 4). There is no need for a special invoker for off-line DIADEM sources, as those reduce to SPARQL invokers where the data is the result of the off-line extraction. A high-level caching invoker wraps the sequence of low-level invokers to read results from the cache.

5. CONCLUSIONS

Rich object search is one of the major challenges in Web research. In this paper, we show how a combination of SeCo and DIADEM has the potential to address the major challenges involved in object search: (1) the integration of multi-domain data sources, including an easy interface for formulating and refining expressive, multi-domain queries, and (2) the automatic extraction of highly accurate, structured data from most existing web sites.

We plan to further investigate the integration of SeCo and DIADEM. In particular, a further alignment of the conceptual descriptions, access patterns, and service interfaces would be useful. We are currently investigating the automatic extraction of rich access patterns and integrity constraints from existing Web forms. We also plan to develop techniques for incremental data extraction to allow the wrapping of time-sensitive services.

6. REFERENCES

[1] Ceri, S., Brambilla, M., eds.: Search Computing: Trends and Developments. Volume 6585 of Lecture Notes in Computer Science. Springer (2011)
[2] Furche, T., Gottlob, G., Grasso, G., Schallhart, C., Sellers, A.: OXPath: A language for scalable, memory-efficient data extraction from web applications. In: VLDB (2011)
[3] Bozzon, A., Brambilla, M., Ceri, S., Fraternali, P.: Liquid query: Multi-domain exploratory search on the web. In: Proceedings of the 19th International Conference on World Wide Web (WWW '10), New York, NY, USA, ACM (2010) 161–170
[4] Chang, C.-H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10) (2006) 1411–1428
[5] Furche, T., Gottlob, G., et al.: Real understanding of real estate forms. In: WIMS '11, New York, NY, USA, ACM (2011) 13:1–13:12
[6] Braga, D., Corcoglioniti, F., Grossniklaus, M., Vadacca, S.: Panta Rhei: Optimized and ranked data processing over heterogeneous sources. In: ICSOC 2010. Volume 6470 of Lecture Notes in Computer Science. Springer (2010) 715–716