                             Cross-Fertilizing Deep Web Analysis
                                  and Ontology Enrichment

Marilena Oita
INRIA Saclay – Île-de-France; Télécom ParisTech; CNRS LTCI, Paris, France
marilena.oita@telecom-paristech.fr

Antoine Amarilli
École normale supérieure; Télécom ParisTech; CNRS LTCI, Paris, France
antoine.amarilli@ens.fr

Pierre Senellart
Institut Mines–Télécom; Télécom ParisTech; CNRS LTCI, Paris, France
pierre.senellart@telecom-paristech.fr

ABSTRACT

Deep Web databases, whose content is presented as dynamically-generated Web pages hidden behind forms, have mostly been left unindexed by search engine crawlers. In order to automatically explore this mass of information, many current techniques assume the existence of domain knowledge, which is costly to create and maintain. In this article, we present a new perspective on form understanding and deep Web data acquisition that does not require any domain-specific knowledge. Unlike previous approaches, we do not perform the various steps in the process (e.g., form understanding, record identification, attribute labeling) independently but integrate them to achieve a more complete understanding of deep Web sources. Through information extraction techniques and using the form itself for validation, we reconcile input and output schemas in a labeled graph which is further aligned with a generic ontology. The impact of this alignment is threefold: first, the resulting semantic infrastructure associated with the form can assist Web crawlers when probing the form for content indexing; second, attributes of response pages are labeled by matching known ontology instances, and relations between attributes are uncovered; and third, we enrich the generic ontology with facts from the deep Web.

1. ONTOLOGIES AND THE DEEP WEB

The deep Web consists of dynamically-generated Web pages that are reachable by issuing queries through HTML forms. A form is a section of a document with special control elements (e.g., checkboxes, text inputs) and associated labels. Users generally interact with a form by modifying its controls (entering text, selecting menu items) before submitting it to a Web server for processing.

Forms are primarily designed for human beings, but they must also be understood by automated agents for various applications such as general-purpose indexing of response pages, focused indexing [13], extensional crawling strategies (e.g., Web archiving), automatic construction of ontologies [29], etc. However, most existing approaches to automatically explore and classify the deep Web crucially rely on domain knowledge [10, 12, 30] to guide form understanding. Moreover, they tend to separate the steps of form interface understanding and information extraction from result pages, although both contribute [27] to a more faithful view of the backend database schema. The form interface exposes in the input schema some attributes describing the query object, while response pages present this object instantiated in Web records that outline the form output schema. In this paper, we determine a mapping between the input and output schemas which associates the data types corresponding to form elements in the input schema to instances aligned in the output schema.

A harder challenge is to understand the semantics of these data types and how they relate to the object of the form. The input–output schema mapping may give us hints, such as the input schema labels, but this information cannot suffice by itself. This has been addressed in related work using heuristics [26] or assumed domain knowledge [19], which is either manually crafted or obtained by merging different form interface schemas belonging to the same domain. Domain knowledge is, however, not only hard to build and maintain, but also often restricted to a choice of popular domain topics, which may lead to a biased exploration of the deep Web.

We present a new way to deal with this challenge: we initially probe the form in a domain-agnostic manner and transform the information extracted from response pages into a labeled graph. This graph is then aligned with a general-domain ontology, YAGO [23], using the PARIS ontology alignment system [22]. This allows us to infer the semantics of the deep Web source, to obtain new, representative query terms from YAGO for the probing of form fields, and to possibly enrich YAGO with new facts.

2. RELATED WORK

Merging input schemas of deep Web interfaces has been used to acquire domain ontologies automatically [29] and to perform Web database classification and query routing [3]. The main drawback of these approaches is that data integration relies heavily on the interface schema, whose shallow features (the form structure and labels) are neither complete nor representative enough for the actual response records [4].

To obtain response pages, the form has to be filled in and submitted first. Most approaches described in the literature are domain-specific and use dictionary instances [19]. Domain-agnostic probing approaches are more powerful because they do not make such assumptions and incrementally build knowledge that tends to improve the probing and the quality of response pages. However, existing domain-agnostic techniques do not aim at understanding the intensional purpose of the form, but at extensional crawling [5].

Deep Web response pages are an extremely rich source of semi-structured information. Works dealing with response pages assume the form probing mechanism understood and focus on information extraction (IE) from Web records [8]. Extracting the schema from response pages [15] is possible due to the structural similarity of records. Because this schema has been obtained by probing the form and analyzing the response pages, it is called the output schema of the form.

VLDS'12 August 31, 2012. Istanbul, Turkey.
Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
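To make the paper's notion of input schema concrete (the ordered list of labels of form elements, with possible values for non-textual controls), here is a minimal sketch of our own, not the authors' code; it uses only the Python standard library and a made-up book-search snippet, and it assumes each label textually precedes its control:

```python
from html.parser import HTMLParser

class InputSchemaParser(HTMLParser):
    """Collect the ordered (label, control type, possible values) triples
    of a form -- the paper's input schema. Simplifying assumption: each
    <label> textually precedes the control it describes."""
    def __init__(self):
        super().__init__()
        self.schema = []        # ordered list of (label, type, values)
        self._label = ""        # text of the most recent <label>
        self._in_label = False
        self._in_select = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._in_label, self._label = True, ""
        elif tag == "input" and attrs.get("type", "text") == "text":
            self.schema.append((self._label, "text", []))
        elif tag == "select":
            self._in_select = True
            self.schema.append((self._label, "select", []))

    def handle_endtag(self, tag):
        if tag == "label":
            self._in_label = False
        elif tag == "select":
            self._in_select = False

    def handle_data(self, data):
        if self._in_label:
            self._label += data.strip()
        elif self._in_select and data.strip():
            # drop-down entries are the field's possible values
            self.schema[-1][2].append(data.strip())

p = InputSchemaParser()
p.feed('<form><label>Author:</label><input type="text" name="author">'
       '<label>Format:</label><select name="format"><option>Hardcover'
       '</option><option>Paperback</option></select></form>')
print(p.schema)
# [('Author:', 'text', []), ('Format:', 'select', ['Hardcover', 'Paperback'])]
```

A real implementation would additionally handle label `for` attributes, checkboxes, and surrounding free text, as discussed in Section 3.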
The data extracted from deep Web sources through IE processing is typically used to build and/or enrich ontologies [2, 21, 24] or gazetteers [11], or to expand sets of entities [28]. ODE [21] in particular comes closer to our work through its holistic approach, but still needs a domain ontology built by matching different deep Web interfaces. A more important difference appears in the annotation of the data extracted from response pages, which uses heuristic rules for label assignment similar to [26]. By comparison, we use the PARIS alignment algorithm.

The next step is the discovery of the semantic relationships between the entity of the form and the record attributes; for this, several techniques are proposed in the literature. Traditionally, statistical and rule-based methods use the instances in a textual context in order to infer the relation between them [9]. Another option [20] is to match the terminology of a given term with a known concept using semantic resources such as DBpedia or WordNet [18]. Yet another trend is to use classifiers that can predict specific relations (e.g., subClassOf) given enough training and test data [6]. The closest work to ours may be [14], an approach relying on supervised learning that uses a generic ontology to infer types and relations among the data in a Web table. We deal with the more general setting of deep Web interfaces here, and we propose a fully automatic approach that does not require human supervision.

3. ENVISIONED APPROACH

We now present our vision of a holistic deep Web semantic understanding and ontology enrichment process, which is summarized in Figure 1: a Web form is analyzed and probed, record attribute values are extracted from result pages, and their types are mapped to input fields. While these steps are rather standard and we follow well-established best practices, they have never been analyzed in a holistic manner without the assumption of domain knowledge that describes the form interface. The novelty of studying these steps in connection comes from their contribution to the formation of a labeled graph which encompasses data values of unknown types and implicit semantic relations. This graph is further aligned with a generic ontology for knowledge discovery using PARIS.

Form Analysis and Probing. The form interface is presented as an input schema which gives a prescriptive description of the object that the user can query through the form. The input schema is the ordered list of labels corresponding to form elements, possibly together with constraints and possible values (for drop-down lists and other non-textual input fields). Important data constraints or properties of the backend Web database can be discovered through well-designed probing and response page analysis. Some may be precious for a crawler that interacts with the form: Are stop words indexed? Which Boolean connectors are used (conjunctive or disjunctive)? Is the search affected by spelling errors? We perform form probing in an agnostic manner (i.e., without domain knowledge) following [5]. We try to set non-textual input elements or to fill in a textual input field with stop words or with contextual terms extracted from non-textual input controls (e.g., drop-down list entries) or surrounding text (e.g., indications to the user). We rely on the fact that many sites provide a generous index (i.e., a response page can be obtained by inputting a single letter). A more elaborate idea is to use AJAX auto-completion facilities.

Record Identification. If the form has been filled in correctly, we obtain a result page. Otherwise, to identify possible error pages, our method infers a characteristic XPath expression by submitting the form with a nonsense word and tracing its location in the DOM of the response page. This approach uses the fact that the nonsense word will usually be repeated in the error page to present the erroneous input to the user. If not, techniques such as those of [19] can be applied. If the probing yields a response page which does not contain the error pattern, then we determine the generic XPath location of Web records using [16].

Output Schema Construction. A way to build the output schema is to use the reflection of a given domain knowledge in response pages [25]. Another method is to perform attribute alignment [1] for records obtained from different pages. Since Web records represent subtrees which are structurally similar at the DOM level, we extract the values of their textual leaf nodes and cluster these values based on their DOM path. The rationale is that the values found under the same record internal path are attributes of the same type. For instance, "Great Expectations" and "David Copperfield" in Figure 1 both represent literals of the title attribute of a book and have a common location pattern. We define a record feature as the association between a relevant record internal path and its cumulated bag of instances. The output schema for a response page is then defined as the ordered sequence of record features. In practice, we remove uninformative record features from the output schema by restricting ourselves to paths which contain different instances across various response pages.

Input and Output Schema Mapping. We align input fields of the form with record features of the result pages in the following fashion. For non-textual form elements such as drop-down lists, we check whether their values trivially match one of the record features of the output schema. For textual form elements, we use a more elaborate idea. Due to binding patterns, query instances which appear at a certain record internal path should appear again at the same location when they are submitted in the "right" input field for this path. If we submit them in an unrelated field, however, we should obtain an error page or unsuitable results. Formally, given a record feature f of the output schema, we can see if it maps to a textual input t by filling in t with one of the initial instances of f (say i) and submitting the form. Either we obtain an error page, which means f and t should not be mapped, or we obtain a result page in which we can use f's record internal path to extract a new bag I of instances for f. In this case, we say that t and f are mapped if all instances in I are equal to i or contain it as a substring (i.e., i appears again at f's location pattern). We obtain the mapping by performing these steps for all couples (f, t).

Most of the time, the input and output schemas do not match exactly. The attributes that cannot be matched are usually either explicit in the input schema (e.g., given by non-textual inputs, like drop-down lists), or only present in the output schema (e.g., the price of a book).

Graph Generation. We represent the data extracted from the Web records as RDF triples [17], in the following manner:
1. each record is represented as an entity;
2. all records are of the same class, stated using rdf:type;
3. the attribute values of records are viewed as literals;
4. each record links to its attribute values through the relation (i.e., predicate) that corresponds to the record internal path of the attribute type in the response page.
Since the triples form a labeled directed graph, it is possible to add much more information to the representation, provided that we have the means to extract it. An idea would be to include a more detailed representation of a record by following the hyperlinks that we identify in its attribute values and replacing them in the original response page with the DOM tree of the linked page. In this way, the extraction can be done on a more complete representation of the backend database. We can also add complementary data from various sources, e.g., Web services or other Web forms belonging to the same domain.
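The clustering of leaf values described in the Output Schema Construction step can be sketched as follows. This is our simplification, not the authors' code: Web records are given as nested dictionaries instead of real DOM subtrees, and uninformative features are pruned across the records of a single page rather than across response pages.

```python
from collections import defaultdict

def record_features(records):
    """Cluster the textual leaf values of structurally similar records by
    their internal path, yielding the paper's record features:
    (record internal path, cumulated bag of instances)."""
    bags = defaultdict(list)

    def walk(node, path):
        if isinstance(node, dict):
            for tag, child in node.items():
                walk(child, path + (tag,))
        else:                       # textual leaf node
            bags[path].append(node)

    for rec in records:
        walk(rec, ())
    # keep informative features only: paths whose instances vary
    return {p: vals for p, vals in bags.items() if len(set(vals)) > 1}

# two Web records with the same internal structure (cf. Figure 1)
recs = [
    {"h3": "Great Expectations", "div": {"span": "Charles Dickens"}},
    {"h3": "David Copperfield",  "div": {"span": "Charles Dickens"}},
]
print(record_features(recs))
# {('h3',): ['Great Expectations', 'David Copperfield']}
```

The title path is kept as a record feature; the author path happens to carry identical instances in this toy example, so it is pruned as uninformative.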
[Figure: a book search form (Author, Title, Publisher) undergoes form probing; wrapper induction turns the result page into a list of records; RDF triples generation yields a labeled graph (entities ?e1, ?e2 of class ?class with literals such as "Great Expectations", "Charles Dickens", "Dover Thrift Editions"); input and output schema mapping, ontology alignment, and ontology enrichment relate this graph to YAGO (entities Othello, Great Expectations, and David Copperfield (novel) of class Book, with relations y:created and y:hasName), which in turn supplies new probing terms for the form.]

                                                              Figure 1: Overview of the envisioned approach
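The triple-generation rules of the Graph Generation step can be sketched in a few lines. The URIs under http://example.org/ are placeholders of our own, the serialization is plain N-Triples rather than the authors' actual representation, and we assume each record feature carries exactly one instance per record:

```python
def records_to_ntriples(features, base="http://example.org/"):
    """Serialize extracted records as RDF triples (N-Triples syntax):
    one entity per record, a shared rdf:type, and one predicate per
    record internal path."""
    rdf_type = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
    lines = []
    n = max(len(v) for v in features.values())   # number of records
    for i in range(n):
        ent = f"<{base}record/{i}>"
        lines.append(f"{ent} {rdf_type} <{base}Record> .")
        for path, values in features.items():
            # the record internal path plays the role of the predicate
            pred = f"<{base}path/{'/'.join(path)}>"
            literal = values[i].replace('"', '\\"')
            lines.append(f'{ent} {pred} "{literal}" .')
    return "\n".join(lines)

features = {("h3",): ["Great Expectations", "David Copperfield"],
            ("div", "span"): ["Charles Dickens", "Charles Dickens"]}
print(records_to_ntriples(features))
```

Each record thus becomes an entity of the shared class with one literal-valued triple per record feature, which is exactly the shape of the labeled graph that is later handed to PARIS.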

Ontology Alignment. The ontology that we compile from the result pages is aligned with a large reference ontology. We use YAGO [23], though our approach can be applied to any reference ontology. We use PARIS [22] to perform the ontology alignment. Unlike most other systems, PARIS is able to align both entities and relations. It does so by bootstrapping an alignment from the matching literals and propagating evidence based on relation functionalities. Through the alignment, we discover the class of the entities, the meaning of record attributes, and the actual relation that exists between them. Two main adaptations are needed to use PARIS in the deep Web data alignment process. First, extracted literals usually differ from those of YAGO because of alternate spellings or surrounding stop words. A typical case on Amazon is the addition of related terms, e.g., "Hamlet (French Edition)" instead of just "Hamlet". To mitigate this problem, we normalize the literals, eliminating punctuation and stop words. Pattern identification in the data values of the same type could increase the probability of extracting cleaner values. We are working on a way to index YAGO literals in a manner that is resilient to the small differences we wish to ignore. A promising approach to do this is shingling [7].

Second, an entity-to-literal relation in the labeled graph may not necessarily correspond to a single edge in the reference ontology, but to a sequence of edges. This amounts to a join of the involved relations; a typical case in our prototype is the "author" attribute, which is linked to a record entity through the two-step YAGO path "y:created y:hasPreferredName". To ensure that the alignment with joins, typically costly, can be performed in practice, we limit the maximal length of joins. A consequence is that PARIS will explore a smaller fraction of YAGO in the search for relations relevant to the data of our labeled graph. In addition to the use of record attribute values as literals, PARIS could use the form labels (through the input–output mappings) to guide the alignment and favor YAGO relations with a similar name. Some record instances do not align with any literal of the ontology, because they represent information which is unknown to YAGO.

Form Understanding and Ontology Enrichment. Ontology alignment gives us knowledge about the data types, the domains and ranges of record attributes, and their relation to the object of the form (in our case, a book). The propagation of this knowledge to the input schema through the input–output mapping (for the form elements that have been successfully mapped) results in a better understanding of the form interface. On the one hand, we can infer that a given field of the Amazon advanced search form expects author names, and leverage YAGO to obtain representative author names to fill in the form. This is useful in intensional or extensional automatic crawl strategies of deep Web sources. On the other hand, we can generate new result pages for which data location patterns are already known and enrich YAGO through the alignment that we have already determined.

There are three main possibilities to enrich the ontology. First, we can add to the ontology the instances that did not align. For instance, we can use the Amazon book search results to add to YAGO the books for which it has no coverage. Second, we can add facts (triples) that were missing in YAGO. Third, we can add the relation types that did not align. For instance, we can add information about the publisher of a book to YAGO. This latter direction is more challenging, because we need to determine whether the relation types contain valuable information. One safe way to deal with this relevance problem is to require attribute values to be mapped to a form element in the input schema. We can then use the label of the element to annotate them.

4. PRELIMINARY EXPERIMENTS

We have prototyped this approach for the Amazon book advanced search form¹. Obviously, we cannot claim any statistical significance of the results we report here, but we believe that the approach, because it is generic, can be successfully applied to other sources of the deep Web.

¹ http://www.amazon.com/gp/browse.html?node=241582011

Our preliminary implementation performed agnostic probing of the form, wrapper induction, and mapping of input–output schemas. It generated a labeled graph with 93 entities and 10 relation types, out of which 2 (title and author) are recognized by YAGO. Literals underwent a semi-heuristic normalization process (lowercasing, removal of parenthesized substrings). We then replaced each extracted
literal with a similar literal in YAGO, if the similarity (in terms of the number of common 2-grams) was higher than an arbitrary threshold.

We aligned this graph with YAGO by running PARIS for 15 iterations, i.e., a run time of 7 minutes (most of it was spent loading YAGO; the proper computation took 20 seconds). Though the vast majority of the books from the dataset were not present in YAGO, the 6 entity alignments with the best confidence were books that had been correctly aligned through their title and author. To limit the effect of noise on relation alignment, we recomputed relation alignments on the entity alignments with the highest confidence; the system was thus able to properly align the title and author relations with "y:hasPreferredName" and "y:created y:hasPreferredName", respectively. These relations were associated with the record internal paths of the output schema attributes and propagated to form input fields.

5. DISCUSSION

Our vision is that of a holistic system for deep Web understanding and ontology enrichment, where each stage of the process (form analysis, information extraction, schema matching, ontology alignment, etc.) would benefit from every other part. This is an ambitious project, but our current prototype already exhibits promising results.

Many challenges remain to be tackled: resilience to outliers and noise resulting from imperfect literal matching and information extraction; proper management of the confidence in the results of each automatic task, especially when they are used as the input of another task; identification of new relation types of interest among those extracted from a Web source; and integration of the information contained in several different deep Web sources of the same domain.

Acknowledgments

We acknowledge Fabian Suchanek for initial discussions on this topic. The research has been funded by the European Union's seventh framework programme, in the setting of the European Research Council grant Webdam, agreement 226513, and the FP7 grant ARCOMEM, agreement 270239.

6. REFERENCES

[1] M. Alvarez, A. Pan, J. Raposo, F. Bellas, and F. Cacheda. Extracting lists of data records from semi-structured Web pages. Data and Knowledge Engineering, 64(2), 2008.
[2] Y. J. An, S. A. Chun, K.-C. Huang, and J. Geller. Enriching ontology for deep Web search. In Proc. DEXA, 2008.
[3] Y. J. An, J. Geller, Y.-T. Wu, and S. A. Chun. Semantic deep Web: automatic attribute extraction from the deep Web data sources. In Proc. SAC, 2007.
[4] R. Balakrishnan and S. Kambhampati. SourceRank: Relevance and trust assessment for deep Web sources based on inter-source agreement. In Proc. WWW, 2011.
[5] L. Barbosa and J. Freire. Siphoning hidden-Web data through
[10] T. Furche, G. Gottlob, G. Grasso, X. Guo, G. Orsi, and C. Schallhart. Real understanding of real estate forms. In Proc. WIMS, 2011.
[11] T. Furche, G. Grasso, G. Orsi, C. Schallhart, and C. Wang. Automatically learning gazetteers from the deep Web. In Proc. WWW, 2012.
[12] B. He, K. C.-C. Chang, and J. Han. Discovering complex matchings across Web query interfaces: A correlation mining approach. In Proc. KDD, 2004.
[13] S. Kumar, A. K. Yadav, R. Bharti, and R. Choudhary. Accurate and efficient crawling the deep Web: Surfacing hidden value. International J. Computer Science and Information Security, 9(5), 2011.
[14] G. Limaye, S. Sarawagi, and S. Chakrabarti. Annotating and searching Web tables using entities, types and relationships. Proc. VLDB, 3(1), 2010.
[15] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting schema from semistructured data. In Proc. SIGMOD, 1998.
[16] M. Oita and P. Senellart. Own work undergoing double-blind reviewing, 2012.
[17] Resource Description Framework (RDF): Concepts and abstract syntax. W3C Recommendation. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.
[18] C. Reynaud and B. Safar. Exploiting WordNet as background knowledge. In Proc. ISWC Ontology Matching (OM-07) Workshop, 2007.
[19] P. Senellart, A. Mittal, D. Muschick, R. Gilleron, and M. Tommasi. Automatic wrapper induction from hidden-Web sources with domain knowledge. In Proc. WIDM, 2008.
[20] G. Stoilos, G. B. Stamou, and S. D. Kollias. A string metric for ontology alignment. In Proc. ISWC, 2005.
[21] W. Su, J. Wang, and F. H. Lochovsky. ODE: Ontology-assisted data extraction. ACM Trans. Database Syst., 34(2), 2009.
[22] F. M. Suchanek, S. Abiteboul, and P. Senellart. PARIS: Probabilistic alignment of relations, instances, and schema. Proc. VLDB Endow., 5(3), 2011.
[23] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge unifying WordNet and Wikipedia. In Proc. WWW, 2007.
[24] M. Thiam, N. Pernelle, and N. Bennacer. Contextual and metadata-based approach for the semantic annotation of heterogeneous documents. In Proc. SeMMA, 2008.
[25] N. Tiezheng, Y. Ge, S. Derong, K. Yue, and L. Wei. Extracting result schema based on query instances in the deep Web. Wuhan University J. Natural Sciences, 12(5), 2007.
[26] J. Wang and F. H. Lochovsky. Data extraction and label assignment for Web databases. In Proc. WWW, 2003.
     keyword-based interfaces. J. Information and Data
                                                                             [27] J. Wang, J.-R. Wen, F. Lochovsky, and W.-Y. Ma.
     Management, 1(1), 2004.
                                                                                  Instance-based schema matching for Web databases by
 [6] E. Beisswanger. Exploiting relation extraction for ontology                  domain-specific query probing. In Proc. VLDB, 2004.
     alignment. In Proc. ISWC, 2010.
                                                                             [28] R. C. Wang and W. W. Cohen. Language-independent set
 [7] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig.
                                                                                  expansion of named entities using the Web. In Proc. ICDM,
     Syntactic clustering of the Web. Computer Networks,                          2007.
     29(8-13), 1997.
                                                                             [29] W. Wu, A. Doan, C. Yu, and W. Meng. Bootstrapping domain
 [8] J. Caverlee, L. Liu, and D. Buttler. Probe, cluster, and                     ontology for semantic Web services from source Web sites. In
     discover: Focused extraction of QA-pagelets from the deep                    Proc. VLDB Workshop on Technologies for E-Services, 2005.
     Web. In Proc. ICDE, 2004.
                                                                             [30] X. Yuan, H. Zhang, Z.-Y. Yang, and Y. Wen. Understanding
 [9] P. Cimiano, G. Ladwig, and S. Staab. Gimme’ the context:                     the search interfaces of the deep Web based on domain model.
     Context-driven automatic semantic annotation with
                                                                                  In Proc. ICIS, 2009.
     C-PANKOW. In Proc. WWW, 2005.