Contextual Data Extraction and Instance-Based Integration

Lorenzo Blanco  Valter Crescenzi  Paolo Merialdo  Paolo Papotti
Dipartimento di Informatica ed Automazione
Università degli Studi Roma Tre
blanco,crescenz,merialdo,papotti@dia.uniroma3.it

ABSTRACT
We propose a formal framework for an unsupervised approach that tackles two problems at the same time: the data extraction problem, i.e., generating the extraction rules needed to obtain data from web pages, and the data integration problem, i.e., integrating the data coming from several sources. We motivate the approach by discussing its advantages over the traditional "waterfall approach", in which data are wholly extracted before the integration starts, without any mutual dependency between the two tasks. In this paper, we focus on data that are exposed by structured and redundant web sources. We introduce novel polynomial algorithms to solve the stated problems and present theoretical results on the properties of the solution generated by our approach. Finally, a preliminary experimental evaluation shows the applicability of our model with real-world websites.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This article was presented at the workshop Very Large Data Search (VLDS) 2011. Copyright 2011.

1. INTRODUCTION
The development of scalable techniques to extract and integrate data from fairly structured large corpora available on the web is a challenging issue, because the web scale imposes the use of unsupervised and domain independent techniques. To cope with the complexity and the heterogeneity of web data, state-of-the-art approaches focus on information organized according to specific patterns that frequently occur on the web. Meaningful examples are presented in [6], which focuses on data published in HTML tables, and information extraction systems, such as TextRunner [1], which exploit lexical-syntactic patterns. As noticed in [6], even if only a small fraction of the web is organized according to these patterns, due to the web scale the amount of data involved is impressive: in their case, more than 154 million tables were extracted from 1.1% of the considered pages.

In large data-intensive websites, we observe two important characteristics that suggest new opportunities for the automatic extraction and integration of web data:

• local regularities: in these sites, large amounts of data are usually offered by thousands of pages, each encoding one tuple in a local HTML template. For example, each page shown in Figure 1 comes from a different source and publishes information about a single company stock.

• global information redundancy: at the web scale many sources provide similar information. The redundancy occurs both at the schema level (the same attributes are published by more than one source) and at the instance level (some objects are published by more than one source). In our example, many attributes are present in all the sources (e.g., the company name, last trade price, volume), while others are published only by a subset of the sources (e.g., the "Beta" indicator). At the extensional level, there is a set of stock quotes that are published by several sources. As web information is inherently imprecise, redundancy also implies inconsistencies; that is, sources can provide conflicting information for the same object (e.g., a different value for the volume of a given stock).

These observations lead us to focus on pages that are published following the one-tuple-per-page pattern: in each structured page you can find information about a single tuple. If we abstract this representation, we may say that a collection of structurally similar pages provided by the same site corresponds to a relation. According to this abstraction, the websites for the pages in Figure 1 expose their own version of the "StockQuote" relation.¹

¹For the sake of simplicity, we consider only the one-tuple-per-page pattern. Other variations of this pattern can be easily developed, for example by preprocessing the pages with tools that fragment the HTML tables into rows [7].

Starting from the crawled web pages (for instance, gathered by the specialized crawler introduced in [3]), our goal is: (i) to transform the web pages coming from each source into a relation, and (ii) to integrate these relations, creating a database containing the information provided by all the sources. A state-of-the-art solution to this problem is a two-step waterfall approach, where a schema matching algorithm is applied over the relations returned by automatically generated wrappers. However, when a large number of sources is involved and a high level of automation is required, important issues may arise:

• Wrapper Inference Issues: since wrappers are automatically generated by an unsupervised process, they can produce imprecise extraction rules (e.g., rules that extract irrelevant information mixed with data of the domain). To obtain correct rules, the wrappers should be evaluated and refined manually.

• Integration Issues: the relations extracted by automatically generated wrappers are "opaque", i.e., their attributes are not associated with any (reliable) semantic label. Therefore the matching algorithm must rely on an instance-based approach, which considers attribute values to match schemas.
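To illustrate the instance-based approach, here is a minimal sketch (a toy example of our own, not part of the system): two sources publish columns with opaque labels, and candidate matches are found by comparing the published values.

```python
def column_similarity(col_a, col_b):
    """Jaccard similarity between the value sets of two opaque columns."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b)

# Two sources publishing stock data; the column labels carry no semantics.
source_1 = {"c1": ["YHOO", "AAPL"], "c2": ["17.01", "351.99"]}
source_2 = {"k1": ["17.01", "352.10"], "k2": ["YHOO", "AAPL"]}

# Pair each column of source_1 with its most similar column of source_2.
matches = {
    name: max(source_2, key=lambda k: column_similarity(col, source_2[k]))
    for name, col in source_1.items()
}
# matches: {"c1": "k2", "c2": "k1"}
```

Even in this toy case, a conflicting value (352.10 vs. 351.99) lowers the similarity of the correct pair, which is why noisy sources make instance-based matching delicate.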
However, due to errors introduced by the publishing process, instance-based matching is challenging because the sources may provide conflicting values. Also, imprecise extraction rules return wrong, and thus inconsistent, data.

In [2] we presented a best-effort solution to these issues that takes advantage of the mutual coupling between the wrapper inference and the data integration tasks. In the present paper, we investigate the foundations of the problem and propose a principled solution based on the following contributions: (i) we propose a formal setting to state the data extraction and the data integration problems for redundant and structured web sources; (ii) we formulate a set of hypotheses that capture a few natural constraints that characterize this kind of web sources; (iii) we propose novel unsupervised polynomial algorithms to solve the stated problems whenever these hypotheses hold; (iv) we present an experimental evaluation of our model with real-world websites.

In the next section we describe a generative model of the web pages to introduce our formal setting of structured and redundant sources. Section 3 contains the algorithms to solve the integration and extraction problems in the case of perfectly overlapping sources. In Section 4 we show how removing this hypothesis affects our solution, while in Section 5 we discuss preliminary experiments with a set of sources gathered from the Web. Section 6 discusses related work and concludes the paper.

Figure 1: Two web pages containing data about stock quotes from the Reuters and Google Finance websites.

2. THE GENERATIVE MODEL
We are interested in extracting and integrating all the available information about a target entity, such as the STOCKQUOTE entity of our running example. As several sources on the Web publish information about the same entity, we can imagine that there exists a hidden relation T, which contains all the true information about the objects that belong to the entity, and that sources generate their pages by taking data from T.

We call conceptual instances the set of tuples I of the relation T. Each tuple I ∈ T represents a real-world object of the target entity of interest. For example, in the case of the STOCKQUOTE entity, the conceptual instances of T model the data about the Apple stock quote, the Yahoo! stock quote, and so on. T has a set of attributes A called conceptual attributes. In our example they represent the attributes associated with a stock quote, such as the company name, the current trade price, the volume, and so on.

Given a set of sources S = {S1, ..., Sm}, each source Si, i = 1...m, can be seen as the result of a generative process applied over the hidden relation T. The attributes published by a source are called physical attributes, as opposed to the conceptual attributes of T, and we write S(a) to denote that a source S publishes a physical attribute a, and a ∈ A to state that a physical attribute a contains data from a conceptual attribute A. A source publishes information about a subset of the conceptual instances, and different sources may publish different subsets of the conceptual attributes. To model the presence of the conflicting data that usually occur among redundant sources, we assume that sources are noisy: they may introduce errors, imprecise values, or null values over the data picked from the hidden relation. As depicted in Figure 2, for each source Si we can abstract the page generation process as the application of the following operators over the hidden relation:

• Selection σi: returns a relation containing a subset of the conceptual instances, σ(I) ⊆ I.

• Projection πi: returns a relation containing a subset of the conceptual attributes, π(A) ⊆ A.

• Error ei: a function that returns a relation in which each correct value is either kept or replaced with a null value, a synthetic value, or a value similar to the correct one.

• Encode λi: an encoding function that produces a web page for each tuple by embedding its values into an HTML template.

Figure 2: The publishing process: the web sources are views over the hidden relation generated by four operators.

The set of pages published by a source Si can be thought of as a view over the hidden relation, obtained by applying the above operators as follows: Si = λi(ei(πi(σi(T)))). From this perspective, the extraction and the integration problems can be stated in terms of these operators. Extraction becomes the inversion of the λ operator, that is, obtaining for each source Si the associated relation Vi = ei(πi(σi(T))). Integration becomes the problem of reconstructing T from the views associated with the sources.

Notice that both problems are far from trivial, as state-of-the-art automatic wrapper inference systems are not able to create perfect wrappers, and the integration task is further complicated by the presence of errors and by the absence of reliable semantic labels.

In the following we discuss under which assumptions on the intensional and extensional redundancy exhibited by the sources our approach is able to deal with a bounded amount of error.

2.1 Intensional Redundancy
We now discuss two properties of the generative model. The first property expresses that the data published by each source are locally consistent. That is, among the physical attributes published by the same source, there cannot be distinct attributes with the same semantics. For example, if a website states that the current value of the stock quote "YHOO" is 17.01, there cannot be another place in the same site where you can find a different value with the same semantics. Therefore, we can write:

PROPERTY 1. Local consistency:
∀ai, aj ∈ A : S(ai) = S(aj) ⇒ ai = aj
(in a conceptual attribute A there cannot exist two physical attributes coming from the same source).

The second property formalizes the presence of redundancy at the intensional level. Namely, we assume that every possible pair of conceptual attributes is published by at least one source. Let S(Ai) denote a predicate that returns true if the source S publishes the conceptual attribute Ai. A set of sources S is called intensionally redundant if the following property holds:

PROPERTY 2. Intensional redundancy:
∀Ai, Aj : i ≠ j ∃ S : S(Ai) ∧ S(Aj)
(every possible pair of conceptual attributes is published by at least one source).

Notice that, given any set of websites, this property may hold only for appropriately chosen subsets. However, the web scale represents an opportunity to gain enough redundancy: (i) in each domain, there is usually a subset of attributes that is published by most of the sources;² (ii) as the number of considered websites increases, the probability of meeting a new conceptual attribute decreases, and the probability of intensional redundancy increases. In the following we call core-attributes the attributes for which the intensional redundancy holds, and we call rare-attributes all the others.

²In [11] the authors write "there is a core set of attributes that appear in a large number of items".

2.2 Extensional Redundancy
At the extensional level, we assume that sources publish a common set of instances. For the sake of presentation, in the next section we rely on the hypothesis that all the sources publish the same set of instances I. Later, in Section 4, we discuss how to remove this hypothesis.

3. EXTRACTION AND INTEGRATION ALGORITHMS
Our solutions to the extraction and integration problems are discussed in this section: we propose the problem statements, the definition of solutions, and the algorithms that solve them in polynomial time.

Given a set of sources S, each Si publishes a view of the hidden relation T such that Vi = ei(πi(σi(T))). The integration problem can be thought of as the creation of sets of physical attributes m1, ..., mn, called mappings, such that each attribute a belongs to a mapping m and each mapping contains all and only the attributes with the same semantics. The problem can be defined as follows:

PROBLEM 1. Integration Problem: given a set of source views V = V1, ..., Vn, where Vi = ei(πi(σi(T))), find a set of mappings M such that M = {m : a1, a2 ∈ m ⇔ a1, a2 ∈ A}.

Intuitively, we solve the problem by evaluating the physical attributes of each source and by building aggregations of attributes with the same semantics across the sources. If at the end of the process each mapping contains all and only the physical attributes with the same semantics, we have a solution for the problem. For example, given a1, a3 ∈ V1 and a2 ∈ V2, with a1, a2 having the same semantics, a solution is m1 = {a1, a2} and m2 = {a3}.

If the sources published correct data only, then a naive greedy algorithm would easily solve the problem above. However, real sources introduce noise in the values (modeled by the error function e) that can make the integration difficult or even impossible.

To identify physical attributes with the same semantics, we rely on a distance function d(ai, aj) among the values of a set of instances corresponding to the same real-world object.³ This function compares aligned values and returns a score between 0 and 1: the more similar the values, the lower the distance. As the distance function works by comparing values of aligned instances, it can be easily extended to work also on conceptual attributes. We denote by d(a, A) the distance between the values of the physical attribute a and the values of the conceptual attribute A.

³We assume that instances can be aligned by applying a standard record-linkage technique [9].

In the following, we study a class of error functions for which our algorithm can compute a solution by bounding the amount of errors that sources are allowed to introduce.
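As an illustration, the instance-based distance over aligned values can be sketched as follows (a hypothetical implementation: the edit-ratio comparison of single values and the averaging scheme are our own choices, not prescribed by the model):

```python
from difflib import SequenceMatcher

def value_distance(u, v):
    """Distance between two aligned values, in [0, 1] (0 = identical)."""
    if u is None or v is None:
        return 1.0                      # nulls carry no evidence of a match
    return 1.0 - SequenceMatcher(None, str(u), str(v)).ratio()

def attribute_distance(a_i, a_j):
    """d(a_i, a_j): mean value distance over the instances shared by two
    physical attributes, each given as a dict: instance id -> value."""
    shared = a_i.keys() & a_j.keys()    # instances aligned by record linkage
    if not shared:
        return float("inf")             # no overlap: distance undefined
    return sum(value_distance(a_i[k], a_j[k]) for k in shared) / len(shared)

def attribute_to_mapping_distance(a, mapping):
    """d(a, A): distance between a physical attribute and a mapping of
    attributes, here taken as the average of the pairwise distances."""
    ds = [attribute_distance(a, b) for b in mapping]
    return sum(ds) / len(ds)
```

With this sketch, two price columns that disagree slightly on a few stocks still end up much closer to each other than either is to a volume column, which is the behavior the separability assumptions below rely on.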
3.1 Integration Problem
We first discuss the integration problem while ignoring the extraction issues; we then discuss the main properties of the extraction rules and, finally, how the integration and the extraction problems can be tackled together. Therefore, for the time being we assume we have a wrapper generator that is capable of perfectly inverting the encode operator λ. In other words, we do not work on web pages, but directly on the views of T published by the sources.

For every conceptual attribute A, let tA denote a minimal threshold such that any physical attribute ai belongs to A if, for each aj ∈ A, d(ai, aj) < tA. For example, in the finance domain, a low threshold should be associated with the conceptual attribute Max value of a stock quote. This is required as there are other conceptual attributes, like the current Price and the Min value, that have similar values. On the other hand, the mapping for the trading Volume conceptual attribute can have a higher threshold, since it does not usually assume values close to those of other attributes. Note that tA is an ideal, unknown threshold: it is not given as input to the integration problem, and it is not necessary to know it a priori to compute the solution.

We consider as given an unsupervised wrapper generator system W. A wrapper w is an ordered set of extraction rules, w = {er1, ..., erk}, that apply over a web page: each rule extracts a string from the HTML of the page. We denote by er(p) the string returned by the application of the rule er over the page p. The application of a wrapper w over a page p, denoted w(p), returns a tuple t = ⟨er1(p), ..., erk(p)⟩; therefore, the application of a wrapper over the set of pages of a source S returns a relation w(S), which has as many attributes as the number of extraction rules of the wrapper. A column of the relation is a vector of values denoted V(eri): it is the result of the application of an extraction rule eri over the pages of a source.

In order to solve the integration problem, it is required that the publishing errors cannot introduce enough noise to confuse the semantics of a physical attribute. We call this property separable semantics:

PROPERTY 3. Separable semantics:
∀A1, A2, ai ∈ A1, aj ∈ A2 : ai ≠ aj ∧ A1 ≠ A2 ⇒ d(ai, aj) > max(tA1, tA2)
(it is possible to distinguish the semantics of physical attributes).

In order to solve the integration problem with noisy sources, we define the greedy clustering Algorithm 1.

Algorithm 1 ABSTRACTINTEGRATION
Input: A set of locally consistent, intensionally redundant sources with separable physical attributes.
Output: The correct set of mappings M.
Let G = (N, E) be a graph where every attribute ai of every source Si ∈ S is a node n ∈ N. For every pair of distinct nodes ai, aj ∈ N such that S(ai) ≠ S(aj), add an edge e between them to E and let d(ai, aj) be the weight of e. Let m(ai) be the mapping containing the attribute ai.
1. Add to M a mapping m = {ai} for each node ni ∈ N,
2. insert in a list L the edges E,
3. sort L by the weight of the edges in ascending order,
4. for each edge (a1, a2) ∈ L:
   (a) let m be the union of the attributes in m(a1) and m(a2);
   (b) if in m there is no pair ai, aj such that S(ai) = S(aj)
   (c) then add m to M and remove m(a1), m(a2) from M;
   (d) else break.

We are now ready to prove the correctness of the integration algorithm.

LEMMA 3.1. ABSTRACTINTEGRATION is correct.

PROOF. Moved to Appendix A.

ABSTRACTINTEGRATION is O(n²) over the total number of physical attributes; in fact, most of the time is required to create the edges of the graph G.

In the following we introduce the extraction problem, that is, how to get the physical attributes we considered as input of the integration problem.

3.2 Extraction Rules
In our framework, a data source S is a collection of pages S = p1, ..., pn from the same website, such that each page publishes information about one object of a real-world entity of interest. We distinguish between two different types of values that can appear in a page: target values, that is, values that are derived from the hidden relation T, and noise values, that is, values that are not of interest for our purpose (e.g., advertising, template, layout, etc.).

We say that an extraction rule er* is correct if for every given page it extracts a value of the same conceptual attribute (i.e., target values with the same semantics), or a null value if the value for the attribute is missing in that page. If an extraction rule only extracts noise values, it is considered noisy. We also say that an extraction rule erw is weak if it mixes either target values with different semantics or target values with noise values.

Unsupervised wrapper generators are powerful enough to infer the correct extraction rules needed to cover the data exposed by what we call regularly structured websites.

PROPERTY 4. Regularly structured sources: The sources S = {Si} are regularly structured w.r.t. a given unsupervised wrapper generator system W if W generates for each source Si ∈ S a set of rules wi containing all the correct rules.

However, wrapper generators cannot automatically identify, among the generated rules, which are the correct ones. They also produce weak rules since, at wrapper generation time, there is not enough information to automatically establish whether a rule is correct or weak. The integration algorithm has been presented considering only the correct rules (i.e., physical attributes). However, noisy rules, if considered along with the correct ones, are harmless, as they can be identified and deleted during the integration step. They will eventually generate singleton mappings, since the distance between a noisy rule and a correct rule prevents them from grouping. Similar arguments apply to distances amongst noisy rules.

Weak rules require a more detailed discussion, as unfortunately they cannot be identified and disregarded at the wrapper generation step. In the following we show that, if we keep the same assumptions introduced for the integration problem, we can always identify weak rules during the integration step.

3.3 Extraction Problem
The extraction problem is defined as follows:

PROBLEM 2. Extraction Problem: given a set of sources S = {Si}, produce a set of wrappers W* = {wi} such that wi contains all and only the correct rules for Si.

We now describe how we leverage the redundant information among different sources to identify and filter out the weak rules. Let eri and erj be two extraction rules. We say that two extraction rules "overlap" if they extract from a page the same occurrence of the same string. In this case, one of them must be a weak rule. In other terms, if many rules extract the same value occurrence from at least one page, only one of them is a correct rule and all the others are weak ones. With an abuse of notation, we will write er ∈ A to state that an extraction rule extracts at least a correct value of the conceptual attribute A. Notice that, as a weak rule erw can extract values from n conceptual attributes, we can say erw ∈ A1, ..., An.

4. NON-OVERLAPPING SOURCES
So far we have simplified the discussion by hypothesizing that every source publishes data about every object in I. In this section we remove this simplification and use IiA to denote the subset of I for which Si provides values of the conceptual attribute A.
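For concreteness, the greedy clustering of Algorithm 1 above can be sketched as follows (an illustrative Python rendering; the data layout, names, and any distance function plugged in are our own assumptions, not part of the formal algorithm):

```python
from itertools import combinations

def abstract_integration(attrs, source_of, dist):
    """Greedy clustering sketch of Algorithm 1 (ABSTRACTINTEGRATION).

    attrs     - list of physical attribute ids
    source_of - dict: attribute id -> source id
    dist      - function (attr, attr) -> distance in [0, 1]
    Returns the set of mappings as a list of frozensets."""
    # Step 1: one singleton mapping per attribute (node).
    mapping_of = {a: frozenset([a]) for a in attrs}
    # Steps 2-3: edges between attributes of distinct sources, by weight.
    edges = sorted(
        (e for e in combinations(attrs, 2) if source_of[e[0]] != source_of[e[1]]),
        key=lambda e: dist(*e),
    )
    # Step 4: merge mappings along increasing-weight edges.
    for a1, a2 in edges:
        m1, m2 = mapping_of[a1], mapping_of[a2]
        if m1 == m2:
            continue                      # already grouped
        merged = m1 | m2
        sources = [source_of[a] for a in merged]
        if len(sources) != len(set(sources)):
            break                         # step (d): same-source pair, stop
        for a in merged:
            mapping_of[a] = merged
    return list(set(mapping_of.values()))
```

Under the separable-semantics property, every intra-mapping edge is lighter than every inter-mapping edge, so the first edge that would put two attributes of the same source together marks the point where the sorted list switches from same-semantics to different-semantics pairs, which is why the sketch (like step 4(d)) stops there.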
These intuitions are applied in the following greedy algorithm, which solves the extraction problem:

Algorithm 2 ABSTRACTEXTRACTION
Input: A set of locally consistent, intensionally redundant sources with separable physical attributes; the set of wrappers W produced by a wrapper generator system W w.r.t. which the sources are regularly structured.
Output: A set of wrappers W* that do not contain weak rules.
1. while there is an er ∈ W which is not marked as correct:
   (a) let d(V(eri), V(erj)) be the minimal distance between the values of two extraction rules in W such that at least one of them is not marked as correct;
   (b) mark eri and erj as correct (they are correct rules);
   (c) remove from W all the rules that overlap with eri (they are weak rules);
   (d) remove from W all the rules that overlap with erj (they are weak rules).
2. now W is W*.

The algorithm ABSTRACTEXTRACTION takes as input a set of wrappers W and computes W*, which does not contain weak rules. To explain the algorithm, we rely on the example in Figure 3. Consider two websites and two objects (say, two stock quotes) that are published by both sites. In the figure, a circle represents the values extracted by an extraction rule, and its number represents the website it has been executed on. For example, in the diagram on the left, the dark circle marked with 1 extracts from the first website the values 10 for the first object and 11 for the second object. Notice that the input of the algorithm is the set of circles annotated with the extracted values and the website of provenance. Let us say now that in step (a) the algorithm identifies the dark circles as the closest ones, and marks them as correct in step (b). At this point both the circles with horizontal stripes overlap with correct rules and can therefore be removed in steps (c) and (d). In the diagram on the right we show the resulting scenario: the dark circles are now the closest rules and are marked as correct. The remaining dashed circles do not match at all (i.e., they are noisy rules) and lead to the creation of two singleton mappings. To prove that the above algorithm is correct we introduce the following lemma:

LEMMA 3.2. ABSTRACTEXTRACTION is correct.

PROOF. Moved to Appendix A.

Figure 3: The Extraction algorithm in action.

ABSTRACTEXTRACTION is O(n²) over the total number of extraction rules generated by the automatic wrapper generation system. As in the case of ABSTRACTINTEGRATION, most of the time is spent computing distances between the extracted values.

The distances amongst physical attributes from several sources have so far been computed using an instance-based metric that relies on the availability of a set of shared instances between the involved sources. Therefore, if we want to compute the direct distance between two attributes d(ai, aj), with S(ai) = Si and S(aj) = Sj (Si ≠ Sj), we need a non-empty overlap of objects between Si and Sj; otherwise we consider d(ai, aj) = ∞.

To formalize this aspect, given a positive integer parameter q, let OVq,A(Si, Sj) be a predicate that is true iff |IiA ∩ IjA| ≥ q, i.e., both Si and Sj publish a value of the attribute A for a shared set of at least q instances. Intuitively, OVq,A(Si, Sj) is true if we consider Si and Sj to share enough instances (at least q) to be directly comparable on A's values. The value of q has to be chosen according to a trade-off: the higher the value, the more reliably the instance-based distance performs, as it can be computed over a larger set of shared instances; however, a lower number of sources will be directly comparable.

We are now ready to tackle the main issue: how do we compute the distance when the direct distance is not defined, i.e., when the overlap is not sufficient or not available at all and OVq,A(Si, Sj) does not hold?

We introduce the indirect distance by leveraging intermediate sources that share instances with two sources not directly having enough overlap. Given a third source Sw such that both OVq,A(Si, Sw) and OVq,A(Sw, Sj) hold, since the shortest path between two points is a straight line, we can write: d(ai, aj) ≤ d(ai, aw) + d(aw, aj). In this case, we have an upper bound for d(ai, aj) that we call indirect distance, based on the availability of the two direct distances between (Si, Sw) and between (Sw, Sj).

In the previous example we used just one intermediate source (Sw); the same principle can be trivially extended to a generic number of intermediate sources. However, the more intermediate sources are involved, the less precise is the bound imposed on d(ai, aj). In the case that we have multiple possible indirect distances, the smallest bound is chosen.

We call OV*q,A the transitive closure of OVq,A: OV*q,A(Si, Sj) is true iff it is possible to compute a distance (direct or indirect) between Si and Sj for the attribute A. If two attributes ai and aj are not comparable over the same conceptual attribute A, that is, OV*q,A(Si, Sj) is false, we set d(ai, aj) = ∞.

Let S(A) denote the set of websites in S publishing values of the conceptual attribute A ∈ A. A set of websites S is called extensionally redundant if the following property holds:

PROPERTY 5. Extensional redundancy:
∀A ∈ A : OV*q,A = S(A) × S(A)
(the overlap of the websites' objects allows the computation of indirect distances).

Essentially, it is required that the indirect distance can be computed between any pair of physical attributes. Observe that, analogously to rare-attributes, a concept of rare-instances could be introduced. However, the crawler [3] used during the experiments gathers extensionally redundant websites.

All the results previously obtained continue to hold with the new definition of distance, and they can be applied even in the presence of sources that do not contain all the objects I of the hidden relation T, provided that the input sources are extensionally redundant. However, in order to obtain a solution with indirect distances, we increase the complexity of the algorithms from quadratic to cubic, as we reduce the computation of the distance function to the problem of finding shortest paths in a graph by modeling physical attributes as nodes and the distances among them as weighted edges. This can be solved with the Floyd–Warshall algorithm [14].

5. EXPERIMENTAL EVALUATION
In this section we present a preliminary experimental evaluation of our model, conducted by using a special-purpose crawler [3] to collect 100 websites over three application domains: Soccer, Video-games, and Finance. Each source consists of tens to thousands of pages, and each page contains detailed data about one object of the corresponding entity. For each domain, we then selected the 20 largest sources and manually verified the hypotheses of our generative process.

Intensional Redundancy. We start by evaluating the redundancy at the schema level. We observe that in the soccer and video-game domains the majority of the conceptual attributes are not rare. In fact, in the soccer domain, only 25% of the attributes are rare, and they mostly come from websites that are not only about soccer players. For example, a website containing information about Olympic athletes exposes the attribute medals, while a club website exposes the debut date only for the players coming from its own youth academy. For the video-games the percentage of rare attributes is slightly over 30% of the total; in this case rare attributes come from distinct information with very similar semantics, such as difficulty and learning curve. Finally, in the stock quote scenario the percentage of rare attributes is over 40% of the total. This is not surprising, as in the financial domain there is a large number of attributes (89 for 20 sources), due to the presence of many indicators used for technical analysis. For this domain, 20 sites are sufficient only to get a very rough estimation of the set of attributes published by the sites of the domain. We expect that the percentage of rare attributes would significantly drop as the number of web sources increases.

Extensional Redundancy. Within the same domain, several objects are shared by several sources. The overlap is almost total for the stock quotes, while it is more articulated for soccer players and video-games, as these domains include both large popular sites and small ones. Over the 100 sources, we computed that each soccer player object appears on average in 1.6 sources, each video-game in 24.5 sources, and each stock quote in 92.8 sources. In particular, OVq,A(Si, Sj) is true with q = 5 for all the non-rare attributes for all the websites.

Errors and Thresholds. We manually verified the thresholds for the conceptual attributes of the three domains. As expected, these have very different values, depending on the domain and the attribute considered. As an example, in the finance domain a threshold of 0.023 is needed for the Max value of a stock quote (0.029 for the Min), while for the Volume a threshold of 0.5 is sufficient. In the soccer and video-games cases, the thresholds are higher, such as 0.44 for PlayerName (or video-game Title) and 0.36 for BirthCountry. More importantly, we were able to verify that the separable semantics property always holds.

6. RELATED WORK
Our techniques are related to projects on the integration of web data, such as PAYASYOUGO [12]. However, the proposed integration techniques are based on the availability of attribute labels, while our approach aims at integrating unlabeled data from web sites. TurboWrapper [8] has similar limits: it relies on the syntactic structure of attributes (e.g., the ISBN number for books) without considering the redundancy of information that occurs at the instance level.

The exploitation of structured web data is the primary goal of WebTables [6] and ListExtract [10], which concentrate on data published in HTML tables and lists, respectively. Compared to information extraction approaches, WebTables and ListExtract extract relations with involved relational schemas, but they do not address the issue of integrating the extracted data.

Cafarella et al. have described a system to populate a probabilistic database with data extracted from the web [4]. However, the data are retrieved by TextRunner [1], an information extraction system that is not targeted at data-rich web pages as ours is. Octopus [5] and Cimple [13] support users in the creation of data sets from web data by means of a set of operators to perform search, extraction, data cleaning, and integration. Although such systems have a more general application scope than ours, they involve users in the process, while our approach is completely automatic.

7. REFERENCES
[1] M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In IJCAI, 2007.
[2] L. Blanco, M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Redundancy-driven web data extraction and integration. In WebDB, 2010.
[3] L. Blanco, V. Crescenzi, P. Merialdo, and P. Papotti. Supporting the automatic construction of entity aware search engines. In WIDM, pages 149–156, 2008.
[4] M. J. Cafarella, O. Etzioni, and D. Suciu. Structured queries over web text. IEEE Data Eng. Bull., 29(4):45–51, 2006.
[5] M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data integration for the relational web. PVLDB, 2(1):1090–1101, 2009.
[6] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
[7] L. Chen, S. Ye, and X. Li. Template detection for large scale search engines. In SAC, pages 1094–1098, 2006.
[8] S.-L. Chuang, K. C.-C. Chang, and C. X. Zhai. Context-aware wrapping: Synchronized data extraction. In VLDB, pages 699–710, 2007.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.
[10] H. Elmeleegy, J. Madhavan, and A. Y. Halevy. Harvesting relational tables from lists on the web. PVLDB, 2(1):1078–1089, 2009.
[11] J. Madhavan, S. Cohen, X. L. Dong, A. Y. Halevy, S. R. Jeffery, D. Ko, and C. Yu. Web-scale data integration: You can afford to pay as you go. In CIDR, pages 342–350, 2007.
[12] A. D. Sarma, X. Dong, and A. Y. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD Conference, pages 861–874, 2008.
[13] W. Shen, P. DeRose, R. McCann, A. Doan, and R. Ramakrishnan. Toward best-effort information extraction. In SIGMOD Conference, pages 1031–1042, 2008.
[14] S. Skiena. The Algorithm Design Manual (2nd ed.). Springer, 2008.

APPENDIX
A. PROOFS
We start with a preliminary lemma needed to prove the correctness of the presented algorithms.

LEMMA A.1. INTRACLOSERTHANINTER:
∀er*i, er*j ∈ A1, erk ∈ A2 : d(V(er*i), V(er*j)) < d(V(er*i), V(erk)).

PROOF. The extraction rule erk can be correct or weak. We prove the lemma for the two cases:
1. < d(Vi[0, ..., n + 1], Vk[0, ..., n + 1]).

LEMMA A.2. ABSTRACTINTEGRATION is correct.

PROOF. When the separable semantics property holds, the weights of the edges among attributes with different semantics are always higher than the weights of the edges among attributes with the same semantics. This implies that the edges in L are divided in two sublists. In the first sublist (lower weights) we have pairs of attributes
erk is correct (erk∗ ): consider eri∗ , erj∗ ∈ A1 and erk∗ ∈ A2 that have the same semantics. We can therefore add to the solu- when the property separable semantics holds. tion all the pairs in the first sublist. In the second sublist (higher By definition: ∀eri∗ , erj∗ ∈ A1 ∃ tA1 : d(V (eri∗ ), V (erj∗ )) < weights) we have pairs of attributes with different semantics and tA1 we need to avoid to add an edge from this sublist to the solution. Separable semantics: ∀A1 , A2 , eri∗ ∈ A1 , erk∗ ∈ A2 : i 6= The problem here is that it is not know a-priori where the second k ∧ A1 6= A2 ⇒ d(V (eri∗ ), V (erk∗ )) > max(tA1 , tA2 ) sublist starts. But we know that when the algorithm gets to the first We can derive: edge of the second sublist, all and only the attributes with same semantics have been grouped in mappings. d(V (eri∗ ), V (erj∗ )) < tA1 ≤ max(tA1 , tA2 ) < Therefore the partial solution is correct. We now need to show that the algorithm stops at the first edge of < d(V (eri∗ ), V (erk∗ )). the second sublist. The first edge in the second sublist is an edge Therefore: between two mappings m1 , m2 with different semantics. When the property intensional redundancy holds, there must be a source d(V (eri∗ ), V (erj∗ )) < d(V (eri∗ ), V (erk∗ )). which publishes two attributes ai , aj such that they are contained in m1 and m2 , respectively. By the local consistency of sources there 2. erk is weak (erkw ): in the following we treat single values cannot be a mapping that contains ai , aj , and therefore the first as singleton vectors that we denote with V [i, . . . , j] the sub- edge of the second sublist is detected and the algorithm ends. vector of values for V from index i (included) to index j (ex- cluded). We first introduce a monotonicity property of the L EMMA A.3. A BSTRACT E XTRACTION is correct. distance function. Given two vectors V1 and V2 with n values and a distance d(V1 , V2 ) between them, let V20 be a copy of P ROOF. 
In any iteration of step (a) we select two correct ex- V2 . If we replace the i-th element V2 [i] with a new element traction rules er1∗ , er2∗ ∈ A1 . This is equivalent to show that if V2 [i]0 such that d(V1 [i], V2 [i]) < d(V1 [i], V2 [i]0 ) it follows we list the pairs of extraction rules in ascending order, the first that d(V1 , V2 ) < d(V1 , V20 ).4 pair is certainly one with correct extraction rules. Suppose, by ab- In this second case erkw is a weak rule, that is, it can po- surd, that the first pair contains a weak rule. This contradicts the tentially contains values taken from A1 , A2 , or any other I NTRAC LOSERT HAN I NTER Lemma. A. We consider the instance-aligned vectors of values Vk0 = Every time two correct extraction rules er1∗ or er2∗ are chosen, all V (erkw ), Vi0 = V (eri∗ ) and Vj0 = V (erj∗ ) and we remove the weak rules containing at least a value in common with er1∗ or from the analysis the instances where erkw , eri∗ , and erj∗ ex- er2∗ are removed (steps (c) and (d)). Therefore, after the algorithm tract the same value: let Vk , Vi and Vj be the vectors with the has chosen all the correct rules, there cannot be a weak rule in W as remaining values. As errw cannot contain only values coming weak rules mix values shared with correct rules and they have been from A1 (otherwise it would not be a weak rule, but a cor- discarded as soon as the correct rules have been identified. rect extraction rule of A1 ) the length of these vectors must be greater than zero, and notice also that Vk0 now does not con- tain any value coming from A1 (they have been all removed). We show now by induction on the length of the vectors that d(V (eri∗ ), V (erj∗ )) < d(V (eri∗ ), V (erkw )). Base case: let Vk [0] be the first value for Vk . We know that it is a correct value for a conceptual attribute different from A1 . Therefore, for the property we just showed in the previous case: d(Vi [0], Vj [0]) < d(Vi [0], Vk [0]). 
Inductive step: the inductive hypothesis is d(Vi [0, . . . , n], Vj [0, . . . , n]) < d(Vi [0, . . . , n], Vk [0, . . . , n]). We show that it is true for n + 1 elements of the vectors. Again, for the property we just showed d(Vi [n + 1], Vj [n + 1]) < d(Vi [n + 1], Vk [n + 1]) holds. For the monotonicity property of the distance function, it is true that d(Vi [0, . . . , n + 1], Vj [0, . . . , n + 1]) < 4 This is a natural property of the Euclidean distance.
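The reduction of indirect distances to all-pairs shortest paths, solved with the Floyd–Warshall algorithm [14], can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attribute names and edge weights are hypothetical, and missing direct distances are modeled as infinite weights.

```python
from itertools import product

def indirect_distances(nodes, direct):
    """All-pairs shortest paths (Floyd-Warshall) over a graph whose
    nodes are physical attributes and whose weighted edges are the
    directly computed distances between their value vectors."""
    INF = float("inf")
    # Initialize the distance matrix with the (symmetric) direct distances.
    d = {(u, v): 0.0 if u == v else direct.get((u, v), direct.get((v, u), INF))
         for u, v in product(nodes, repeat=2)}
    # Cubic relaxation: allow each attribute k as an intermediate hop.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[(i, k)] + d[(k, j)] < d[(i, j)]:
                    d[(i, j)] = d[(i, k)] + d[(k, j)]
    return d

# Hypothetical physical attributes from three sources; s1.price and
# s3.price have no direct distance, but become comparable indirectly
# through s2.price (0.1 + 0.2 = 0.3).
nodes = ["s1.price", "s2.price", "s3.price"]
direct = {("s1.price", "s2.price"): 0.1, ("s2.price", "s3.price"): 0.2}
d = indirect_distances(nodes, direct)
```

The three nested loops make the cost cubic in the number of physical attributes, matching the complexity increase discussed above.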
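The grouping behavior whose correctness Lemma A.2 establishes — scan the edge list L in ascending weight order, merge attribute pairs into mappings, and stop at the first edge that would put two attributes of the same source into one mapping (local consistency) — can be sketched as follows. This is a hedged illustration, not the paper's ABSTRACTINTEGRATION implementation; the attribute names, sources, and weights are hypothetical.

```python
def integrate(edges, source_of):
    """Greedy grouping of attributes into mappings.
    edges: list of (weight, attr_a, attr_b); source_of: attr -> source id.
    Scans edges by ascending weight and halts at the first edge whose
    merge would violate local consistency (two attributes from the same
    source), i.e., the first edge of the higher-weight sublist."""
    mappings = {a: {a} for _, x, y in edges for a in (x, y)}
    for weight, a, b in sorted(edges):
        ma, mb = mappings[a], mappings[b]
        if ma is mb:
            continue  # already in the same mapping
        merged = ma | mb
        sources = [source_of[x] for x in merged]
        if len(sources) != len(set(sources)):
            break  # first cross-semantics edge: stop
        for x in merged:
            mappings[x] = merged
    return {frozenset(m) for m in mappings.values()}

# Hypothetical attributes from two sources: the two name attributes and the
# two price attributes are merged; the heavy name/price edge triggers the stop.
source_of = {"s1.name": "s1", "s2.name": "s2",
             "s1.price": "s1", "s2.price": "s2"}
edges = [(0.9, "s1.name", "s2.price"),
         (0.1, "s1.name", "s2.name"),
         (0.15, "s1.price", "s2.price")]
result = integrate(edges, source_of)
```

Under the separable semantics and intensional redundancy assumptions of the lemma, the stopping edge always separates the same-semantics sublist from the rest, which is exactly the argument of the proof.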