<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Contextual Data Extraction and Instance-Based Integration</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi Roma Tre</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a formal framework for an unsupervised approach tackling two problems at the same time: the data extraction problem, i.e., generating the extraction rules needed to obtain data from web pages, and the data integration problem, i.e., integrating the data coming from several sources. We motivate the approach by discussing its advantages with regard to the traditional “waterfall approach”, in which data are wholly extracted before the integration starts, without any mutual dependency between the two tasks. In this paper, we focus on data that are exposed by structured and redundant web sources. We introduce novel polynomial algorithms to solve the stated problems and present theoretical results on the properties of the solution generated by our approach. Finally, a preliminary experimental evaluation shows the applicability of our model with real-world websites.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The development of scalable techniques to extract and integrate
data from fairly structured large corpora available on the web is a
challenging issue, because the web scale imposes the use of
unsupervised and domain independent techniques. To cope with the
complexity and the heterogeneity of web data, state-of-the-art
approaches focus on information organized according to specific
patterns that frequently occur on the web. Meaningful examples are
presented in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which focuses on data published in HTML tables,
and in information extraction systems, such as TextRunner [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which
exploits lexical-syntactic patterns. As noticed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], even if only a small
fraction of the web is organized according to these patterns, due to
the web scale the amount of data involved is impressive: in their
case, more than 154 million tables were extracted from 1.1% of
the considered pages.
      </p>
      <p>In large data-intensive websites, we observe two important
characteristics that suggest new opportunities for the automatic
extraction and integration of web data:</p>
      <p>Local regularities: in these sites, large amounts of data are
usually offered by thousands of pages, each encoding one
tuple in a local HTML template. For example, each page
shown in Figure 1 comes from a different source and
publishes information about a single company stock.</p>
      <p>Global information redundancy: at the web scale many sources
provide similar information. The redundancy occurs both at
the schema level (the same attributes are published by more
than one source) and at the instance level (some objects are
published by more than one source). In our example, many
attributes are present in all the sources (e.g., the company
name, last trade price, volume), while others are published
by a subset of the sources (e.g., the “Beta” indicator). At the
extensional level, there is a set of stock quotes that are
published by multiple sources. As web information is inherently
imprecise, redundancy also implies inconsistencies; that is,
sources can provide conflicting information for the same
object (e.g., a different value for the volume of a given stock).</p>
      <p>These observations lead us to focus on pages that are published
following the one-tuple-per-page pattern: in each structured page
one can find information about a single tuple. If we abstract this
representation, we may say that a collection of structurally similar
pages provided by the same site corresponds to a relation.
According to this abstraction, the websites of the pages in Figure 1 expose
their own version of the “StockQuote” relation.<sup>1</sup></p>
      <p>
        Starting from the crawled web pages (for instance by using the
specialized crawler introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), our goal is: (i) to transform the
web pages coming from each source into a relation, and (ii) to
integrate these relations, creating a database containing the information
provided by all the sources. A state-of-the-art solution to this
problem is a two-step waterfall approach, where a schema matching
algorithm is applied over the relations returned by automatically
generated wrappers. However, when a large number of sources is
involved and a high level of automation is required, important
issues may arise:
      </p>
      <p>
        Wrapper Inference Issues: since wrappers are automatically
generated by an unsupervised process, they can produce
imprecise extraction rules (e.g., rules that extract irrelevant
information mixed with data of the domain). To obtain correct
rules, the wrappers should be evaluated and refined manually.
      </p>
      <p>
        Integration Issues: the relations extracted by automatically
generated wrappers are “opaque”, i.e., their attributes are
not associated with any (reliable) semantic label. Therefore
the matching algorithm must rely on an instance-based
approach, which considers attribute values to match schemas.
However, due to errors introduced by the publishing process,
instance-based matching is challenging because the sources
may provide conflicting values. Also, imprecise extraction
rules return wrong, and thus inconsistent, data.
      </p>
      <p>
        <sup>1</sup>For the sake of simplicity, we consider only the
one-tuple-per-page pattern. Other variations of this pattern can easily be
handled, for example by preprocessing the pages with tools that
fragment HTML tables into rows [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] we presented a best-effort solution to solve these issues
by taking advantage of the mutual coupling between the wrapper
inference and the data integration tasks. In the present paper, we
investigate the foundations of the problem and propose a
principled solution based on the following contributions: (i) we propose
a formal setting to state the data extraction and the data
integration problems for redundant and structured web sources; (ii) we
formulate a set of hypotheses that capture a few natural constraints
that characterize this kind of web sources; (iii) we propose novel
unsupervised polynomial algorithms to solve the stated problems
whenever these hypotheses hold; (iv) we present an experimental
evaluation of our model with real-world websites.
      </p>
      <p>In the next section we describe a generative model of the web
pages to introduce our formal setting of structured and redundant
sources. Section 3 contains the algorithms to solve the
integration and extraction problems in the case of perfectly overlapping
sources. In Section 4 we show how removing this hypothesis
affects our solution, while in Section 5 we discuss preliminary
experiments with a set of sources gathered from the Web. Section 6
discusses related work and concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>THE GENERATIVE MODEL</title>
      <p>We are interested in extracting and integrating all the available
information about a target entity, such as the STOCKQUOTE entity
of our running example. As on the Web several sources publish
information about the same entity, we can imagine that there exists
a hidden relation T , which contains all the true information about
the objects that belong to the entity, and that sources generate their
pages by taking data from T .</p>
      <p>We call conceptual instances the set of tuples I of the relation T.
Each tuple in I represents a real-world object of the target entity
of interest. For example, in the case of the STOCKQUOTE entity,
the conceptual instances of T model the data about the Apple stock
quote, the Yahoo! stock quote, and so on. T has a set of attributes
A called conceptual attributes. In our example they represent the
attributes associated with a stock quote, such as the company name,
the current trade price, the volume, and so on.</p>
      <p>Given a set of sources S = {S1, ..., Sm}, each source Si, i =
1, ..., m, can be seen as the result of a generative process applied
over the hidden relation T.</p>
      <p>The attributes published by a source are called physical
attributes, as opposed to the conceptual attributes of T. We
write S(a) to denote that a source S publishes a physical attribute
a, and a ∈ A to state that a physical attribute a contains data from
a conceptual attribute A.</p>
      <p>A source publishes information about a subset of the conceptual
instances, and different sources may publish different subsets of the
conceptual attributes.</p>
      <p>To model the presence of conflicting data that usually occur among
redundant sources, we assume that sources are noisy: they may
introduce errors, imprecise or null values, over the data picked from
the hidden relation. As depicted in Figure 2, for each source Si we
can abstract the page generation process as the application of the
following operators over the hidden relation:</p>
      <p>Selection σi: returns a relation containing a subset of the
conceptual instances, σi(I) ⊆ I.</p>
      <p>Projection πi: returns a relation containing a subset of the
conceptual attributes, πi(A) ⊆ A.</p>
      <p>Error ei: is a function that returns a relation, such that each
correct value is kept or replaced with a null value, a synthetic
value, or a value similar to the correct one.</p>
      <p>Encode λi: is an encoding function that produces a web page
for each tuple by embedding its values into an HTML
template.</p>
      <p>The set of pages published by a source Si can be thought of as a
view over the hidden relation, obtained by applying the above
operators as follows: Si = λi(ei(πi(σi(T)))). From this perspective,
the extraction and the integration problems can be stated in terms
of these operators. The extraction becomes the inversion of the λ
operator, that is, obtaining for each source Si the associated
relation Vi = ei(πi(σi(T))). The integration becomes the problem of
reconstructing T from the views associated with the sources.</p>
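      <p>To make the generative model concrete, the following Python sketch (ours, with made-up sample data, a toy error model, and a plain-text stand-in for the HTML template) simulates how a source Si derives its pages from the hidden relation T:</p>
      <preformat>
import random

# Hidden relation T: conceptual instances with conceptual attributes.
T = [
    {"name": "Apple Inc.", "price": 317.13, "volume": 16709011},
    {"name": "Yahoo! Inc.", "price": 16.37, "volume": 24457921},
]

def select(instances, keep):      # sigma: a subset of the conceptual instances
    return [t for t in instances if t["name"] in keep]

def project(instances, attrs):    # pi: a subset of the conceptual attributes
    return [{a: t[a] for a in attrs} for t in instances]

def error(instances, p_err=0.1):  # e: noisy publishing of the picked values
    noisy = []
    for t in instances:
        t = dict(t)
        for a, v in t.items():
            if isinstance(v, float) and p_err > random.random():
                # replace with a value similar to the correct one
                t[a] = round(v * (1 + random.uniform(-0.01, 0.01)), 2)
        noisy.append(t)
    return noisy

def encode(instances):            # lambda: embeds each tuple in a (toy) template
    return ["|".join(f"{a}={v}" for a, v in t.items()) for t in instances]

# S_i = lambda_i(e_i(pi_i(sigma_i(T))))
pages = encode(error(project(select(T, {"Apple Inc."}), ["name", "price"])))
print(pages[0])   # e.g. name=Apple Inc.|price=317.13
      </preformat>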
      <p>Notice that both problems are far from trivial, as
state-of-the-art automatic wrapper inference systems are not able to
create perfect wrappers, and the integration task is further complicated
by the presence of errors and the absence of reliable semantic
labels.</p>
      <p>In the following we discuss under which assumptions on the
intensional and extensional redundancy exhibited by the sources our
approach is able to deal with a bounded amount of error.</p>
    </sec>
    <sec id="sec-3">
      <title>Intensional Redundancy</title>
      <p>We now discuss two properties of the generative model. The first
property expresses that the data published by each source are
locally consistent. That is, within the physical attributes published by
the same source, there cannot be distinct attributes with the same
semantics. For example, if a website states that the current value
of the stock quote “YHOO” is 17.01 there cannot be another place
in the same site where you can find a different value with the same
semantics. Therefore, we can write:</p>
      <sec id="sec-3-1">
        <title>PROPERTY 1. Local consistency:</title>
        <p>∀ ai, aj ∈ A : S(ai) = S(aj) ⇒ ai = aj
(in a conceptual attribute A there cannot exist two physical
attributes coming from the same source).</p>
        <p>The second property formalizes the presence of redundancy at
the intensional level. Namely, we assume that every possible pair
of conceptual attributes is published by at least one source. Let
S(Ai) denote a predicate that returns true if the source S publishes
the conceptual attribute Ai. Then, a set of sources S is called
intensionally redundant if the following property holds:</p>
      </sec>
      <sec id="sec-3-2">
        <title>PROPERTY 2. Intensional redundancy:</title>
        <p>∀ Ai, Aj : i ≠ j, ∃ S : S(Ai) ∧ S(Aj)
(every possible pair of conceptual attributes is published by at least
one source).</p>
        <p>Notice that given any set of websites, this property may hold
only for appropriately chosen subsets. However, the web scale
represents an opportunity to gain enough redundancy: (i) in each
domain, usually there is a subset of attributes that is published by
most of the sources;<sup>2</sup> (ii) as the number of considered websites
increases, the probability of meeting a new conceptual attribute
decreases, and the probability of intensional redundancy increases.</p>
        <p>In the following we call core-attributes the attributes for which
the intensional redundancy holds, and we call rare-attributes all the
others.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Extensional Redundancy</title>
      <p>At the extensional level, we assume that sources publish a
common set of instances. For the sake of presentation, in the next
section we rely on the hypothesis that all the sources publish the same
set of instances I. Later, in Section 4, we discuss how to remove
this hypothesis.</p>
    </sec>
    <sec id="sec-5">
      <title>EXTRACTION AND INTEGRATION ALGORITHMS</title>
      <p>Our solutions to the extraction and integration problems are
discussed in this section: we propose the problem statements, the
definition of solutions, and the algorithms that solve them in polynomial
time.</p>
    </sec>
    <sec id="sec-7">
      <title>Integration Problem</title>
      <p>
        We first discuss the integration problem by ignoring the extraction
issues, then we discuss the main properties of the extraction rules,
and, finally, how the integration and the extraction problems can
be tackled together. Therefore, for the time being we assume we
have a wrapper generator that is capable of perfectly inverting the
encode operator λ. In other words, we do not work on web pages,
but directly on the views of T published by the sources.
      </p>
      <p>
        <sup>2</sup>In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] the authors write “there is a core set of attributes that
appear in a large number of items”.
      </p>
      <p>Given a set of sources S, each Si publishes a view of the
hidden relation T such that Vi = ei(πi(σi(T))). The integration
problem can be thought of as the creation of sets of physical attributes
m1, ..., mn, called mappings, such that each attribute a belongs to
a mapping m and each mapping contains all and only the attributes
with the same semantics. The problem can be defined as follows:</p>
      <p>PROBLEM 1. Integration Problem: given a set of source views
V = {V1, ..., Vn}, where Vi = ei(πi(σi(T))), find a set of
mappings M such that M = {m : a1, a2 ∈ m ⇔ a1, a2 ∈ A}.</p>
      <p>Intuitively, we solve the problem by evaluating the physical
attributes of each source and by building aggregations of attributes
with the same semantics from the sources. If at the end of the
process each mapping contains all and only the physical attributes with
the same semantics, we have a solution for the problem. For
example, given a1, a3 ∈ V1 and a2 ∈ V2 with a1, a2 having the same
semantics, a solution is m1 = {a1, a2} and m2 = {a3}.</p>
      <p>If the sources published correct data only, then a naive greedy
algorithm would easily solve the problem above. However, real sources
introduce noise in the values (modeled by the error function e), which
can make the integration difficult or even impossible.</p>
      <p>To identify physical attributes with the same semantics, we rely
on a distance function d(ai, aj) among the values of a set of
instances corresponding to the same real-world object.<sup>3</sup> This
function compares aligned values and returns a score between 0 and 1:
the more similar the values, the lower the distance. As the
distance function works by comparing values of aligned instances,
it can be easily extended to work also on conceptual attributes. We
denote by d(a, A) the distance between the values of the physical
attribute a and the values of the conceptual attribute A.</p>
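      <p>One possible instantiation of the distance d, assuming the instances have already been aligned by record linkage [<xref ref-type="bibr" rid="ref9">9</xref>], is sketched below in Python; the per-value metric and the averaging are our own illustrative choices, not the definition used in the paper:</p>
      <preformat>
def value_distance(u, v):
    """Distance between two aligned values in [0, 1] (illustrative metric)."""
    if u is None or v is None:
        return 1.0
    try:   # numeric values: relative difference, capped at 1
        a, b = float(u), float(v)
        return min(abs(a - b) / max(abs(a), abs(b), 1e-9), 1.0)
    except ValueError:   # strings: one minus the character-set overlap (toy choice)
        s, t = set(str(u)), set(str(v))
        return 1.0 - len(s.intersection(t)) / max(len(s.union(t)), 1)

def attribute_distance(ai_values, aj_values):
    """d(ai, aj): average value distance over instance-aligned value vectors."""
    pairs = list(zip(ai_values, aj_values))
    return sum(value_distance(u, v) for u, v in pairs) / len(pairs)
      </preformat>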
      <p>In the following, we study a class of error functions for which
our algorithm can compute a solution by bounding the amount of
errors that sources are allowed to introduce.</p>
      <p>
        For every conceptual attribute A, let tA denote a minimal
threshold such that any physical attribute ai belongs to A if for each
aj ∈ A, d(ai, aj) &lt; tA. For example, in the finance domain,
a low threshold should be associated with the conceptual attribute
Max value of a stock quote. This is required as there are other
conceptual attributes, like the current Price and the Min value, that
have similar values. On the other hand, the mapping for the trading
Volume conceptual attribute can have a higher threshold since it
does not usually assume values close to those of other attributes.
Note that tA is an ideal unknown threshold: it is not given as input
of the integration problem and it is not necessary to know it a priori
to compute the solution.
      </p>
      <p>
        <sup>3</sup>We assume that instances can be aligned by applying a standard
record-linkage technique [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>In order to solve the integration problem, it is required that the
publishing errors do not introduce enough noise to confuse the
semantics of a physical attribute. We call this property separable
semantics:</p>
      <sec id="sec-7-1">
        <title>PROPERTY 3. Separable semantics:</title>
        <p>∀ A1, A2, ai ∈ A1, aj ∈ A2 : ai ≠ aj ∧ A1 ≠ A2 ⇒
d(ai, aj) &gt; max(tA1, tA2)
(it is possible to distinguish the semantics of physical attributes).</p>
        <p>In order to solve the integration problem with noisy sources, we
define the greedy clustering Algorithm 1.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Algorithm 1 ABSTRACTINTEGRATION</title>
        <p>Input: A set of locally consistent, intensionally redundant sources
with separable physical attributes.</p>
        <p>Output: The correct set of mappings M.</p>
        <p>Let G = (N, E) be a graph where every attribute ai of every
source Si ∈ S is a node n ∈ N. For every pair of distinct nodes
ai, aj ∈ N such that S(ai) ≠ S(aj), add an edge e between them
to E and let d(ai, aj) be the weight of e.
Let m(ai) be the mapping containing the attribute ai.</p>
        <p>1. Add to M a mapping m = {ai} for each node ni ∈ N;
2. insert the edges E in a list L;
3. sort L by the weight of the edges in ascending order;
4. for each edge (a1, a2) ∈ L:
(a) let m be the union of the attributes in m(a1) and m(a2);
(b) if in m there is no pair ai, aj such that S(ai) = S(aj)
(c) then add m to M and remove m(a1), m(a2) from M;
(d) else break.</p>
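        <p>A compact Python rendition of ABSTRACTINTEGRATION is sketched below (ours; physical attributes are modeled as (source, name) pairs and d is an instance-based distance such as the one sketched earlier):</p>
        <preformat>
from itertools import combinations

def abstract_integration(attributes, d):
    """attributes: list of (source, name) pairs.
    d(a1, a2): instance-based distance between two physical attributes.
    Returns the set of mappings M as a list of attribute sets."""
    # 1. one singleton mapping per attribute
    mapping_of = {a: frozenset([a]) for a in attributes}
    # 2.-3. build the inter-source edges and sort them by ascending weight
    edges = sorted(
        ((d(a1, a2), a1, a2)
         for a1, a2 in combinations(attributes, 2)
         if a1[0] != a2[0]),                    # only edges across sources
        key=lambda e: e[0])
    # 4. greedily merge; stop at the first local-consistency violation
    for _, a1, a2 in edges:
        m = mapping_of[a1] | mapping_of[a2]
        sources = [source for source, name in m]
        if len(sources) == len(set(sources)):   # no two attributes per source
            for a in m:
                mapping_of[a] = m
        else:
            break                               # first edge of the second sublist
    return list(set(mapping_of.values()))
        </preformat>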
        <p>We are now ready to prove the correctness of the integration
algorithm.</p>
      </sec>
      <sec id="sec-7-5">
        <title>LEMMA 3.1. ABSTRACTINTEGRATION is correct.</title>
        <p>PROOF. The proof is in Appendix A.</p>
        <p>ABSTRACTINTEGRATION is O(n²) in the total number of
physical attributes; in fact, most of the time is required to create
the edges of the graph G.</p>
        <p>In the following we introduce the extraction problem, that is,
how to obtain the physical attributes we considered as input of the
integration problem.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Extraction Rules</title>
      <p>In our framework, a data source S is a collection of pages S =
{p1, ..., pn} from the same website, such that each page publishes
information about one object of a real-world entity of interest.</p>
      <p>We distinguish between two different types of values that can
appear in a page: target values, that is, values that are derived from
the hidden relation T, and noise values, that is, values that are not
of interest for our purpose (e.g., advertising, template, layout, etc.).</p>
      <p>We consider as given an unsupervised wrapper generator
system W. A wrapper w is an ordered set of extraction rules, w =
{er1, ..., erk}, that apply over a web page: each rule extracts a
string from the HTML of the page. We denote by er(p) the string
returned by the application of the rule er over the page p. The
application of a wrapper w over a page p, denoted w(p), returns a tuple
t = ⟨er1(p), ..., erk(p)⟩; therefore, the application of a wrapper
over the set of pages of a source S returns a relation w(S), which
has as many attributes as the number of extraction rules of the
wrapper. A column of the relation is a vector of values denoted V(eri):
it is the result of the application of an extraction rule eri over the
pages of a source.</p>
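      <p>In code, a wrapper can be viewed as an ordered list of functions from a page to a string. The following minimal sketch (ours; plain-text pages and regular expressions stand in for real HTML pages and extraction rules) illustrates w(p) and w(S):</p>
      <preformat>
import re

# An extraction rule maps a page to a string, or None if the value is missing.
# Regular expressions stand in for real rules (e.g., XPath expressions).
wrapper = [
    lambda p: (m.group(1) if (m := re.search(r"Name: (\S+)", p)) else None),
    lambda p: (m.group(1) if (m := re.search(r"Price: ([\d.]+)", p)) else None),
]

def apply_wrapper(wrapper, pages):
    """w(S): one tuple per page, one column V(er_i) per extraction rule."""
    return [tuple(er(p) for er in wrapper) for p in pages]

relation = apply_wrapper(wrapper, ["Name: AAPL Price: 317.13"])
print(relation)   # [('AAPL', '317.13')]
      </preformat>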
      <p>We say that an extraction rule er is correct if for every given
page it extracts a value of the same conceptual attribute (i.e., target
values with the same semantics) or a null value if the value for the
attribute is missing in that page. A rule that consistently
extracts only noise values is considered noisy. We also say that an
extraction rule erw is weak if it mixes either target values with
different semantics or target values with noise values.</p>
      <p>Unsupervised wrapper generators are powerful enough to infer
the correct extraction rules needed to cover the data exposed by
what we call regularly structured websites.</p>
      <p>PROPERTY 4. Regularly structured sources: The sources S =
{Si} are regularly structured w.r.t. a given unsupervised wrapper
generator system W, if W generates for each source Si ∈ S a set
of rules wi containing all the correct rules.</p>
      <p>However, wrapper generators cannot automatically identify, among
the generated rules, which are the correct ones. They also produce
weak rules, since at wrapper generation time there is not enough
information to automatically establish whether a rule is correct or
weak. The integration algorithm has been presented considering
only the correct rules (i.e., physical attributes). However, noisy
rules, if considered along with the correct ones, are harmless, as
they can be identified and deleted during the integration step. They
will eventually generate singleton mappings, since the
distance between a noisy rule and a correct rule prevents them from
being grouped together. Similar arguments apply to distances amongst noisy
rules.</p>
      <p>Weak rules require a more detailed discussion; unfortunately,
they cannot be identified and disregarded at the wrapper generation
step. In the following we show that, if we keep the same
assumptions introduced for the integration problem, we can always identify
weak rules during the integration step.</p>
    </sec>
    <sec id="sec-9">
      <title>Extraction Problem</title>
      <sec id="sec-9-1">
        <title>PROBLEM 2. Extraction Problem:</title>
        <p>Given a set of sources S = {Si}, produce a set of wrappers
W = {wi}, such that each wi contains all and only the correct
rules for Si.</p>
        <p>We now describe how we leverage the redundant information
among different sources to identify and filter out the weak rules.
Let eri and erj be two extraction rules. We say that two extraction
rules “overlap” if they extract from a page the same occurrence of
the same string. In this case, one of them must be a weak rule. In
other terms, if many rules extract the same value occurrence
from at least one page, only one of them is a correct rule and all the
others are weak ones. With an abuse of notation, we write
er ∈ A to state that an extraction rule extracts at least one correct
value of the conceptual attribute A. Notice that, as a weak rule
erw can extract values from n conceptual attributes, we can say
erw ∈ A1, ..., An.</p>
        <p>These intuitions are applied in the following greedy algorithm
which solves the extraction problem:</p>
      </sec>
      <sec id="sec-9-2">
        <title>Algorithm 2 ABSTRACTEXTRACTION</title>
        <p>Input: A set of locally consistent, intensionally redundant sources
with separable physical attributes; the set of wrappers W produced
by a wrapper generator system W w.r.t. which the sources are
regularly structured.</p>
        <p>Output: A set of wrappers W* that do not contain weak rules.
1. while there is an er ∈ W which is not marked as correct:
(a) let d(V(eri), V(erj)) be the minimal distance between
the values of two extraction rules in W such that at least one
of them is not marked as correct;
(b) mark eri and erj as correct (they are correct rules);
(c) remove from W all the rules that overlap with eri
(they are weak rules);
(d) remove from W all the rules that overlap with erj
(they are weak rules);
2. now W* is W.</p>
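        <p>The following Python sketch mirrors ABSTRACTEXTRACTION (ours; d is the instance-based distance between the value vectors of two rules, and overlaps tests whether two rules extract the same occurrence of the same string on at least one page):</p>
        <preformat>
from itertools import combinations

def abstract_extraction(rules, d, overlaps):
    """rules: the extraction rules of all the generated wrappers.
    d(r1, r2): distance between their value vectors.
    overlaps(r1, r2): True iff the rules extract the same occurrence
    of the same string on at least one page.
    Returns the surviving rules (weak rules removed)."""
    active = set(rules)
    correct = set()
    while active - correct:
        # (a) closest pair with at least one rule not yet marked as correct
        candidates = [(r1, r2) for r1, r2 in combinations(active, 2)
                      if r1 not in correct or r2 not in correct]
        if not candidates:
            break
        r1, r2 = min(candidates, key=lambda pair: d(*pair))
        # (b) both rules of the closest pair are correct
        correct.update((r1, r2))
        # (c)-(d) any rule overlapping a correct one is weak: drop it
        active = {r for r in active
                  if r in correct or not (overlaps(r, r1) or overlaps(r, r2))}
    return active
        </preformat>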
        <p>The algorithm ABSTRACTEXTRACTION takes as input a set of
wrappers W and computes W*, which does not contain weak rules.
To explain the algorithm, we rely on the example in Figure 3.
Consider two websites and two objects (say, two stock quotes) that are
published by both sites. In the figure, a circle represents the values
extracted by an extraction rule and its number represents the
website it has been executed on. For example, in the diagram on the
left, the dark circle marked with 1 extracts from the first website the
values 10 for the first object and 11 for the second object. Notice
that the input of the algorithm is the circles annotated with the
extracted values and the website of provenance. Let us say now that in
step (a) the algorithm identifies the dark circles as the closest ones,
and marks them as correct in step (b). At this point both the circles
with horizontal stripes overlap with correct rules and can therefore
be removed in steps (c) and (d). The diagram on the right
shows the resulting scenario: the dark circles are now the closest
rules and are marked as correct. The remaining dashed circles do
not match at all (i.e., they are noisy rules) and lead to the creation
of two singleton mappings. To prove that the above algorithm is
correct we introduce the following lemma:</p>
      </sec>
      <sec id="sec-9-3">
        <title>LEMMA 3.2. ABSTRACTEXTRACTION is correct.</title>
        <p>PROOF. The proof is in Appendix A.</p>
        <p>ABSTRACTEXTRACTION is O(n²) over the total number of
extraction rules generated by the automatic wrapper generation
system. As in the case of ABSTRACTINTEGRATION, most of the
time is spent computing distances between the extracted values.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>NON-OVERLAPPING SOURCES</title>
      <p>So far we have simplified the discussion by hypothesizing that
every source publishes data about every object in I. In this section
we remove this simplification and use IiA to denote the subset of I
for which Si provides values of the conceptual attribute A.</p>
      <p>The distances amongst physical attributes from several sources
have been computed using an instance-based metric that relies on
the availability of a set of shared instances between the involved
sources. Therefore, if we want to compute the direct distance
between two attributes d(ai, aj), with S(ai) = Si and S(aj) = Sj,
we need a non-empty overlap of objects between Si and Sj (Si ≠
Sj); otherwise we consider d(ai, aj) = ∞.</p>
      <p>To formalize this aspect, given a positive integer parameter
q, let OVq,A(Si, Sj) be a predicate that is true iff |IiA ∩ IjA| ≥ q, i.e.,
both Si and Sj publish a value of the attribute A for a shared set of
at least q instances.</p>
      <p>Intuitively, OVq,A(Si, Sj) is true if we consider Si and Sj to
share enough instances (at least q) to be directly comparable on A’s
values. The value of q has to be chosen according to a trade-off:
the higher the value, the more reliably the instance-based distance
performs, as it can be computed over a larger set of shared
instances; however, fewer sources will be directly comparable.</p>
      <p>We are now ready to tackle the main issue: how to compute
the distance when the direct distance is not defined, i.e., when the overlap
is not sufficient or not available at all and OVq,A(Si, Sj) does not
hold.</p>
      <p>We introduce the indirect distance by leveraging
intermediate sources sharing instances with two sources that do not
have enough direct overlap. Given a third source Sw, such that both
OVq,A(Si, Sw) and OVq,A(Sw, Sj) hold, since the shortest path
between two points is a straight line, we can write: d(ai, aj) ≤
d(ai, aw) + d(aw, aj). In this case, we have an upper bound for
d(ai, aj) that we call indirect distance, based on the availability of
two direct distances between (Si, Sw) and between (Sw, Sj).</p>
      <p>In the previous example we used just one intermediate source
(Sw); the same principle can be trivially extended to a generic
number of intermediate sources. However, the more
intermediate sources are involved, the less precise is the bound on
d(ai, aj). In the case that we have multiple possible indirect
distances, we choose the smallest bound.</p>
      <p>We call OV*q,A the transitive closure of OVq,A: OV*q,A(Si, Sj)
is true iff it is possible to compute a distance (direct or indirect)
between Si, Sj for the attribute A. If two attributes ai and aj
are not comparable over the same conceptual attribute A, that is,
OV*q,A(Si, Sj) is false, we set d(ai, aj) = ∞.</p>
      <p>Let S(A) denote the set of websites in S publishing values of
the conceptual attribute A ∈ A: a set of websites S is called
extensionally redundant if the following property holds:</p>
      <sec id="sec-10-1">
        <title>PROPERTY 5. Extensional redundancy:</title>
        <p>∀ A ∈ A : OV*q,A = S(A) × S(A)
(the overlap of the websites’ objects allows the computation of indirect
distances).
Essentially, it is required that the indirect distance d can be
computed between any pair of physical attributes.</p>
        <p>
          Observe that, analogously to rare-attributes, a concept of
rare-instances could be introduced. However, the crawler [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] used during
the experiments gathers extensionally redundant websites.
        </p>
        <p>
          All the results previously obtained continue to hold with the
new definition of distance, and can be applied even in the presence of
sources that do not contain all the objects I of the hidden relation
T, provided that the input sources are extensionally redundant.
However, in order to obtain a solution with indirect distances, we
increase the complexity of the algorithms from quadratic to cubic, as
we reduce the computation of the distance function to the problem
of finding shortest paths in a graph by modeling physical attributes
as nodes and the distances among them as weighted edges. This
can be solved with the Floyd–Warshall algorithm [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
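        <p>A possible rendition of the indirect-distance computation (ours) runs the Floyd–Warshall scheme [<xref ref-type="bibr" rid="ref14">14</xref>] over the attribute graph, with missing direct overlaps encoded as infinite weights:</p>
        <preformat>
INF = float("inf")

def indirect_distances(attrs, direct):
    """attrs: physical attributes (nodes). direct: dict mapping a pair (a, b)
    to its direct distance when OVq,A holds; pairs without enough overlap
    are absent. Returns the smallest direct or indirect distance for every
    pair (Floyd-Warshall, O(n^3))."""
    d = {(a, b): INF for a in attrs for b in attrs}
    for a in attrs:
        d[(a, a)] = 0.0
    for (a, b), wt in direct.items():
        d[(a, b)] = d[(b, a)] = wt
    for w in attrs:          # intermediate attribute
        for i in attrs:
            for j in attrs:
                if d[(i, j)] > d[(i, w)] + d[(w, j)]:
                    d[(i, j)] = d[(i, w)] + d[(w, j)]
    return d
        </preformat>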
      </sec>
    </sec>
    <sec id="sec-11">
      <title>EXPERIMENTAL EVALUATION</title>
      <p>
        In this section we present a preliminary experimental evaluation
of our model, conducted by using a special-purpose crawler [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to
collect 100 websites over three application domains: Soccer,
Videogames, and Finance. Each source consists of tens to thousands
of pages, and each page contains detailed data about one object
of the corresponding entity. For each domain, we then selected
the 20 largest sources and manually verified the hypotheses of our
generative process.
      </p>
      </p>
      <p>Intensional Redundancy We start by evaluating the redundancy at
the schema level. We observe that in the soccer and video-game
domains the majority of the conceptual attributes are not rare. In fact,
in the soccer domain, only 25% of the attributes are rare, and they
mostly come from websites that are not only about soccer players.
For example, a website containing info about Olympic athletes
exposes the attribute medals, while a club website exposes the debut
date only for the players coming from its own youth academy. For
the video-games the percentage of rare attributes is slightly over
30% of the total; in this case rare attributes come from distinct
information with very similar semantics, such as difficulty and
learning curve. Finally, in the stock quote scenario the percentage of
rare attributes is over 40% of the total. This is not surprising, as in
the financial domain there is a large number of attributes (89 for 20
sources) due to the presence of many indicators used for technical
analysis. For this domain, 20 sites are sufficient only to get a very
rough estimation of the set of attributes published by the sites of
the domain. We expect that the percentage of rare attributes would
significantly drop as the number of web sources increases.</p>
      <p>Extensional Redundancy Within the same domain, several
objects are shared by several sources. The overlap is almost total for
the stock quotes, while it is more varied for soccer players and
video-games, as these domains include both large popular sites and
small ones. Over the 100 sources, we computed that each soccer
player object appears on average in 1.6 sources, each video-game
in 24.5 sources, and each stock quote in 92.8 sources. In particular,
OVq,A(Si, Sj) is true with q = 5 for all the non-rare attributes on
all the websites.</p>
      <p>Errors and Thresholds We manually verified the thresholds for
the conceptual attributes of the three domains. As expected, they
have very different values, depending on the domain and the
attribute considered. As an example, in the finance domain a
threshold of 0.023 is needed for the Max value of a stock quote (0.029 for
the Min), while for the Volume a threshold of 0.5 is sufficient. In
the soccer and video-games cases, the thresholds are higher, such
as 0.44 for PlayerName (or video-game Title) and 0.36 for
BirthCountry. More importantly, we were able to verify that the
separable semantics property always holds.</p>
    </sec>
    <sec id="sec-12">
      <title>RELATED WORK</title>
      <p>
        Our techniques are related to projects on the integration of web
data, such as PAYASYOUGO [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, the proposed
integration techniques are based on the availability of attribute labels,
while our approach aims at integrating unlabeled data from web
sites. TurboWrapper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] has similar limits: it relies on the syntactic
structure of attributes (e.g., the ISBN number for books)
without considering the redundancy of information that occurs at the
instance level.
      </p>
      <p>
        The exploitation of structured web data is the primary goal of
WebTables [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and ListExtract [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which concentrate on data
published in HTML tables and lists, respectively. Compared to
information extraction approaches, WebTables and ListExtract extract
relations with involved relational schemas, but they do not address
the issue of integrating the extracted data.
      </p>
      <p>
        Cafarella et al. have described a system to populate a
probabilistic database with data extracted from the web [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, the
data are retrieved by TextRunner [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an information extraction
system that is not targeted at data-rich web pages as ours is. Octopus [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and Cimple [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] support users in the creation of data sets from web
data by means of a set of operators to perform search, extraction,
data cleaning and integration. Although such systems have a more
general application scope than ours, they involve users in the
process, while our approach is completely automatic.
      </p>
    </sec>
    <sec id="sec-13">
      <title>APPENDIX A. PROOFS</title>
      <p>We start with a preliminary lemma needed to prove the
correctness of the presented algorithms.</p>
      <sec id="sec-14-1">
        <title>LEMMA A.1. INTRACLOSERTHANINTER.</title>
        <p>∀ eri, erj ∈ A1, erk ∈ A2 : d(V(eri), V(erj)) &lt; d(V(eri), V(erk))</p>
        <p>PROOF. The extraction rule erk can be correct or weak. We
prove the lemma for the two cases.</p>
        <p>1. erk is correct: consider eri, erj ∈ A1 and erk ∈ A2
when the separable semantics property holds.
By definition: ∀ eri, erj ∈ A1, ∃ tA1 : d(V(eri), V(erj)) &lt; tA1.
Separable semantics: ∀ A1, A2, eri ∈ A1, erk ∈ A2 : i ≠ k ∧ A1 ≠ A2 ⇒
d(V(eri), V(erk)) &gt; max(tA1, tA2).
We can derive:
d(V(eri), V(erj)) &lt; tA1 ≤ max(tA1, tA2) &lt; d(V(eri), V(erk)).
Therefore:
d(V(eri), V(erj)) &lt; d(V(eri), V(erk)).</p>
        <p>2. erk is weak (erkw): in the following we treat single values as
singleton vectors, and we denote with V[i, ..., j] the subvector of
values of V from index i (included) to index j (excluded). We first
introduce a monotonicity property of the distance function. Given two
vectors V1 and V2 with n values and a distance d(V1, V2) between them,
let V2' be a copy of V2. If we replace the i-th element V2[i] with a
new element V2[i]' such that d(V1[i], V2[i]) &lt; d(V1[i], V2[i]'), it
follows that d(V1, V2) &lt; d(V1, V2').<sup>4</sup>
In this second case erkw is a weak rule, that is, it can potentially
contain values taken from A1, A2, or any other A. We consider the
instance-aligned vectors of values V'k = V(erkw), V'i = V(eri) and
V'j = V(erj), and we remove from the analysis the instances where erkw,
eri, and erj extract the same value: let Vk, Vi and Vj be the vectors
with the remaining values. As erkw cannot contain only values coming
from A1 (otherwise it would not be a weak rule, but a correct extraction
rule of A1), the length of these vectors must be greater than zero;
notice also that Vk now does not contain any value coming from A1 (they
have all been removed). We now show by induction on the length of the
vectors that d(V(eri), V(erj)) &lt; d(V(eri), V(erkw)).</p>
        <p>Base case: let Vk[0] be the first value of Vk. We know that it is a
correct value for a conceptual attribute different from A1. Therefore,
the property shown in the previous case applies to the singleton
vectors: d(Vi[0, ..., 1], Vj[0, ..., 1]) &lt; d(Vi[0, ..., 1], Vk[0, ..., 1]).</p>
        <p>Inductive step: the inductive hypothesis is that the claim holds for
the first n elements of the vectors. We show that it is true for n + 1
elements of the vectors. Again, for the property we just showed,
d(Vi[n + 1], Vj[n + 1]) &lt; d(Vi[n + 1], Vk[n + 1]) holds. For the
monotonicity property of the distance function, the claim follows for
the vectors extended with the (n + 1)-th element.</p>
      </sec>
      <sec id="sec-14-3">
        <title>LEMMA A.2. ABSTRACTINTEGRATION is correct.</title>
        <p>PROOF. When the separable semantics property holds, the weights
of the edges among attributes with different semantics are always
higher than the weights of the edges among attributes with the same
semantics. This implies that the edges in L are divided into two
sublists. In the first sublist (lower weights) we have pairs of attributes
that have the same semantics. We can therefore add to the
solution all the pairs in the first sublist. In the second sublist (higher
weights) we have pairs of attributes with different semantics, and
we must avoid adding an edge from this sublist to the solution.
The problem here is that it is not known a priori where the second
sublist starts. But we know that when the algorithm gets to the first
edge of the second sublist, all and only the attributes with the same
semantics have been grouped in mappings.</p>
        <p>Therefore the partial solution is correct.</p>
        <p>We now need to show that the algorithm stops at the first edge of
the second sublist. The first edge in the second sublist is an edge
between two mappings m1, m2 with different semantics. When
the intensional redundancy property holds, there must be a source
which publishes two attributes ai, aj such that they are contained in
m1 and m2, respectively. By the local consistency of sources there
cannot be a mapping that contains both ai and aj, and therefore the first
edge of the second sublist is detected and the algorithm ends.</p>
      </sec>
      <sec id="sec-14-4">
        <title>LEMMA A.3. ABSTRACTEXTRACTION is correct.</title>
        <p>PROOF. In any iteration of step (a) we select two correct
extraction rules er1, er2 ∈ A1. This is equivalent to showing that if
we list the pairs of extraction rules in ascending order of distance, the first
pair certainly contains only correct extraction rules. Suppose, by
contradiction, that the first pair contains a weak rule. This contradicts the
INTRACLOSERTHANINTER Lemma.</p>
        <p>Every time two correct extraction rules er1 and er2 are chosen, all
the weak rules sharing at least one value with er1 or
er2 are removed (steps (c) and (d)). Therefore, after the algorithm
has chosen all the correct rules, there cannot be a weak rule left in W:
weak rules mix values shared with correct rules, and they have been
discarded as soon as the correct rules have been identified.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Banko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Soderland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Broadhead</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Open information extraction from the web</article-title>
          .
          <source>In IJCAI</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bronzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Crescenzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          .
          <article-title>Redundancy-driven web data extraction and integration</article-title>
          .
          <source>In WebDB</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Crescenzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          .
          <article-title>Supporting the automatic construction of entity aware search engines</article-title>
          .
          <source>In WIDM</source>
          , pages
          <fpage>149</fpage>
          -
          <lpage>156</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>Structured queries over web text</article-title>
          .
          <source>IEEE Data Eng. Bull.</source>
          ,
          <volume>29</volume>
          (
          <issue>4</issue>
          ):
          <fpage>45</fpage>
          -
          <lpage>51</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Khoussainova</surname>
          </string-name>
          .
          <article-title>Data integration for the relational web</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1090</fpage>
          -
          <lpage>1101</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Webtables: exploring the power of tables on the web</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>538</fpage>
          -
          <lpage>549</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ye</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Template detection for large scale search engines</article-title>
          .
          <source>In Proceedings of the 2006 ACM symposium on Applied computing, SAC '06</source>
          , pages
          <fpage>1094</fpage>
          -
          <lpage>1098</lpage>
          , New York, NY, USA,
          <year>2006</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.-L.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Context-aware wrapping: Synchronized data extraction</article-title>
          .
          <source>In VLDB</source>
          , pages
          <fpage>699</fpage>
          -
          <lpage>710</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Ipeirotis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Verykios</surname>
          </string-name>
          .
          <article-title>Duplicate record detection: A survey</article-title>
          .
          <source>IEEE Trans. Knowl. Data Eng.</source>
          ,
          <volume>19</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Elmeleegy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Harvesting relational tables from lists on the web</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1078</fpage>
          -
          <lpage>1089</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Jeffery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Web-scale data integration: You can afford to pay as you go</article-title>
          .
          <source>In CIDR 2007</source>
          , pages
          <fpage>342</fpage>
          -
          <lpage>350</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Das Sarma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <article-title>Bootstrapping pay-as-you-go data integration systems</article-title>
          .
          <source>In SIGMOD Conference</source>
          , pages
          <fpage>861</fpage>
          -
          <lpage>874</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>DeRose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          .
          <article-title>Toward best-effort information extraction</article-title>
          .
          <source>In SIGMOD Conference</source>
          , pages
          <fpage>1031</fpage>
          -
          <lpage>1042</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          .
          <source>The Algorithm Design Manual (2. ed.)</source>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>