Contextual Data Extraction and Instance-Based Integration

Lorenzo Blanco  Valter Crescenzi  Paolo Merialdo  Paolo Papotti
Dipartimento di Informatica ed Automazione
Università degli Studi Roma Tre
blanco,crescenz,merialdo,papotti@dia.uniroma3.it

ABSTRACT
We propose a formal framework for an unsupervised approach that tackles two problems at the same time: the data extraction problem, i.e., generating the extraction rules needed to obtain data from web pages, and the data integration problem, i.e., integrating the data coming from several sources. We motivate the approach by discussing its advantages over the traditional "waterfall approach", in which data are wholly extracted before the integration starts, without any mutual dependency between the two tasks. In this paper, we focus on data that are exposed by structured and redundant web sources. We introduce novel polynomial algorithms to solve the stated problems and present theoretical results on the properties of the solution generated by our approach. Finally, a preliminary experimental evaluation shows the applicability of our model with real-world websites.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. This article was presented at the workshop Very Large Data Search (VLDS) 2011. Copyright 2011.

1. INTRODUCTION
The development of scalable techniques to extract and integrate data from fairly structured large corpora available on the web is a challenging issue, because the web scale imposes the use of unsupervised and domain independent techniques. To cope with the complexity and the heterogeneity of web data, state-of-the-art approaches focus on information organized according to specific patterns that frequently occur on the web. Meaningful examples are presented in [6], which focuses on data published in HTML tables, and information extraction systems, such as TextRunner [1], which exploit lexical-syntactic patterns. As noticed in [6], even if only a small fraction of the web is organized according to these patterns, due to the web scale the amount of data involved is impressive: in their case, more than 154 million tables were extracted from 1.1% of the considered pages.

In large data-intensive websites, we observe two important characteristics that suggest new opportunities for the automatic extraction and integration of web data:

• local regularities: in these sites, large amounts of data are usually offered by thousands of pages, each encoding one tuple in a local HTML template. For example, each page shown in Figure 1 comes from a different source and publishes information about a single company stock.

• global information redundancy: at the web scale many sources provide similar information. The redundancy occurs both at the schema level (the same attributes are published by more than one source) and at the instance level (some objects are published by more than one source). In our example, many attributes are present in all the sources (e.g., the company name, last trade price, volume), while others are published only by a subset of the sources (e.g., the "Beta" indicator). At the extensional level, there is a set of stock quotes that are published by several sources. As web information is inherently imprecise, redundancy also implies inconsistencies; that is, sources can provide conflicting information for the same object (e.g., a different value for the volume of a given stock).

These observations lead us to focus on pages that are published following the one-tuple-per-page pattern: in each structured page you can find information about a single tuple. If we abstract this representation, we may say that a collection of structurally similar pages provided by the same site corresponds to a relation. According to this abstraction, the websites for the pages in Figure 1 expose their own version of the "StockQuote" relation.¹

¹For the sake of simplicity, we consider only the one-tuple-per-page pattern. Other variations of this pattern can be easily developed, for example by preprocessing the pages with tools that fragment the HTML tables into rows [7].

Starting from the crawled web pages (for instance, gathered by the specialized crawler introduced in [3]), our goal is: (i) to transform the web pages coming from each source into a relation, and (ii) to integrate these relations, creating a database containing the information provided by all the sources. A state-of-the-art solution to this problem is a two-step waterfall approach, where a schema matching algorithm is applied over the relations returned by automatically generated wrappers. However, when a large number of sources is involved and a high level of automation is required, important issues may arise:

• Wrapper Inference Issues: since wrappers are automatically generated by an unsupervised process, they can produce imprecise extraction rules (e.g., rules that extract irrelevant information mixed with data of the domain). To obtain correct rules, the wrappers should be evaluated and refined manually.

• Integration Issues: the relations extracted by automatically generated wrappers are "opaque", i.e., their attributes are not associated with any (reliable) semantic label. Therefore the matching algorithm must rely on an instance-based approach, which considers attribute values to match schemas.
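To illustrate the instance-based approach, here is a minimal sketch (a toy example of our own, not part of the system): two sources publish columns with opaque labels, and candidate matches are found by comparing the published values.

```python
def column_similarity(col_a, col_b):
    """Jaccard similarity between the value sets of two opaque columns."""
    a, b = set(col_a), set(col_b)
    return len(a & b) / len(a | b)

# Two sources publishing stock data; the column labels carry no semantics.
source_1 = {"c1": ["YHOO", "AAPL"], "c2": ["17.01", "351.99"]}
source_2 = {"k1": ["17.01", "352.10"], "k2": ["YHOO", "AAPL"]}

# Pair each column of source_1 with its most similar column of source_2.
matches = {
    name: max(source_2, key=lambda k: column_similarity(col, source_2[k]))
    for name, col in source_1.items()
}
# matches: {"c1": "k2", "c2": "k1"}
```

Even in this toy case, a conflicting value (352.10 vs. 351.99) lowers the similarity of the correct pair, which is why noisy sources make instance-based matching delicate.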
However, due to errors introduced by the publishing process, instance-based matching is challenging because the sources may provide conflicting values. Also, imprecise extraction rules return wrong, and thus inconsistent, data.

In [2] we presented a best-effort solution to these issues that takes advantage of the mutual coupling between the wrapper inference and the data integration tasks. In the present paper, we investigate the foundations of the problem and propose a principled solution based on the following contributions: (i) we propose a formal setting to state the data extraction and the data integration problems for redundant and structured web sources; (ii) we formulate a set of hypotheses that capture a few natural constraints that characterize this kind of web sources; (iii) we propose novel unsupervised polynomial algorithms to solve the stated problems whenever these hypotheses hold; (iv) we present an experimental evaluation of our model with real-world websites.

In the next section we describe a generative model of the web pages to introduce our formal setting of structured and redundant sources. Section 3 contains the algorithms to solve the integration and extraction problems in the case of perfectly overlapping sources. In Section 4 we show how removing this hypothesis affects our solution, while in Section 5 we discuss preliminary experiments with a set of sources gathered from the Web. Section 6 discusses related work and concludes the paper.

Figure 1: Two web pages containing data about stock quotes from the Reuters and Google Finance websites.

2. THE GENERATIVE MODEL
We are interested in extracting and integrating all the available information about a target entity, such as the STOCKQUOTE entity of our running example. As several sources on the Web publish information about the same entity, we can imagine that there exists a hidden relation T, which contains all the true information about the objects that belong to the entity, and that sources generate their pages by taking data from T.

We call conceptual instances the set of tuples I of the relation T. Each tuple I ∈ T represents a real-world object of the target entity of interest. For example, in the case of the STOCKQUOTE entity, the conceptual instances of T model the data about the Apple stock quote, the Yahoo! stock quote, and so on. T has a set of attributes A called conceptual attributes. In our example they represent the attributes associated with a stock quote, such as the company name, the current trade price, the volume, and so on.

Given a set of sources S = {S1, ..., Sm}, each source Si, i = 1...m, can be seen as the result of a generative process applied over the hidden relation T. The attributes published by a source are called physical attributes, as opposed to the conceptual attributes of T, and we write S(a) to denote that a source S publishes a physical attribute a, and a ∈ A to state that a physical attribute a contains data from a conceptual attribute A. A source publishes information about a subset of the conceptual instances, and different sources may publish different subsets of the conceptual attributes. To model the presence of the conflicting data that usually occur among redundant sources, we assume that sources are noisy: they may introduce errors, imprecise values, or null values over the data picked from the hidden relation. As depicted in Figure 2, for each source Si we can abstract the page generation process as the application of the following operators over the hidden relation:

• Selection σi: returns a relation containing a subset of the conceptual instances, σ(I) ⊆ I.

• Projection πi: returns a relation containing a subset of the conceptual attributes, π(A) ⊆ A.

• Error ei: a function that returns a relation in which each correct value is either kept or replaced with a null value, a synthetic value, or a value similar to the correct one.

• Encode λi: an encoding function that produces a web page for each tuple by embedding its values into an HTML template.

Figure 2: The publishing process: the web sources are views over the hidden relation generated by four operators.

The set of pages published by a source Si can be thought of as a view over the hidden relation, obtained by applying the above operators as follows: Si = λi(ei(πi(σi(T)))). From this perspective, the extraction and the integration problems can be stated in terms of these operators. Extraction becomes the inversion of the λ operator, that is, obtaining for each source Si the associated relation Vi = ei(πi(σi(T))). Integration becomes the problem of reconstructing T from the views associated with the sources.

Notice that both problems are far from trivial, as state-of-the-art automatic wrapper inference systems are not able to create perfect wrappers, and the integration task is further complicated by the presence of errors and by the absence of reliable semantic labels.

In the following we discuss under which assumptions on the intensional and extensional redundancy exhibited by the sources our approach is able to deal with a bounded amount of error.

2.1 Intensional Redundancy
We now discuss two properties of the generative model. The first property expresses that the data published by each source are locally consistent. That is, among the physical attributes published by the same source, there cannot be distinct attributes with the same semantics. For example, if a website states that the current value of the stock quote "YHOO" is 17.01, there cannot be another place in the same site where you can find a different value with the same semantics. Therefore, we can write:

PROPERTY 1. Local consistency:
∀ai, aj ∈ A : S(ai) = S(aj) ⇒ ai = aj
(in a conceptual attribute A there cannot exist two physical attributes coming from the same source).

The second property formalizes the presence of redundancy at the intensional level. Namely, we assume that every possible pair of conceptual attributes is published by at least one source. Let S(Ai) denote a predicate that returns true if the source S publishes the conceptual attribute Ai. A set of sources S is called intensionally redundant if the following property holds:

PROPERTY 2. Intensional redundancy:
∀Ai, Aj : i ≠ j ∃ S : S(Ai) ∧ S(Aj)
(every possible pair of conceptual attributes is published by at least one source).

Notice that, given any set of websites, this property may hold only for appropriately chosen subsets. However, the web scale represents an opportunity to gain enough redundancy: (i) in each domain, there is usually a subset of attributes that is published by most of the sources;² (ii) as the number of considered websites increases, the probability of meeting a new conceptual attribute decreases, and the probability of intensional redundancy increases. In the following we call core-attributes the attributes for which the intensional redundancy holds, and we call rare-attributes all the others.

²In [11] the authors write "there is a core set of attributes that appear in a large number of items".

2.2 Extensional Redundancy
At the extensional level, we assume that sources publish a common set of instances. For the sake of presentation, in the next section we rely on the hypothesis that all the sources publish the same set of instances I. Later, in Section 4, we discuss how to remove this hypothesis.

3. EXTRACTION AND INTEGRATION ALGORITHMS
Our solutions to the extraction and integration problems are discussed in this section: we propose the problem statements, the definition of solutions, and the algorithms that solve them in polynomial time.

Given a set of sources S, each Si publishes a view of the hidden relation T such that Vi = ei(πi(σi(T))). The integration problem can be thought of as the creation of sets of physical attributes m1, ..., mn, called mappings, such that each attribute a belongs to a mapping m and each mapping contains all and only the attributes with the same semantics. The problem can be defined as follows:

PROBLEM 1. Integration Problem: given a set of source views V = V1, ..., Vn, where Vi = ei(πi(σi(T))), find a set of mappings M such that M = {m : a1, a2 ∈ m ⇔ a1, a2 ∈ A}.

Intuitively, we solve the problem by evaluating the physical attributes of each source and by building aggregations of attributes with the same semantics across the sources. If at the end of the process each mapping contains all and only the physical attributes with the same semantics, we have a solution for the problem. For example, given a1, a3 ∈ V1 and a2 ∈ V2, with a1, a2 having the same semantics, a solution is m1 = {a1, a2} and m2 = {a3}.

If the sources published correct data only, then a naive greedy algorithm would easily solve the problem above. However, real sources introduce noise in the values (modeled by the error function e) that can make the integration difficult or even impossible.

To identify physical attributes with the same semantics, we rely on a distance function d(ai, aj) among the values of a set of instances corresponding to the same real-world object.³ This function compares aligned values and returns a score between 0 and 1: the more similar the values, the lower the distance. As the distance function works by comparing values of aligned instances, it can be easily extended to work also on conceptual attributes. We denote by d(a, A) the distance between the values of the physical attribute a and the values of the conceptual attribute A.

³We assume that instances can be aligned by applying a standard record-linkage technique [9].

In the following, we study a class of error functions for which our algorithm can compute a solution by bounding the amount of errors that sources are allowed to introduce.
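As an illustration, the instance-based distance over aligned values can be sketched as follows (a hypothetical implementation: the edit-ratio comparison of single values and the averaging scheme are our own choices, not prescribed by the model):

```python
from difflib import SequenceMatcher

def value_distance(u, v):
    """Distance between two aligned values, in [0, 1] (0 = identical)."""
    if u is None or v is None:
        return 1.0                      # nulls carry no evidence of a match
    return 1.0 - SequenceMatcher(None, str(u), str(v)).ratio()

def attribute_distance(a_i, a_j):
    """d(a_i, a_j): mean value distance over the instances shared by two
    physical attributes, each given as a dict: instance id -> value."""
    shared = a_i.keys() & a_j.keys()    # instances aligned by record linkage
    if not shared:
        return float("inf")             # no overlap: distance undefined
    return sum(value_distance(a_i[k], a_j[k]) for k in shared) / len(shared)

def attribute_to_mapping_distance(a, mapping):
    """d(a, A): distance between a physical attribute and a mapping of
    attributes, here taken as the average of the pairwise distances."""
    ds = [attribute_distance(a, b) for b in mapping]
    return sum(ds) / len(ds)
```

With this sketch, two price columns that disagree slightly on a few stocks still end up much closer to each other than either is to a volume column, which is the behavior the separability assumptions below rely on.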
3.1 Integration Problem
We first discuss the integration problem while ignoring the extraction issues; we then discuss the main properties of the extraction rules and, finally, how the integration and the extraction problems can be tackled together. Therefore, for the time being we assume we have a wrapper generator that is capable of perfectly inverting the encode operator λ. In other words, we do not work on web pages, but directly on the views of T published by the sources.

For every conceptual attribute A, let tA denote a minimal threshold such that any physical attribute ai belongs to A if, for each aj ∈ A, d(ai, aj) < tA. For example, in the finance domain, a low threshold should be associated with the conceptual attribute Max value of a stock quote. This is required as there are other conceptual attributes, like the current Price and the Min value, that have similar values. On the other hand, the mapping for the trading Volume conceptual attribute can have a higher threshold, since it does not usually assume values close to those of other attributes. Note that tA is an ideal, unknown threshold: it is not given as input to the integration problem, and it is not necessary to know it a priori to compute the solution.

We consider as given an unsupervised wrapper generator system W. A wrapper w is an ordered set of extraction rules, w = {er1, ..., erk}, that apply over a web page: each rule extracts a string from the HTML of the page. We denote by er(p) the string returned by the application of the rule er over the page p. The application of a wrapper w over a page p, denoted w(p), returns a tuple t = ⟨er1(p), ..., erk(p)⟩; therefore, the application of a wrapper over the set of pages of a source S returns a relation w(S), which has as many attributes as the number of extraction rules of the wrapper. A column of the relation is a vector of values denoted V(eri): it is the result of the application of an extraction rule eri over the pages of a source.

In order to solve the integration problem, it is required that the publishing errors cannot introduce enough noise to confuse the semantics of a physical attribute. We call this property separable semantics:

PROPERTY 3. Separable semantics:
∀A1, A2, ai ∈ A1, aj ∈ A2 : ai ≠ aj ∧ A1 ≠ A2 ⇒ d(ai, aj) > max(tA1, tA2)
(it is possible to distinguish the semantics of physical attributes).

In order to solve the integration problem with noisy sources, we define the greedy clustering Algorithm 1.

Algorithm 1 ABSTRACTINTEGRATION
Input: A set of locally consistent, intensionally redundant sources with separable physical attributes.
Output: The correct set of mappings M.
Let G = (N, E) be a graph where every attribute ai of every source Si ∈ S is a node n ∈ N. For every pair of distinct nodes ai, aj ∈ N such that S(ai) ≠ S(aj), add an edge e between them to E and let d(ai, aj) be the weight of e. Let m(ai) be the mapping containing the attribute ai.
1. Add to M a mapping m = {ai} for each node ni ∈ N,
2. insert in a list L the edges E,
3. sort L by the weight of the edges in ascending order,
4. for each edge (a1, a2) ∈ L:
   (a) let m be the union of the attributes in m(a1) and m(a2);
   (b) if in m there is no pair ai, aj such that S(ai) = S(aj)
   (c) then add m to M and remove m(a1), m(a2) from M;
   (d) else break.

We are now ready to prove the correctness of the integration algorithm.

LEMMA 3.1. ABSTRACTINTEGRATION is correct.

PROOF. Moved to Appendix A.

ABSTRACTINTEGRATION is O(n²) over the total number of physical attributes; in fact, most of the time is required to create the edges of the graph G.

In the following we introduce the extraction problem, that is, how to get the physical attributes we considered as input of the integration problem.

3.2 Extraction Rules
In our framework, a data source S is a collection of pages S = p1, ..., pn from the same website, such that each page publishes information about one object of a real-world entity of interest. We distinguish between two different types of values that can appear in a page: target values, that is, values that are derived from the hidden relation T, and noise values, that is, values that are not of interest for our purpose (e.g., advertising, template, layout, etc.).

We say that an extraction rule er* is correct if for every given page it extracts a value of the same conceptual attribute (i.e., target values with the same semantics), or a null value if the value for the attribute is missing in that page. If an extraction rule only extracts noise values, it is considered noisy. We also say that an extraction rule erw is weak if it mixes either target values with different semantics or target values with noise values.

Unsupervised wrapper generators are powerful enough to infer the correct extraction rules needed to cover the data exposed by what we call regularly structured websites.

PROPERTY 4. Regularly structured sources: The sources S = {Si} are regularly structured w.r.t. a given unsupervised wrapper generator system W if W generates for each source Si ∈ S a set of rules wi containing all the correct rules.

However, wrapper generators cannot automatically identify, among the generated rules, which are the correct ones. They also produce weak rules since, at wrapper generation time, there is not enough information to automatically establish whether a rule is correct or weak. The integration algorithm has been presented considering only the correct rules (i.e., physical attributes). However, noisy rules, if considered along with the correct ones, are harmless, as they can be identified and deleted during the integration step. They will eventually generate singleton mappings, since the distance between a noisy rule and a correct rule prevents them from grouping. Similar arguments apply to distances amongst noisy rules.

Weak rules require a more detailed discussion, as unfortunately they cannot be identified and disregarded at the wrapper generation step. In the following we show that, if we keep the same assumptions introduced for the integration problem, we can always identify weak rules during the integration step.

3.3 Extraction Problem
The extraction problem is defined as follows:

PROBLEM 2. Extraction Problem: given a set of sources S = {Si}, produce a set of wrappers W* = {wi} such that wi contains all and only the correct rules for Si.

We now describe how we leverage the redundant information among different sources to identify and filter out the weak rules. Let eri and erj be two extraction rules. We say that two extraction rules "overlap" if they extract from a page the same occurrence of the same string. In this case, one of them must be a weak rule. In other terms, if many rules extract the same value occurrence from at least one page, only one of them is a correct rule and all the others are weak ones. With an abuse of notation, we will write er ∈ A to state that an extraction rule extracts at least a correct value of the conceptual attribute A. Notice that, as a weak rule erw can extract values from n conceptual attributes, we can say erw ∈ A1, ..., An.

4. NON-OVERLAPPING SOURCES
So far we have simplified the discussion by hypothesizing that every source publishes data about every object in I. In this section we remove this simplification and use IiA to denote the subset of I for which Si provides values of the conceptual attribute A.
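For concreteness, the greedy clustering of Algorithm 1 above can be sketched as follows (an illustrative Python rendering; the data layout, names, and any distance function plugged in are our own assumptions, not part of the formal algorithm):

```python
from itertools import combinations

def abstract_integration(attrs, source_of, dist):
    """Greedy clustering sketch of Algorithm 1 (ABSTRACTINTEGRATION).

    attrs     - list of physical attribute ids
    source_of - dict: attribute id -> source id
    dist      - function (attr, attr) -> distance in [0, 1]
    Returns the set of mappings as a list of frozensets."""
    # Step 1: one singleton mapping per attribute (node).
    mapping_of = {a: frozenset([a]) for a in attrs}
    # Steps 2-3: edges between attributes of distinct sources, by weight.
    edges = sorted(
        (e for e in combinations(attrs, 2) if source_of[e[0]] != source_of[e[1]]),
        key=lambda e: dist(*e),
    )
    # Step 4: merge mappings along increasing-weight edges.
    for a1, a2 in edges:
        m1, m2 = mapping_of[a1], mapping_of[a2]
        if m1 == m2:
            continue                      # already grouped
        merged = m1 | m2
        sources = [source_of[a] for a in merged]
        if len(sources) != len(set(sources)):
            break                         # step (d): same-source pair, stop
        for a in merged:
            mapping_of[a] = merged
    return list(set(mapping_of.values()))
```

Under the separable-semantics property, every intra-mapping edge is lighter than every inter-mapping edge, so the first edge that would put two attributes of the same source together marks the point where the sorted list switches from same-semantics to different-semantics pairs, which is why the sketch (like step 4(d)) stops there.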
These intuitions are applied in the following greedy algorithm, which solves the extraction problem:

Algorithm 2 ABSTRACTEXTRACTION
Input: A set of locally consistent, intensionally redundant sources with separable physical attributes; the set of wrappers W produced by a wrapper generator system W w.r.t. which the sources are regularly structured.
Output: A set of wrappers W* that do not contain weak rules.
1. while there is an er ∈ W which is not marked as correct:
   (a) let d(V(eri), V(erj)) be the minimal distance between the values of two extraction rules in W such that at least one of them is not marked as correct;
   (b) mark eri and erj as correct (they are correct rules);
   (c) remove from W all the rules that overlap with eri (they are weak rules);
   (d) remove from W all the rules that overlap with erj (they are weak rules).
2. now W is W*.

The algorithm ABSTRACTEXTRACTION takes as input a set of wrappers W and computes W*, which does not contain weak rules. To explain the algorithm, we rely on the example in Figure 3. Consider two websites and two objects (say, two stock quotes) that are published by both sites. In the figure, a circle represents the values extracted by an extraction rule, and its number represents the website it has been executed on. For example, in the diagram on the left, the dark circle marked with 1 extracts from the first website the values 10 for the first object and 11 for the second object. Notice that the input of the algorithm is the set of circles annotated with the extracted values and the website of provenance. Let us say now that in step (a) the algorithm identifies the dark circles as the closest ones, and marks them as correct in step (b). At this point both the circles with horizontal stripes overlap with correct rules and can therefore be removed in steps (c) and (d). In the diagram on the right we show the resulting scenario: the dark circles are now the closest rules and are marked as correct. The remaining dashed circles do not match at all (i.e., they are noisy rules) and lead to the creation of two singleton mappings. To prove that the above algorithm is correct we introduce the following lemma:

LEMMA 3.2. ABSTRACTEXTRACTION is correct.

PROOF. Moved to Appendix A.

Figure 3: The Extraction algorithm in action.

ABSTRACTEXTRACTION is O(n²) over the total number of extraction rules generated by the automatic wrapper generation system. As in the case of ABSTRACTINTEGRATION, most of the time is spent computing distances between the extracted values.

The distances amongst physical attributes from several sources have so far been computed using an instance-based metric that relies on the availability of a set of shared instances between the involved sources. Therefore, if we want to compute the direct distance between two attributes d(ai, aj), with S(ai) = Si and S(aj) = Sj (Si ≠ Sj), we need a non-empty overlap of objects between Si and Sj; otherwise we consider d(ai, aj) = ∞.

To formalize this aspect, given a positive integer parameter q, let OVq,A(Si, Sj) be a predicate that is true iff |IiA ∩ IjA| ≥ q, i.e., both Si and Sj publish a value of the attribute A for a shared set of at least q instances. Intuitively, OVq,A(Si, Sj) is true if we consider Si and Sj to share enough instances (at least q) to be directly comparable on A's values. The value of q has to be chosen according to a trade-off: the higher the value, the more reliably the instance-based distance performs, as it can be computed over a larger set of shared instances; however, a lower number of sources will be directly comparable.

We are now ready to tackle the main issue: how do we compute the distance when the direct distance is not defined, i.e., when the overlap is not sufficient or not available at all and OVq,A(Si, Sj) does not hold?

We introduce the indirect distance by leveraging intermediate sources that share instances with two sources not directly having enough overlap. Given a third source Sw such that both OVq,A(Si, Sw) and OVq,A(Sw, Sj) hold, since the shortest path between two points is a straight line, we can write: d(ai, aj) ≤ d(ai, aw) + d(aw, aj). In this case, we have an upper bound for d(ai, aj) that we call indirect distance, based on the availability of the two direct distances between (Si, Sw) and between (Sw, Sj).

In the previous example we used just one intermediate source (Sw); the same principle can be trivially extended to a generic number of intermediate sources. However, the more intermediate sources are involved, the less precise is the bound imposed on d(ai, aj). In the case that we have multiple possible indirect distances, the smallest bound is chosen.

We call OV*q,A the transitive closure of OVq,A: OV*q,A(Si, Sj) is true iff it is possible to compute a distance (direct or indirect) between Si and Sj for the attribute A. If two attributes ai and aj are not comparable over the same conceptual attribute A, that is, OV*q,A(Si, Sj) is false, we set d(ai, aj) = ∞.

Let S(A) denote the set of websites in S publishing values of the conceptual attribute A ∈ A. A set of websites S is called extensionally redundant if the following property holds:

PROPERTY 5. Extensional redundancy:
∀A ∈ A : OV*q,A = S(A) × S(A)
(the overlap of the websites' objects allows the computation of indirect distances).

Essentially, it is required that the indirect distance can be computed between any pair of physical attributes. Observe that, analogously to rare-attributes, a concept of rare-instances could be introduced. However, the crawler [3] used during the experiments gathers extensionally redundant websites.

All the results previously obtained continue to hold with the new definition of distance, and they can be applied even in the presence of sources that do not contain all the objects I of the hidden relation T, provided that the input sources are extensionally redundant. However, in order to obtain a solution with indirect distances, we increase the complexity of the algorithms from quadratic to cubic, as we reduce the computation of the distance function to the problem of finding shortest paths in a graph by modeling physical attributes as nodes and the distances among them as weighted edges. This can be solved with the Floyd–Warshall algorithm [14].

5. EXPERIMENTAL EVALUATION
In this section we present a preliminary experimental evaluation of our model, conducted by using a special-purpose crawler [3] to collect 100 websites over three application domains: Soccer, Video-games, and Finance. Each source consists of tens to thousands of pages, and each page contains detailed data about one object of the corresponding entity. For each domain, we then selected the 20 largest sources and manually verified the hypotheses of our generative process.

Intensional Redundancy. We start by evaluating the redundancy at the schema level. We observe that in the soccer and video-game domains the majority of the conceptual attributes are not rare. In fact, in the soccer domain, only 25% of the attributes are rare, and they mostly come from websites that are not only about soccer players. For example, a website containing information about Olympic athletes exposes the attribute medals, while a club website exposes the debut date only for the players coming from its own youth academy. For the video-games the percentage of rare attributes is slightly over 30% of the total; in this case rare attributes come from distinct information with very similar semantics, such as difficulty and learning curve. Finally, in the stock quote scenario the percentage of rare attributes is over 40% of the total. This is not surprising, as in the financial domain there is a large number of attributes (89 for 20 sources), due to the presence of many indicators used for technical analysis. For this domain, 20 sites are sufficient only to get a very rough estimation of the set of attributes published by the sites of the domain. We expect that the percentage of rare attributes would significantly drop as the number of web sources increases.

Extensional Redundancy. Within the same domain, several objects are shared by several sources. The overlap is almost total for the stock quotes, while it is more articulated for soccer players and video-games, as these domains include both large popular sites and small ones. Over the 100 sources, we computed that each soccer player object appears on average in 1.6 sources, each video-game in 24.5 sources, and each stock quote in 92.8 sources. In particular, OVq,A(Si, Sj) is true with q = 5 for all the non-rare attributes for all the websites.

Errors and Thresholds. We manually verified the thresholds for the conceptual attributes of the three domains. As expected, these have very different values, depending on the domain and the attribute considered. As an example, in the finance domain a threshold of 0.023 is needed for the Max value of a stock quote (0.029 for the Min), while for the Volume a threshold of 0.5 is sufficient. In the soccer and video-games cases, the thresholds are higher, such as 0.44 for PlayerName (or video-game Title) and 0.36 for BirthCountry. More importantly, we were able to verify that the separable semantics property always holds.

6. RELATED WORK
Our techniques are related to projects on the integration of web data, such as PAYASYOUGO [12]. However, the proposed integration techniques are based on the availability of attribute labels, while our approach aims at integrating unlabeled data from web sites. TurboWrapper [8] has similar limits: it relies on the syntactic structure of attributes (e.g., the ISBN number for books) without considering the redundancy of information that occurs at the instance level.

The exploitation of structured web data is the primary goal of WebTables [6] and ListExtract [10], which concentrate on data published in HTML tables and lists, respectively. Compared to information extraction approaches, WebTables and ListExtract extract relations with involved relational schemas, but they do not address the issue of integrating the extracted data.

Cafarella et al. have described a system to populate a probabilistic database with data extracted from the web [4]. However, the data are retrieved by TextRunner [1], an information extraction system that is not targeted at data-rich web pages as ours is. Octopus [5] and Cimple [13] support users in the creation of data sets from web data by means of a set of operators to perform search, extraction, data cleaning, and integration. Although such systems have a more general application scope than ours, they involve users in the process, while our approach is completely automatic.

7. REFERENCES
[1] M. Banko, M. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In IJCAI, 2007.
[2] L. Blanco, M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Redundancy-driven web data extraction and integration. In WebDB, 2010.
[3] L. Blanco, V. Crescenzi, P. Merialdo, and P. Papotti. Supporting the automatic construction of entity aware search engines. In WIDM, pages 149–156, 2008.
[4] M. J. Cafarella, O. Etzioni, and D. Suciu. Structured queries over web text. IEEE Data Eng. Bull., 29(4):45–51, 2006.
[5] M. J. Cafarella, A. Y. Halevy, and N. Khoussainova. Data integration for the relational web. PVLDB, 2(1):1090–1101, 2009.
[6] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the web. PVLDB, 1(1):538–549, 2008.
[7] L. Chen, S. Ye, and X. Li. Template detection for large scale search engines. In SAC, pages 1094–1098, 2006.
[8] S.-L. Chuang, K. C.-C. Chang, and C. X. Zhai. Context-aware wrapping: Synchronized data extraction. In VLDB, pages 699–710, 2007.
[9] A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16, 2007.
[10] H. Elmeleegy, J. Madhavan, and A. Y. Halevy. Harvesting relational tables from lists on the web. PVLDB, 2(1):1078–1089, 2009.
[11] J. Madhavan, S. Cohen, X. L. Dong, A. Y. Halevy, S. R. Jeffery, D. Ko, and C. Yu. Web-scale data integration: You can afford to pay as you go. In CIDR, pages 342–350, 2007.
[12] A. D. Sarma, X. Dong, and A. Y. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD Conference, pages 861–874, 2008.
[13] W. Shen, P. DeRose, R. McCann, A. Doan, and R. Ramakrishnan. Toward best-effort information extraction. In SIGMOD Conference, pages 1031–1042, 2008.
[14] S. Skiena. The Algorithm Design Manual (2nd ed.). Springer, 2008.

APPENDIX
A. PROOFS
We start with a preliminary lemma needed to prove the correctness of the presented algorithms.

LEMMA A.1. INTRACLOSERTHANINTER:
∀er*i, er*j ∈ A1, erk ∈ A2 : d(V(er*i), V(er*j)) < d(V(er*i), V(erk)).

PROOF. The extraction rule erk can be correct or weak. We prove the lemma for the two cases:
1. < d(Vi[0, ..., n + 1], Vk[0, ..., n + 1]).

LEMMA A.2. ABSTRACTINTEGRATION is correct.

PROOF. When the separable semantics property holds, the weights of the edges among attributes with different semantics are always higher than the weights of the edges among attributes with the same semantics. This implies that the edges in L are divided in two sublists. In the first sublist (lower weights) we have pairs of attributes
erk is correct (erk∗ ): consider eri∗ , erj∗ ∈ A1 and erk∗ ∈ A2 that have the same semantics. We can therefore add to the solu- when the property separable semantics holds. tion all the pairs in the first sublist. In the second sublist (higher By definition: ∀eri∗ , erj∗ ∈ A1 ∃ tA1 : d(V (eri∗ ), V (erj∗ )) < weights) we have pairs of attributes with different semantics and tA1 we need to avoid to add an edge from this sublist to the solution. Separable semantics: ∀A1 , A2 , eri∗ ∈ A1 , erk∗ ∈ A2 : i 6= The problem here is that it is not know a-priori where the second k ∧ A1 6= A2 ⇒ d(V (eri∗ ), V (erk∗ )) > max(tA1 , tA2 ) sublist starts. But we know that when the algorithm gets to the first We can derive: edge of the second sublist, all and only the attributes with same semantics have been grouped in mappings. d(V (eri∗ ), V (erj∗ )) < tA1 ≤ max(tA1 , tA2 ) < Therefore the partial solution is correct. We now need to show that the algorithm stops at the first edge of < d(V (eri∗ ), V (erk∗ )). the second sublist. The first edge in the second sublist is an edge Therefore: between two mappings m1 , m2 with different semantics. When the property intensional redundancy holds, there must be a source d(V (eri∗ ), V (erj∗ )) < d(V (eri∗ ), V (erk∗ )). which publishes two attributes ai , aj such that they are contained in m1 and m2 , respectively. By the local consistency of sources there 2. erk is weak (erkw ): in the following we treat single values cannot be a mapping that contains ai , aj , and therefore the first as singleton vectors that we denote with V [i, . . . , j] the sub- edge of the second sublist is detected and the algorithm ends. vector of values for V from index i (included) to index j (ex- cluded). We first introduce a monotonicity property of the L EMMA A.3. A BSTRACT E XTRACTION is correct. distance function. Given two vectors V1 and V2 with n values and a distance d(V1 , V2 ) between them, let V20 be a copy of P ROOF. 
In any iteration of step (a) we select two correct ex- V2 . If we replace the i-th element V2 [i] with a new element traction rules er1∗ , er2∗ ∈ A1 . This is equivalent to show that if V2 [i]0 such that d(V1 [i], V2 [i]) < d(V1 [i], V2 [i]0 ) it follows we list the pairs of extraction rules in ascending order, the first that d(V1 , V2 ) < d(V1 , V20 ).4 pair is certainly one with correct extraction rules. Suppose, by ab- In this second case erkw is a weak rule, that is, it can po- surd, that the first pair contains a weak rule. This contradicts the tentially contains values taken from A1 , A2 , or any other I NTRAC LOSERT HAN I NTER Lemma. A. We consider the instance-aligned vectors of values Vk0 = Every time two correct extraction rules er1∗ or er2∗ are chosen, all V (erkw ), Vi0 = V (eri∗ ) and Vj0 = V (erj∗ ) and we remove the weak rules containing at least a value in common with er1∗ or from the analysis the instances where erkw , eri∗ , and erj∗ ex- er2∗ are removed (steps (c) and (d)). Therefore, after the algorithm tract the same value: let Vk , Vi and Vj be the vectors with the has chosen all the correct rules, there cannot be a weak rule in W as remaining values. As errw cannot contain only values coming weak rules mix values shared with correct rules and they have been from A1 (otherwise it would not be a weak rule, but a cor- discarded as soon as the correct rules have been identified. rect extraction rule of A1 ) the length of these vectors must be greater than zero, and notice also that Vk0 now does not con- tain any value coming from A1 (they have been all removed). We show now by induction on the length of the vectors that d(V (eri∗ ), V (erj∗ )) < d(V (eri∗ ), V (erkw )). Base case: let Vk [0] be the first value for Vk . We know that it is a correct value for a conceptual attribute different from A1 . Therefore, for the property we just showed in the previous case: d(Vi [0], Vj [0]) < d(Vi [0], Vk [0]). 
Inductive step: the inductive hypothesis is d(Vi [0, . . . , n], Vj [0, . . . , n]) < d(Vi [0, . . . , n], Vk [0, . . . , n]). We show that it is true for n + 1 elements of the vectors. Again, for the property we just showed d(Vi [n + 1], Vj [n + 1]) < d(Vi [n + 1], Vk [n + 1]) holds. For the monotonicity property of the distance function, it is true that d(Vi [0, . . . , n + 1], Vj [0, . . . , n + 1]) < 4 This is a natural property of the Euclidean distance.
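The reduction of indirect distances to all-pairs shortest paths, solved with the Floyd–Warshall algorithm [14], can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attribute names and edge weights are hypothetical, and missing direct distances are modeled as infinite weights.

```python
from itertools import product

def indirect_distances(nodes, direct):
    """All-pairs shortest paths (Floyd-Warshall) over a graph whose
    nodes are physical attributes and whose weighted edges are the
    directly computed distances between their value vectors."""
    INF = float("inf")
    # Initialize the distance matrix with the (symmetric) direct distances.
    d = {(u, v): 0.0 if u == v else direct.get((u, v), direct.get((v, u), INF))
         for u, v in product(nodes, repeat=2)}
    # Cubic relaxation: allow each attribute k as an intermediate hop.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[(i, k)] + d[(k, j)] < d[(i, j)]:
                    d[(i, j)] = d[(i, k)] + d[(k, j)]
    return d

# Hypothetical physical attributes from three sources; s1.price and
# s3.price have no direct distance, but become comparable indirectly
# through s2.price (0.1 + 0.2 = 0.3).
nodes = ["s1.price", "s2.price", "s3.price"]
direct = {("s1.price", "s2.price"): 0.1, ("s2.price", "s3.price"): 0.2}
d = indirect_distances(nodes, direct)
```

The three nested loops make the cost cubic in the number of physical attributes, matching the complexity increase discussed above.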
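The grouping behavior whose correctness Lemma A.2 establishes — scan the edge list L in ascending weight order, merge attribute pairs into mappings, and stop at the first edge that would put two attributes of the same source into one mapping (local consistency) — can be sketched as follows. This is a hedged illustration, not the paper's ABSTRACTINTEGRATION implementation; the attribute names, sources, and weights are hypothetical.

```python
def integrate(edges, source_of):
    """Greedy grouping of attributes into mappings.
    edges: list of (weight, attr_a, attr_b); source_of: attr -> source id.
    Scans edges by ascending weight and halts at the first edge whose
    merge would violate local consistency (two attributes from the same
    source), i.e., the first edge of the higher-weight sublist."""
    mappings = {a: {a} for _, x, y in edges for a in (x, y)}
    for weight, a, b in sorted(edges):
        ma, mb = mappings[a], mappings[b]
        if ma is mb:
            continue  # already in the same mapping
        merged = ma | mb
        sources = [source_of[x] for x in merged]
        if len(sources) != len(set(sources)):
            break  # first cross-semantics edge: stop
        for x in merged:
            mappings[x] = merged
    return {frozenset(m) for m in mappings.values()}

# Hypothetical attributes from two sources: the two name attributes and the
# two price attributes are merged; the heavy name/price edge triggers the stop.
source_of = {"s1.name": "s1", "s2.name": "s2",
             "s1.price": "s1", "s2.price": "s2"}
edges = [(0.9, "s1.name", "s2.price"),
         (0.1, "s1.name", "s2.name"),
         (0.15, "s1.price", "s2.price")]
result = integrate(edges, source_of)
```

Under the separable semantics and intensional redundancy assumptions of the lemma, the stopping edge always separates the same-semantics sublist from the rest, which is exactly the argument of the proof.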