Noah: Creating Data Integration Pipelines over Continuously Extracted Web Data

Valerio Cetorelli (Università Roma Tre, valerio.cetorelli@uniroma3.it)
Valter Crescenzi (Università Roma Tre, valter.crescenzi@uniroma3.it)
Paolo Merialdo (Università Roma Tre, paolo.merialdo@uniroma3.it)
Roger Voyat (Università Roma Tre, roger.voyat@uniroma3.it)

ABSTRACT
We present Noah, an ongoing research project aiming at developing a system for semi-automatically creating end-to-end Web data processing pipelines. The pipelines continuously extract and integrate information from multiple sites by leveraging the redundancy of the data published on the Web. The system is based on a novel hybrid human-machine learning approach in which the same type of questions can be interchangeably posed both to human crowd workers and to automatic responders based on machine learning (ML) models. From the early stages of the pipelines, crowd workers are engaged to guarantee the output data quality and to collect training data, which are then used to progressively train and evaluate automatic responders. Automatic responders are then fully deployed into the data processing pipelines to scale the approach and to contain the crowdsourcing costs. The combination of guaranteed quality and progressive cost reduction in the pipelines generated by our system can improve the investments and development processes of the many applications that build on the availability of such data processing pipelines.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23-26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION AND MOTIVATION
The Web is the largest knowledge base ever built by humans. However, most of the data on the Web are not directly available to applications unless complex data extraction and integration pipelines are set up. Creating these pipelines to build structured knowledge bases, and continuously maintaining them in a cost-effective way, is still a challenging problem. Currently, most projects fulfill their data processing needs by means of case-by-case solutions that cannot be reused across projects.

This paper presents Noah, a research project that aims at developing a system for creating, and maintaining over time, end-to-end data processing pipelines for continuously extracting and integrating Web data. Noah is based on a hybrid human-machine learning approach whose goal is to guarantee the quality of the processed data by leveraging the feedback provided by human crowd workers. Our approach can be classified in the realm of Open Information Extraction [31], because it extracts and integrates information both at the instance (objects) and at the schema (attributes) level into an internal knowledge base (IKB) that is created, populated, and maintained for every domain. Indeed, if new sources are incrementally added to an already generated pipeline, the system is able to discover new entities and new attributes from those sources.

In order to contain the crowdsourcing costs, the proposed approach leverages two techniques. First, it exploits the inherent redundancy of Web sources to automatically find correct domain information: data published by several independent sources are more likely to be correct and can be easily discerned from noisy or non-relevant data [8, 15]. Second, it exploits the collected data to continuously train ML models. These ML models are progressively introduced in the form of automatic responders that replace crowd workers [1, 30], and they are continuously evaluated during each step of the data processing pipelines: only responders that become sufficiently reliable are fully deployed in the operations of the created pipelines.

Figure 1: Web detail pages in the Smartphone domain.

Problem Description. Given a set of sources S = {S_1, S_2, ...} from the same domain (e.g., Smartphones), each source S_i is specified by means of n_i URLs of detail pages about domain objects (e.g., IPhone 12, Mi 10T). By detail page we mean a page reporting information about a particular object, the topic entity [29] of the page, for which it publishes the values of several attributes. An example of detail pages from two sources about the same IPhone 12 domain object is shown in Figure 1, where the values of several attributes of interest, such as Model, Memory, and Price, are highlighted.
A domain includes a set of objects O = {o_1, o_2, ...} and a set of attributes A = {A_1, A_2, ...}, which will be populated with data extracted from the pages of the sources belonging to that domain. New attributes and new objects of a domain can be discovered as new sources are considered part of the domain.

Each source publishes detail pages reporting the values of a subset of the domain attributes, for a subset of the domain objects. We use the terms source attributes and source objects when we want to denote the version of a domain attribute or object as published by a source, i.e., when we refer to the occurrences of attribute values about an object as published by a source. It is worth noticing that some domain attributes can be published, possibly with inconsistencies amongst the provided values, by several sources (e.g., Model), while other attributes (e.g., ReviewScore or Price) have values which are inherently source-specific.

In the following, we identify a source object by means of the URL of the detail page hosting its data, and we identify a source attribute by means of an identifier, unique within the domain, of the extraction rule that is capable of locating its value in the detail page. By extraction rule we mean a function extracting at most one value from a detail page; the formalism in which it is specified (e.g., XPath expressions) does not matter.

Our goal is that of continuously extracting data of a guaranteed level of quality from the detail pages belonging to the sources; the data are reorganized into the IKB while minimizing the overall costs. As a measure of data quality, we will use standard measures such as precision, recall, and F-measure over the integrated data [23]. As a measure of cost, the goal is that of minimizing the crowdsourcing costs [5, 27].

In the IKB the following information will be available: (linkages and matches) how the source attributes and objects are respectively mapped to the domain attributes and objects; (value provenance) the source attribute values for every object in the domain.

The problem we want to solve is that of continuously creating K_t, that is, an IKB at every time t in which the snapshots of the detail pages from every source in a domain D are gathered. We illustrate the problem definition by means of the running example shown in Figure 2.

Figure 2: Running Example. The Smartphones domain includes 2 sources crawled at n instants. Over each source, 6 correct extraction rules working on several detail pages are given: r_j^i (j = 1, ..., 6) denotes the j-th rule working on source S_i, each extracting the value of a source attribute from a detail page associated with a source object. For example, p_3^1 indicates the page about IPhone 11 from source S_1, and rule r_1^1 extracts the Model from every page of the same source. At every time t, the values extracted from the two sources are conveniently depicted as organized in tables: each row of a table is associated with a detail page of the source, and each column is associated with an extraction rule over the same source. The set of domain attributes includes: Model, Brand, Price, Memory, Camera 1, Camera 2. Correct linkages can be represented as pairs of pages about the same domain objects: {(p_1^1, p_1^2), (p_m^1, p_4^2)}. Correct source attribute matches can be represented as pairs of correct extraction rules: {(r_1^1, r_1^2), (r_2^1, r_2^2), (r_4^1, r_3^2), (r_5^1, r_4^2), (r_6^1, r_5^2)}.
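To make the content of the IKB concrete, the following is a minimal sketch, in Python, of the three kinds of information listed above, instantiated on the running example of Figure 2. All class and field names (SourceObject, rule_id, the example URLs) are our own illustrative assumptions rather than part of the Noah design.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SourceObject:
    url: str       # a source object is identified by its detail-page URL

@dataclass(frozen=True)
class SourceAttribute:
    rule_id: str   # a source attribute is identified by its extraction rule, e.g. "r_5^1"

@dataclass
class IKB:
    # linkages: source objects grouped under the domain object they describe
    linkages: dict[str, set[SourceObject]] = field(default_factory=dict)
    # matches: source attributes grouped under the domain attribute they publish
    matches: dict[str, set[SourceAttribute]] = field(default_factory=dict)
    # value provenance: (domain object, domain attribute, rule id, time t) -> value
    provenance: dict[tuple[str, str, str, int], str] = field(default_factory=dict)

ikb = IKB()
ikb.linkages["MI 10"] = {SourceObject("https://s1.example/mi-10"),
                         SourceObject("https://s2.example/mi10")}
ikb.matches["Camera 1"] = {SourceAttribute("r_5^1"), SourceAttribute("r_4^2")}
ikb.provenance[("MI 10", "Camera 1", "r_5^1", 0)] = "108MP"
```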
The focus of our research project covers the three problems that we believe are at the core of any Web data integration pipeline: extraction, matching, and linkage. It includes neither the source discovery problem and the automatic synthesis of crawling programs, on the one hand, nor the data fusion problem, on the other.

Our solution can help the several projects that need to set up, and maintain over time, Web data processing pipelines, but that require a guaranteed quality of the pipelines' output data to be business meaningful.

Clearly, the amount of work outsourced to crowd workers to guarantee the quality level largely depends on the inherent characteristics of the domain: domains containing static attributes that are largely redundant from source to source can dramatically simplify domain data detection, extraction, and schema matching; an attribute working as a soft identifier across several sources (e.g., books' ISBN) can contribute significantly to reducing the cost of the record linkage task for a domain.

Unfortunately, it turns out that many interesting domains (e.g., job postings, real estate, ...) do not exhibit such redundancy, and the type of redundancy that the system has to exploit is at an intensional level, i.e., type and format of values, range of values, labels of extracted data. Generally speaking, separating domain data from other information becomes largely dependent on the context in which the attributes are proposed, and on the availability of human feedback to check the correctness of the proposed hypotheses.

2 SCOPE, OPPORTUNITIES, CHALLENGES
Building and maintaining effective data processing pipelines over Web data is a challenging problem for several reasons. First, Web sources are autonomous and remote: they can unpredictably change and thereby break all the extraction rules created on previous versions of the same source. Second, setting up an integration pipeline requires solving many interrelated tasks, each of which has motivated a flurry of research works, including: source discovery, data extraction, schema matching, record linkage, data fusion, data labeling, and data cleaning. Each of these problems has been extensively studied over the last decades, with tens, if not hundreds in some cases, of well-recognized research works [6, 13, 19, 34, 39].

Despite the fact that many of the problems that need to be tackled to create our pipelines have already been extensively covered in the research literature, we believe that semi-automating the creation of Web data processing pipelines can still be considered a relevant problem [10].

We argue that if the costs and the guaranteed level of quality [17] are explicitly considered, many projects relying on data processing pipelines can be re-conducted into a much more controllable investment and validation process, and their overall feasibility can be significantly improved, because many business projects are strongly and directly affected by the cost of creating and maintaining the underlying Web data processing pipelines.

Moreover, we believe that by posing the same type of queries to human and automatic responders, the two become interchangeable enough to motivate the study of new deployment methodologies for Web data processing pipelines. The goal of such methodologies is to progressively lower the crowdsourcing costs by means of machine-learning techniques while keeping the output quality level under control since the early stages of the deployed pipelines. Indeed, many development projects often experience unpredictable and erratic time-to-market (TTM) and return-on-investment (ROI) because, especially in the early stages, they adopt ML algorithms but lack the amount and quality of training data, and the validation, needed to guarantee the desired output quality.
Redundancy as OpenIE Enabler
Redundancy plays a fundamental role in our system to keep the crowdsourcing costs at reasonable levels. Whenever the redundancy of data across sources is properly detected and exploited, domain data can be discerned from other noisy or out-of-domain information. For example, WEIR [4] assumes that the linkages between the collections of pages from two sources are already known as part of the input, and then exploits the redundancy of distinct and independent sources that publish information about the same objects and attributes to automatically find correct extraction rules and schema matches.

Noah aims at escalating to the largest possible extent the use of redundancy for extracting and integrating Web data, as pioneered by WEIR. It will exploit at least the following forms of redundancy (a sketch of simple overlap measures follows this list):

Intensional: several sources publish the same domain attributes.
Extensional: several sources publish information about the same domain objects.
Temporal: a source publishes data about the same domain objects and attributes over time.
Intra-source: a source can publish data about the same objects in pages of distinct types, e.g., a result page containing snippets of records with the most relevant attributes plus links to detail pages containing all the attributes [21].
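As a first approximation, the extensional and intensional redundancy between two sources could be estimated with simple set-overlap measures, as in the toy sketch below. The object and attribute sets are invented for illustration, and a real measurement would have to work with uncertain linkages and matches rather than with the gold sets assumed here.

```python
# Toy set-overlap proxies for extensional and intensional redundancy
# between two sources; the gold object/attribute sets are assumed given.

def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two sets; 0 when both are empty."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# Illustrative data loosely inspired by the running example (Figure 2).
objects_s1 = {"IPhone 12", "IPhone 11", "MI 10", "MI 10 PRO"}
objects_s2 = {"IPhone 12", "MI 10", "MI 10T"}
attrs_s1 = {"Model", "Brand", "Price", "Memory", "Camera 1", "Camera 2"}
attrs_s2 = {"Model", "Brand", "Memory", "Camera 1", "Camera 2"}

extensional = jaccard(objects_s1, objects_s2)   # shared domain objects
intensional = jaccard(attrs_s1, attrs_s2)       # shared domain attributes
print(f"extensional={extensional:.2f}, intensional={intensional:.2f}")
```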
At the same time, and with the help of human feedback, Noah aims at overcoming WEIR's limitations by relaxing its rather strict underlying assumptions on the input domain: WEIR requires that enough intensional and extensional redundancy is available to discern all domain data from all other information.

WEIR and Noah fall in the realm of the OpenIE approaches [3, 4, 16, 29, 33, 37]: unlike the ClosedIE approaches [18, 20, 25, 28], where the managed knowledge base does not grow in terms of subjects and predicates but only in terms of values, new schema information, e.g., new domain attributes, can be progressively discovered while populating the knowledge base with entities and values of the schema already known.

There are two main differences between Noah and other OpenIE [29] systems: first, we do not require a pre-populated knowledge base, as we start from an empty IKB and populate it as new sources enter the domain; second, we aim at continuously extracting and integrating data [11], as we believe that the temporal setting is important both for business reasons (many projects need a continuous stream of data rather than snapshots) and for bringing into the main problem definition the maintenance costs of the generated pipelines over time, costs that are largely neglected in many research proposals [29].

3 NOAH SYSTEM AND PIPELINES
The Noah system supports the semi-automatic generation of end-to-end Web data processing pipelines over several domains. Figure 3 shows how the system can generate and operate many pipelines at the same time, each having an IKB that is progressively and continuously populated with data coming from the sources of the domain on which it operates. Our system will interact with external systems by means of two major components: the Crawler, which continuously downloads snapshots of the pages from every source with a frequency specified by a cron expression; and the Crowd Manager, which manages the interactions with a crowdsourcing platform.

Figure 3: Overview of the Noah system and of the pipelines it creates.

During operations, Noah will generate pipeline queries for the responders engaged through the crowdsourcing platform. The responders will contribute to solving the system tasks needed to set up and maintain new pipelines: for example, tasks are needed to select the initial extraction rules over every domain source, to select and label the source attributes, to find the linkages between source objects and a common mediated domain object, and to match the source attributes across several sources to a mediated domain attribute.
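The notion of interchangeable responders can be captured by a single query-answering interface implemented both by crowd workers and by ML models, as in the following sketch; the interface and class names are our own assumptions, not the Noah API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class PipelineQuery:
    text: str        # e.g. "Is '1050 $' a Price?"
    evidence: dict   # extracted values, screenshots, aligned records, ...

class Responder(ABC):
    """The same boolean query can be answered by a human or by a model."""
    @abstractmethod
    def answer(self, query: PipelineQuery) -> bool: ...

class HumanResponder(Responder):
    def answer(self, query: PipelineQuery) -> bool:
        # In a real system this would post the query to the crowdsourcing
        # platform via the Crowd Manager and wait for the worker's answer.
        raise NotImplementedError("posted to the crowd platform")

class MLResponder(Responder):
    def __init__(self, model):
        self.model = model   # any classifier trained on collected answers
    def answer(self, query: PipelineQuery) -> bool:
        return bool(self.model.predict(query))
```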
System Tasks
The main system tasks that need to be tackled to set up a Noah pipeline are shown in Figure 4: Page Linkage, Data Extraction, Schema Matching, and Object Linkage.

Page Linkage aims at obtaining a first approximate set of top-k page linkages. Two pages are in a linkage if they both publish data related to the same domain object (a minimal sketch of how such approximate linkages could be bootstrapped is given after Figure 4 below).

Example 3.1 (Page Linkage). In Figure 2 we can see two possible page linkages at time t_n: {(p_1^1, p_1^2), (p_m^1, p_4^2)}. Their distances, i.e., 0.09 and 0.12, are shown at the top of Figure 5a.

Data Extraction aims at finding all the correct extraction rules. It generates all the possible extraction rules and discovers the correct ones by exploiting the redundancy of the data published across several independent sources [4], when available, while querying the responders [7] to confirm uncertain hypotheses.

Schema Matching aims at finding matches between extraction rules by exploiting an instance-based distance measure between them. The instance-based distance between two extraction rules assumes the availability of correct object linkages to align the source objects related to the same domain object, as produced in output by the next system task: the distance is obtained by averaging the distance between the extracted values over all the aligned detail pages.

Example 3.2 (Schema Matching). Consider sources S_1 and S_2 at time t_n and the set of page linkages {(p_1^1, p_1^2), (p_m^1, p_4^2)} in Figure 2: possible matches are {(r_1^1, r_1^2), (r_4^1, r_3^2)}. Their distances, i.e., 0.19 and 0.22, are shown at the top of Figure 5c.

Object Linkage aims at finding linkages between source objects by exploiting a pairwise attribute distance measure between them. The pairwise attribute distance between two source objects assumes the availability of correct schema matches across the extraction rules to align the source attributes related to the same domain attribute, as produced in output by the previous system task: the distance is obtained by averaging the distance between the two values over all the matching attributes.

We name the resulting loop of system tasks the Linkage/Matching Duality; we further discuss it in Section 3.1.

Figure 4: Running example (pipeline with queries): the system tasks and the queries generated for hybrid human-machine responders.
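As referenced above, the following is a minimal sketch of how the first approximate top-k page linkages could be bootstrapped, assuming a simple bag-of-words cosine distance over the page texts; the actual distance used by Noah is not specified here, and any cheap textual distance would do for this initial, approximate step.

```python
from collections import Counter
from itertools import product
import math

def bow_distance(text_a: str, text_b: str) -> float:
    """1 - cosine similarity between bag-of-words vectors of two pages."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def top_k_linkages(pages_s1: dict, pages_s2: dict, k: int):
    """Return the k cross-source page pairs with the smallest distance.

    pages_s1, pages_s2: page id -> page text, for the two sources.
    """
    scored = [(bow_distance(t1, t2), p1, p2)
              for (p1, t1), (p2, t2) in product(pages_s1.items(),
                                                pages_s2.items())]
    return sorted(scored)[:k]
```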
One matic responders based on ML models that have been trained of the Noah goal is that of exploiting IKB, which is progressively with the data collected while operating the Noah pipeline (see built, also to make the crowdsourcing queries as simple as pos- Section 4). sible. For example, a query to check a record linkage exploits An example of the queries posed to the responders for every the schema matching already computed to make the two records system task is shown in Figure 4: Page Linkage, Data Extraction, easy to be visually compared. Schema Matching and Object Linkage. 3.1 Linkage / Matching Duality Example 3.3 (Data Extraction Query). Figure 4 shows an ex- Figure 4 shows that two important integration tasks operated by ample of query for Data Extraction tasks. The uncertainty of an Noah pipelines, i.e., Schema Matching and Object Linkage, are extraction rule generated by wrapper inference can be validated part of a loop in which each one assumes the availability of the by checking the extracted value on a detail page by means of a output of the other to solve its own task. Page Linkage is the query such as: "Is ’1050$’ a Price?", where Price is a candidate system task outside the loop needed for its initial triggering. label for the extraction rule and ’1050 $’ is the extracted value. We assume available two normalized distance functions pro- Example 3.4 (Schema Matching Query). Figure 4 shows that viding a value between 0 and 1 when comparing two rules, and schema matching tasks can be solved by means of queries con- two source objects (records), respectively: the instance-based dis- firming or refuting a single match: ’Do "108MP" and "20MP" tance and the pairwise attribute distance. The former compares refer to the same attribute of object "MI 10"?’. The template of two rules over the values they extract from a set of detail pages the query to support a schema matching task has been filled up which have been previously aligned, i.e., their linkages are fixed. (a) Linkages Distances (b) Page Linkages over 2 Attributes (c) Matches Distances (d) Matches over 2 Linkages Figure 5: Running example (Distance Similarity): 5a and 5c show distances in Pyramids; 5b and 5d expose relations in Cartesian Plane where ’Uncertainties’ are due to the breaking of LC with Non-separable Domain The latter compares two source objects over the values of some with source objects associated with a different domain of their attributes which have been previously aligned, i.e., their object. For computing the pairwise attribute distance, the matches are fixed. source attribute matches are fixed and already known. Example 3.7 (Normalized Distance Functions). Instance-based For domains in which such properties hold, the WEIR system distance: let (π‘π‘š 1 , 𝑝 2 ) and (𝑝 1 , 𝑝 2 ) be two given correct linkages 4 1 1 is able to match the extraction rules and build their mappings into for the detail pages associated with IPhone 12 and MI 10 source cluster of source attributes related to the same domain attribute object from source 𝑆 1 and 𝑆 2 as shown in Figure 5d. The distance by comparing all the similarity distances, while at the same time, between the rules (π‘Ÿ 51, π‘Ÿ 42 ) can be computed as follows: 𝑑 (π‘Ÿ 51, π‘Ÿ 42 ) = it can separate the correct extraction rules from noisy ones. The 𝑑 (π‘Ÿ 51 (𝑝 11 ), π‘Ÿ 42 (𝑝 12 )) + 𝑑 (π‘Ÿ 51 (π‘π‘š 1 ), π‘Ÿ 2 (𝑝 2 )) = 𝑑 ( β€˜108MP’, β€˜108MP’) + 4 4 idea is pretty simple and depicted in Figure 5: DS suggests to 𝑑 ( β€˜12MP’, β€˜14MP’) = 2.9. 
Example 3.7 (Normalized Distance Functions). Instance-based distance: let (p_m^1, p_4^2) and (p_1^1, p_1^2) be the two given correct linkages for the detail pages associated with the IPhone 12 and MI 10 source objects of sources S_1 and S_2, as shown in Figure 5d. The distance between the rules (r_5^1, r_4^2) can be computed as follows: d(r_5^1, r_4^2) = d(r_5^1(p_1^1), r_4^2(p_1^2)) + d(r_5^1(p_m^1), r_4^2(p_4^2)) = d('108MP', '108MP') + d('12MP', '14MP') = 2.9. The normalized distance in the range [0, 1] is 0.27.

Pairwise attribute distance: let (r_2^1, r_2^2) and (r_1^1, r_1^2) be two given correct matches for the Brand and Model attributes (see Figure 5b). The distance between the two source objects about MI 10 PRO and MI 10T can be computed as follows: d(o_2^1, o_2^2) = d(r_2^1(p_2^1), r_2^2(p_2^2)) + d(r_1^1(p_2^1), r_1^2(p_2^2)) = d('XIAOMI', 'XIAOMI') + d('MI 10 PRO', 'MI 10T') = 3.2. The normalized distance in the range [0, 1] is 0.27.

Figure 5: Running example (Distance Similarity): (a) Linkages Distances, (b) Page Linkages over 2 Attributes, (c) Matches Distances, (d) Matches over 2 Linkages. 5a and 5c show distances as pyramids; 5b and 5d expose the relations in the Cartesian plane, where the 'uncertainties' are due to the breaking of LC in a non-separable domain.
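A toy implementation of the two dual distance functions may clarify their symmetry. Here a normalized edit distance plays the role of the value-level distance d purely for illustration (the example above leaves the underlying value distance unspecified), and the sketch averages the value distances, per the definitions given earlier, rather than reporting the raw sums of Example 3.7.

```python
def value_distance(v1: str, v2: str) -> float:
    """Normalized Levenshtein distance in [0, 1]; an illustrative choice."""
    if not v1 and not v2:
        return 0.0
    prev = list(range(len(v2) + 1))
    for i, c1 in enumerate(v1, 1):
        curr = [i]
        for j, c2 in enumerate(v2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1] / max(len(v1), len(v2))

def instance_based_distance(rule1, rule2, linkages):
    """Average distance of the values the two rules extract from linked pages."""
    pairs = [(rule1(p1), rule2(p2)) for p1, p2 in linkages]
    return sum(value_distance(a, b) for a, b in pairs) / len(pairs)

def pairwise_attribute_distance(obj1, obj2, matches):
    """Average distance of the two objects' values over matched attributes."""
    pairs = [(obj1[a1], obj2[a2]) for a1, a2 in matches]
    return sum(value_distance(a, b) for a, b in pairs) / len(pairs)
```

Rules can be any callables from pages to values, e.g., `dict.get` over pre-extracted values, and objects any mappings from attribute ids to values.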
We revisit and propose an extension of two domain properties, called Local Consistency and Separable Domain, underlying the formal approach presented in WEIR [4] for solving the extraction and matching problems when the page linkage is given as input. Our ambition is twofold: on the one side, we aim to extend that approach to cover the whole trio of extraction, matching, and linkage problems at the core of Noah pipelines; on the other side, we want to relax the underlying assumptions by means of the feedback provided by human crowd workers, thus making the approach adaptable to domains with more disparate characteristics than those originally covered by the WEIR project. Here we briefly recall the two properties and sketch how we plan to extend them.

Local Consistency (LC): in a source there cannot be two distinct source attributes that refer to the same domain attribute. The dual property that we additionally assume is that two distinct detail pages from the same source cannot publish data about the same domain object.

Separable Domain (SD): in a mapping composed of several extraction rules, each from a distinct source and associated with the same domain attribute, the instance-based distances between the rules of the mapping are always smaller than the distances to rules associated with a different domain attribute. For computing the instance-based distance, the object linkages are fixed and already known. The dual property that we additionally assume is that in a linkage composed of several source objects from distinct sources and related to the same domain object, the pairwise attribute distances are always smaller than the distances to source objects associated with a different domain object. For computing the pairwise attribute distance, the source attribute matches are fixed and already known.

For domains in which such properties hold, the WEIR system is able to match the extraction rules and to build their mappings into clusters of source attributes related to the same domain attribute by comparing all the distances, while at the same time separating the correct extraction rules from the noisy ones. The idea is pretty simple and is depicted in Figure 5: the SD property suggests sorting the set of all possible matches (pairs of extraction rules) by the instance-based distance, leveraging the alignment of the detail pages (see Figure 5c). The pairs are then processed in order of increasing distance: every pair of rules is merged into the same mapping as long as the addition does not lead to a violation of the LC property, i.e., two rules (source attributes) from the same source ending up in the same output mapping (see Figure 5d). For domains with sufficiently overlapping sources, WEIR can thus automatically find the correct extraction rules and their matches with rules over other sources, provided that the correct linkages between detail pages are known.

The dual algorithm solves the problem of finding correct object linkages, provided that correct schema matches between source attributes are given, as depicted in Figure 5: the SD property suggests sorting the set of all possible linkages (pairs of source objects) by the pairwise attribute distance (see Figure 5a). The pairs are then processed in order of increasing distance: every pair of source objects is merged into the same linkage as long as the addition does not lead to a violation of the LC property, i.e., two source objects from the same source ending up in the same output linkage (see Figure 5b).

This algorithm exploits the duality of the matching and linkage problems in this setting, and it is at the core of the integration engine of the Noah project. However, differently from WEIR, it does not halt the integration as soon as an LC violation is detected: rather, it generates pipeline queries to confirm the choice, and it continues processing the pairs in increasing order of distance as long as the distance stays below a threshold beyond which no further matches/linkages are expected with meaningful distance functions. (A minimal sketch of this sorted-merge step follows.)
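The sorted-merge step common to the two dual algorithms can be sketched as below. Items are extraction rules (for matching) or source objects (for linkage); source_of maps an item to its source and, unlike in WEIR, an LC violation produces a pipeline query instead of halting. Linkages or matches already confirmed by responders could be frozen simply by pre-merging them before the loop. The function and parameter names are our own, and this is a sketch of the idea, not the Noah implementation.

```python
def greedy_merge(pairs, source_of, threshold):
    """Merge items into clusters by increasing distance under the LC check.

    pairs: iterable of (distance, item_a, item_b).
    Returns the union-find parent map and the queries raised on LC violations.
    """
    parent = {}
    sources = {}                      # cluster root -> set of source ids

    def find(x):
        parent.setdefault(x, x)
        sources.setdefault(x, {source_of(x)})
        while parent[x] != x:         # path halving
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    queries = []
    for dist, a, b in sorted(pairs):
        if dist > threshold:          # no meaningful merges beyond this point
            break
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        if sources[ra] & sources[rb]:
            # LC violation: instead of halting, ask a responder to confirm.
            queries.append(("confirm?", dist, a, b))
        else:
            parent[rb] = ra
            sources[ra] |= sources[rb]
    return parent, queries
```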
Unfortunately, as also recognized in WEIR [4], some domains have sources and attributes with very similar but semantically different values (e.g., the resolutions of the front/rear cameras in Figure 2). This situation easily leads to violations of the LC and SD assumptions, and finding the mappings is a challenging problem for many interesting domains.

Example 3.8 (Non-separable Domain for Schema Matching). In Figure 2, sources S_1 and S_2 both have extraction rules ((r_5^1, r_6^1) and (r_4^2, r_5^2), respectively) at a low distance (Figure 5c), because camera resolutions (e.g., 1-front and 2-back) typically fall within a small range of values expressed in megapixels (MP). Figure 5d shows that the pair of rules (r_5^1, r_6^1) at distance 0.25 violates the LC and SD assumptions, because their distance is smaller than the distance of (r_5^1, r_4^2), which is 0.27.

Actually, it is well known that the dual Record Linkage problem is even more challenging than Schema Matching itself: the attributes carrying the correct signals for considering two objects equivalent can change from object to object, even within the same source (think of smartphones of different brands, with different policies for naming the models and differentiating the features of each model). Assuming that no object in the domain leads to a separability violation is quite unrealistic beyond toy cases.

Example 3.9 (Non-separable Domain for Object Linkage). In Figure 5b the linkage (p_1^1, p_1^2) is uncertain due to the presence of p_2^2. The two values ('MI 10' vs 'MI 10T') extracted by rule r_1^2 from pages p_1^2 and p_2^2 differ by a single letter: the wrong linkage (p_1^2, p_2^2), which violates the LC property, has a pairwise attribute distance of only 0.09, smaller than the distance of the correct linkage (p_1^1, p_1^2); therefore the domain is not separable.

We believe that the violations of the LC and SD assumptions can be manually fixed, and that they help find the most informative pipeline queries to be posed to external responders, i.e., paid crowd workers or suitably trained automatic responders. By interleaving the dual linkage/matching algorithms in a loop to which external responders can contribute, as shown in Figure 4, each execution can improve the accuracy of the distance function used by the other task, either by improving the linkages used by the instance-based distance or by improving the matches used by the pairwise attribute distance.

Our vision is that, with the precious help of crowdsourcing and a loop of interleaved linkage/matching operations, the desired target quality can be reached even in the presence of non-separable domains: responders will be engaged to assess the quality of the output and to repair the uncertain choices made by the integration algorithm. The linkages and matches confirmed by human feedback can be frozen and exploited in the following iterations, progressively solving, and hence removing from the domain, the linkages and matches that made it inseparable.
4 RESEARCH DIRECTIONS
In the early stages of its life, the IKB K of a new Noah pipeline might be scarcely populated. As redundancy builds up over time with the addition of new sources feeding the IKB, the accuracy of the extraction and integration process increases.

The absence of overlap between the objects and attributes published by a rather limited set of sources could limit the amount of available redundancy. In this situation, to operate the pipeline, Noah would end up generating a lot of queries supporting the system tasks. As an alternative solution, Noah supports the incremental addition of a source into an existing pipeline. A new source might contribute to lowering the overall costs if it significantly overlaps with the sources already available for the domain [14]. On the contrary, to integrate new sources publishing new objects or new attributes, additional costs might be incurred to support the integration with the existing IKB.

We are interested in studying ML techniques that could decrease the crowdsourcing costs even in the absence of redundancy. The main research area is that of synthesizing automatic responders capable of answering the same type of pipeline queries that are normally posed to human responders for solving Noah tasks, with the goal of progressively replacing human responders [7] and scaling the approach up to many thousands of sources.

Unfortunately, state-of-the-art unsupervised ML techniques [40, 42] can be adapted to provide accurate and reliable answers to those queries only if enough training data have been collected. Indeed, fairness and bias, or simply the misuse of machine learning algorithms, is a well-known problem in the literature [12, 32] that affects many development projects, especially in the scenarios most commonly found in practice [38]: pre-trained ML models and/or enough training data are not available up-front, so the ML models cannot be properly tuned and exhibit erratic and unpredictable performance [41].

Snorkel [36] is another project exploiting the idea of leveraging human work to train ML algorithms. However, it is based on the idea of engaging skilled workers in every step of the processing pipeline, while Noah aims at engaging non-skilled workers, to whom queries can be posed interchangeably in the same form as those posed to automatic responders. Several other projects, such as QOCO [2] and SEER [22], have made use of crowdsourcing by mainly focusing on the problem of selecting the correct extraction rules, while Noah applies the same query control methodology to all the tasks in the considered pipelines.

It is also well known that using automatic responders that are not accurate enough may turn out to be more expensive than not using them at all, as additional human workers should then be engaged just to offset their wrong answers [7].

We envision a system in which crowd workers are used to indirectly control the deployment of automatic responders, and the two types of responders are interchangeably engaged. Crowd workers contribute to collecting the domain data that are then used to train and evaluate the automatic responders before fully deploying them. Automatic responders will progressively replace crowd workers to scale the approach and to lower the operating costs, but only after enough evidence that their accuracy does not compromise the overall guaranteed quality of the output data. At regime, crowd workers will be minimally engaged, only to keep monitoring the performance of the automatic responders (a sketch of such an accuracy-gated deployment follows the list of challenges below).

We have identified several novel research challenges:
• formalizing and proving the correctness of an algorithm that solves the full trio of extraction, matching, and linkage tasks;
• creating and maintaining over time continuous Web data processing pipelines at low costs, with guaranteed output quality;
• designing several independent automatic responders based on ML models that are capable of answering the queries normally posed to crowd workers;
• effectively measuring the available redundancy in a domain;
• estimating, from the characteristics of a domain, the crowdsourcing costs necessary to obtain and maintain the desired output quality.
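As referenced above, the following toy sketch shows one way the accuracy-gated deployment of an automatic responder could work: the responder answers autonomously only once its accuracy, estimated against crowd answers, exceeds the quality target, and a small audit fraction keeps monitoring it at regime. All thresholds and names are invented for illustration.

```python
import random

def route_query(query, ml_responder, crowd_responder, stats,
                accuracy_target=0.95, audit_rate=0.05):
    """Answer a pipeline query, gating the ML responder on its accuracy.

    stats: mutable dict with "total" and "correct" counts of audited answers.
    """
    def audit():
        # The crowd answer is taken as ground truth and the ML answer is
        # only used to update the responder's accuracy estimate.
        truth = crowd_responder(query)
        stats["total"] += 1
        stats["correct"] += (ml_responder(query) == truth)
        return truth

    deployed = stats["total"] > 100 and \
               stats["correct"] / stats["total"] >= accuracy_target
    if not deployed:
        return audit()                 # training/evaluation phase
    if random.random() < audit_rate:
        return audit()                 # at regime: occasional monitoring
    return ml_responder(query)         # trusted autonomous answer
```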
Knowledge and Information https://doi.org/10.1145/3035918.3054776 [28] Colin Lockard, Xin Luna Dong, Arash Einolghozati, and Prashant Shiralkar. Systems 53, 1 (2017), 1–41. 2018. Ceres: Distantly supervised relation extraction from the semi-structured [6] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. 2002. RoadRunner: web. arXiv preprint arXiv:1804.04635 (2018). automatic data extraction from data-intensive web sites. In Proceedings of the [29] Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. Openceres: When 2002 ACM SIGMOD international conference on Management of data. 624–624. open information extraction meets the semi-structured web. In Proceedings [7] Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. 2019. Hybrid Crowd- of the 2019 Conference of the North American Chapter of the Association for Machine Wrapper Inference. ACM Transactions on Knowledge Discovery from Computational Linguistics: Human Language Technologies, Volume 1 (Long and Data (TKDD) 13, 5 (2019), 1–43. Short Papers). 3047–3056. [8] Nilesh Dalvi, Ashwin Machanavajjhala, and Bo Pang. 2012. An analysis of [30] Adam Marcus and Aditya Parameswaran. 2015. Crowdsourced data man- structured data on the web. arXiv preprint arXiv:1203.6406 (2012). agement: Industry and academic perspectives. Foundations and Trends in [9] AnHai Doan. 2018. Human-in-the-Loop Data Analysis: A Personal Perspec- Databases 6, 1-2 (2015), 1–161. tive. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics [31] Mausam Mausam. 2016. Open information extraction systems and downstream (HILDA’18). Association for Computing Machinery, New York, NY, USA, Arti- applications. In Proceedings of the twenty-fifth international joint conference on cle 1, 6 pages. https://doi.org/10.1145/3209900.3209913 artificial intelligence. 4074–4077. [10] AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Sanjib Das, Yash Govind, [32] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Pradap Konda, Han Li, Erik Paulson, Paul Suganthan G. C., and Haojun Aram Galstyan. 2019. A survey on bias and fairness in machine learning. Zhang. 2017. Toward a System Building Agenda for Data Integration. arXiv preprint arXiv:1908.09635 (2019). arXiv:cs.DB/1710.00027 [33] Christina Niklaus, Matthias Cetto, AndrΓ© Freitas, and Siegfried Handschuh. [11] AnHai Doan, Alon Halevy, and Zachary Ives. 2012. Principles of data integra- 2018. A survey on open information extraction. arXiv preprint arXiv:1806.05599 tion. Elsevier. (2018). [12] Pedro Domingos. 2012. A Few Useful Things to Know about Machine Learning. [34] Erhard Rahm and Philip Bernstein. 2001. A Survey of Approaches to Automatic Commun. ACM 55, 10 (Oct. 2012), 78–87. https://doi.org/10.1145/2347736. Schema Matching. VLDB J. 10 (12 2001), 334–350. https://doi.org/10.1007/ 2347755 s007780100057 [13] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin [35] Bahareh Rahmanian and Joseph G. Davis. 2014. User Interface Design for Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge Crowdsourcing Systems. In Proceedings of the 2014 International Working vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings Conference on Advanced Visual Interfaces (AVI ’14). Association for Computing of the 20th ACM SIGKDD international conference on Knowledge discovery and Machinery, New York, NY, USA, 405–408. https://doi.org/10.1145/2598153. data mining. 601–610. 2602248 [14] Xin Dong, Barna Saha, and Divesh Srivastava. 2012. 
Less is more: Selecting [36] Alexander J Ratner, Stephen H Bach, Henry R Ehrenberg, and Chris RΓ©. 2017. sources wisely for integration. Proceedings of the VLDB Endowment 6, 37–48. Snorkel: Fast training set generation for information extraction. In Proceedings [15] Xin Luna Dong and Divesh Srivastava. 2015. Big data integration. Synthesis of the 2017 ACM international conference on management of data. 1683–1686. Lectures on Data Management 7, 1 (2015), 1–198. [37] Michael Schmitz, Stephen Soderland, Robert Bart, Oren Etzioni, et al. 2012. [16] Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying rela- Open language learning for information extraction. In Proceedings of the 2012 tions for open information extraction. In Proceedings of the 2011 conference on Joint Conference on Empirical Methods in Natural Language Processing and empirical methods in natural language processing. 1535–1545. Computational Natural Language Learning. 523–534. [17] Wenfei Fan and Floris Geerts. 2012. Foundations of data quality management. [38] Zeyuan Shang, Emanuel Zgraggen, Benedetto Buratti, Ferdinand Kossmann, Synthesis Lectures on Data Management 4, 5 (2012), 1–217. Philipp Eichmann, Yeounoh Chung, Carsten Binnig, Eli Upfal, and Tim Kraska. [18] Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, 2019. Democratizing Data Science through Interactive Curation of ML Christian Schallhart, and Cheng Wang. 2014. DIADEM: thousands of websites Pipelines. In Proceedings of the 2019 International Conference on Management to a single database. Proceedings of the VLDB Endowment 7, 14 (2014), 1845– of Data (SIGMOD ’19). Association for Computing Machinery, New York, NY, 1856. USA, 1171–1188. https://doi.org/10.1145/3299869.3319863 [19] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F Naughton, Narasimhan [39] Kai-Sheng Teong, Lay-Ki Soon, and Tin Tin Su. 2020. Schema-Agnostic Entity Rampalli, Jude Shavlik, and Xiaojin Zhu. 2014. Corleone: hands-off crowdsourc- Matching using Pre-trained Language Models. In Proceedings of the 29th ACM ing for entity matching. In Proceedings of the 2014 ACM SIGMOD international International Conference on Information & Knowledge Management. 2241–2244. conference on Management of data. 601–612. [40] Sebastian Thrun and Lorien Pratt. 2012. Learning to learn. Springer Science & [20] Pankaj Gulhane, Amit Madaan, Rupesh Mehta, Jeyashankher Ramamirtham, Business Media. Rajeev Rastogi, Sandeep Satpal, Srinivasan H Sengamedu, Ashwin Tengli, and [41] Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, and Aditya Charu Tiwari. 2011. Web-scale information extraction with vertex. In 2011 Parameswaran. 2018. Accelerating Human-in-the-Loop Machine Learning: IEEE 27th International Conference on Data Engineering. IEEE, 1209–1220. Challenges and Opportunities. In Proceedings of the Second Workshop on Data [21] Jinsong Guo, Valter Crescenzi, Tim Furche, Giovanni Grasso, and Georg Gott- Management for End-To-End Machine Learning (DEEM’18). Association for lob. 2019. RED: Redundancy-Driven Data Extraction from Result Pages?. In Computing Machinery, New York, NY, USA, Article 9, 4 pages. https://doi. The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May org/10.1145/3209889.3209897 13-17, 2019, Ling Liu, Ryen W. White, Amin Mantrach, Fabrizio Silvestri, Ju- [42] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated lian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 605–615. 
machine learning: Concept and applications. ACM Transactions on Intelligent https://doi.org/10.1145/3308558.3313529 Systems and Technology (TIST) 10, 2 (2019), 1–19.