<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Noah: Creating Data Integration Pipelines over Continuously Extracted Web Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Valerio</forename><surname>Cetorelli</surname></persName>
							<email>valerio.cetorelli@uniroma3.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
							<email>valter.crescenzi@uniroma3.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Università Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
							<email>paolo.merialdo@uniroma3.it</email>
							<affiliation key="aff2">
								<orgName type="institution">Università Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Roger</forename><surname>Voyat</surname></persName>
							<email>roger.voyat@uniroma3.it</email>
							<affiliation key="aff3">
								<orgName type="institution">Università Roma Tre</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Noah: Creating Data Integration Pipelines over Continuously Extracted Web Data</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D23CC8A54D94908B6324C78DE58ABD45</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present Noah, an ongoing research project aiming at developing a system for semi-automatically creating end-to-end Web data processing pipelines. The pipelines continuously extract and integrate information from multiple sites by leveraging the redundancy of the data published on the Web. The system is based on a novel hybrid human-machine learning approach in which the same type of questions can be interchangeably posed both to human crowd workers and to automatic responders based on machine learning (ML) models. Since the early stages of a pipeline, crowd workers are engaged to guarantee the quality of the output data and to collect training data, which are then used to progressively train and evaluate the automatic responders. The latter are later fully deployed into the data processing pipelines to scale the approach and to contain the crowdsourcing costs. The combination of guaranteed quality and progressively decreasing costs in the pipelines generated by our system can improve the investment and development processes of the many applications that build on the availability of such data processing pipelines.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION AND MOTIVATION</head><p>The Web is the largest knowledge base ever built by humans. However, most of the data on the Web are not directly available to applications, unless complex data extraction and integration pipelines are set up. Creating these pipelines to build structured knowledge bases, and continuously maintaining them in a cost-effective way, is still a challenging problem. Currently, most projects fulfill their data processing needs by means of case-by-case solutions that cannot be reused across projects.</p><p>This paper presents Noah, a research project that aims at developing a system for creating, and maintaining over time, end-to-end data processing pipelines for continuously extracting and integrating Web data. Noah is based on a hybrid human-machine learning approach, whose goal is to guarantee the quality of the processed data by leveraging feedback provided by human crowd workers. Our approach can be classified in the realm of Open Information Extraction <ref type="bibr" target="#b30">[31]</ref>, because it aims at extracting and integrating information both at the instance (objects) and at the schema (attributes) level into an internal knowledge base (IKB) that is created, populated and maintained for every domain. Indeed, if new sources are incrementally added to an already generated pipeline, the system is able to discover new entities and new attributes from those sources.</p><p>In order to contain the crowdsourcing costs, the proposed approach leverages two techniques. First, it exploits the inherent redundancy of Web sources to automatically find correct domain information: data published by several independent sources are more likely to be correct and can easily be distinguished from noisy or non-relevant data <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b14">15]</ref>. Second, it exploits the collected data to continuously train ML models. 
Those ML models are progressively introduced in the form of automatic responders that replace crowd workers <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b29">30]</ref>, and they are continuously evaluated during each step of the data processing pipelines: only responders that become sufficiently reliable are fully deployed in the operations of the created pipelines. Problem Description. We are given a set of sources S = {𝑆 1 , 𝑆 2 , . . .} from the same domain (e.g., Smartphones); each source 𝑆 𝑖 is specified by means of 𝑛 𝑖 URLs of detail pages about domain objects (e.g., IPhone 12, Mi 10T). By detail page we mean a page reporting information about one particular object, the topic entity <ref type="bibr" target="#b28">[29]</ref> of the page, for which it publishes the values of several attributes. An example of detail pages from two sources, both about the same IPhone 12 domain object, is shown in Figure <ref type="figure" target="#fig_0">1</ref>, where the values of several attributes of interest, such as Model, Memory, and Price, are highlighted.</p><p>A domain includes a set of objects O = {𝑜 1 , 𝑜 2 , . . .} and a set of attributes A = {𝐴 1 , 𝐴 2 , . . .}, which are populated with data extracted from the pages of the sources belonging to that domain. New attributes and new objects of a domain can be discovered as new sources are considered part of the domain.</p><p>Each source publishes detail pages reporting the values of a subset of the domain attributes, for a subset of the domain objects. We use the terms source attributes and source objects to denote the version of a domain attribute or object as published by a source, i.e., the occurrences of attribute values about an object as published by that source. 
It is worth noticing that some domain attributes can be published, possibly with inconsistencies amongst the provided values, by several sources, e.g., Model, while other attributes, e.g., ReviewScore or Price, have values which are inherently source-specific.</p><p>In the following, we identify a source object by means of the URL of the detail page hosting its data, and we identify a source attribute by means of a domain-unique identifier of the extraction rule that is capable of locating its value in the detail page. By extraction rule we mean a function extracting at most one value from a detail page; the formalism in which it is specified, e.g., XPath expressions, does not matter.</p><p>Our goal is that of continuously extracting data with a guaranteed level of quality from the detail pages composing the sources. The data are reorganized into the IKB while minimizing the overall costs. As a measure of data quality, we will use standard measures such as precision, recall, and 𝐹-measure over the integrated data <ref type="bibr" target="#b22">[23]</ref>. As a measure of cost, the goal is that of minimizing the crowdsourcing costs <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b26">27]</ref>.</p><p>In the IKB the following information will be available: (linkages and matches) how the source attributes and objects are respectively mapped to the domain attributes and objects; (value provenance) the source attribute values for every object in the domain.</p><p>The problem we want to solve is that of continuously creating K 𝑡 , the IKB at every time 𝑡 at which the snapshots of the detail pages from every source of a domain D are gathered. We illustrate the problem definition by means of a running example shown in Figure <ref type="figure" target="#fig_1">2</ref>.</p></div>
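The notions introduced above (sources, detail pages identified by URL, extraction rules identified by a domain-unique id, and an IKB holding linkages, matches, and value provenance) can be sketched as plain data structures. The following is a minimal illustrative sketch under our own naming assumptions, not the system's actual code:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ExtractionRule:
    rule_id: str  # unique identifier within the domain
    extract: Callable[[str], Optional[str]]  # extracts at most one value per page

@dataclass
class IKB:
    # linkages: source-object URL -> domain object id
    linkages: dict = field(default_factory=dict)
    # matches: extraction-rule id -> domain attribute id
    matches: dict = field(default_factory=dict)
    # provenance: (domain object, domain attribute) -> {rule_id: extracted value}
    provenance: dict = field(default_factory=dict)

    def record(self, url: str, rule: ExtractionRule, page_html: str) -> None:
        # Store the value extracted by a rule from a detail page, keeping
        # track of which source attribute (rule) it came from.
        value = rule.extract(page_html)
        obj = self.linkages.get(url)
        attr = self.matches.get(rule.rule_id)
        if value is not None and obj is not None and attr is not None:
            self.provenance.setdefault((obj, attr), {})[rule.rule_id] = value
```

The dictionaries make the three kinds of IKB information from the problem description explicit; a real implementation would of course persist and version them per snapshot time 𝑡.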
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">SCOPE, OPPORTUNITIES, CHALLENGES</head><p>Building and maintaining effective data processing pipelines over Web data is a challenging problem for several reasons. First, Web sources are autonomous and remote: they can change unpredictably and thereby break all the extraction rules created on previous versions of the same source. Second, setting up an integration pipeline requires solving many interrelated tasks, each of which has motivated a flurry of research works, including: source discovery, data extraction, schema matching, record linkage, data fusion, data labeling, and data cleaning. Each of these problems has been extensively studied over the last decades, with tens, if not hundreds in some cases, of well-recognized research works <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b38">39]</ref>.</p><p>The focus of our research project covers the three problems that we believe are at the core of any Web data integration pipeline: extraction, matching, and linkage. 
It does not include, on the one hand, the source discovery problem and the automatic synthesis of crawling programs, nor, on the other hand, the data fusion problem.</p><p>Our solution can help the several projects that need to set up, and maintain over time, Web data processing pipelines, but that require a guaranteed quality of the pipelines' output data to be business meaningful.</p><p>Clearly, the amount of work outsourced to crowd workers to guarantee the quality level largely depends on the inherent characteristics of the domain: domains containing static attributes that are largely redundant from source to source can dramatically simplify domain data detection, extraction, and schema matching; an attribute working as a soft identifier across several sources (e.g., books' ISBN) can contribute significantly to reducing the cost of the record linkage task for a domain.</p><p>Unfortunately, it turns out that many interesting domains (e.g., job postings, real estate, . . . ) do not exhibit such redundancy, and the type of redundancy that the system has to exploit is at an intensional level, i.e., type and format of values, range of values, labels of extracted data. Generally speaking, separating domain data from other information becomes largely dependent on the context in which the attributes are proposed, and on the availability of human feedback to check the correctness of the proposed hypotheses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Redundancy as OpenIE Enabler</head><p>Redundancy plays a fundamental role in our system to keep the crowdsourcing costs at reasonable levels. Whenever the redundancy of data across sources is properly detected and exploited, domain data can be distinguished from other noisy or out-of-domain information. For example, WEIR <ref type="bibr" target="#b3">[4]</ref> assumes that the linkages between the collections of pages of two sources are already known as part of the input, and then exploits the redundancy of distinct and independent sources that publish information about the same objects and attributes to automatically find correct extraction rules and schema matches.</p><p>Noah aims at exploiting redundancy for extracting and integrating Web data, as pioneered by WEIR, to the largest possible extent. It will exploit at least the following forms of redundancy: Intensional: several sources publish the same domain attributes; Extensional: several sources publish information about the same domain objects; Temporal: a source publishes data about the same domain objects and attributes over time; Intra-source: a source can publish data about the same objects in pages of distinct types, e.g., a result page containing snippets of records with the most relevant attributes plus links to detail pages containing all attributes <ref type="bibr" target="#b20">[21]</ref>.</p><p>At the same time, and with the help of human feedback, Noah aims at overcoming WEIR's limitations by relaxing its rather strict underlying assumptions on the input domain: WEIR requires that enough intensional and extensional redundancy is available to discern all domain data from all other information. 
WEIR and Noah fall in the realm of the OpenIE approaches <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b32">33,</ref><ref type="bibr" target="#b36">37]</ref>: unlike in the ClosedIE approaches <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b27">28]</ref>, where the managed knowledge base does not grow in terms of subjects and predicates but only in terms of values, new schema information, e.g., new domain attributes, can be progressively discovered while populating the knowledge base with entities and values of the already known schema.</p><p>There are two main differences between Noah and other OpenIE <ref type="bibr" target="#b28">[29]</ref> systems: first, we do not require a pre-populated knowledge base, as we start from an empty IKB and populate it as new sources are added to the domain; second, we aim at continuously extracting and integrating data <ref type="bibr" target="#b10">[11]</ref>, as we believe that the temporal setting is important both for business reasons (many projects need a continuous stream of data rather than snapshots), and for including in the main problem definition the maintenance costs of the generated pipelines over time, costs that are largely neglected in many research proposals <ref type="bibr" target="#b28">[29]</ref>.</p><p>Although many of the problems that need to be tackled to create our pipelines have already been extensively covered in the research literature, we believe that semi-automating the creation of Web data processing pipelines can still be considered a relevant problem <ref type="bibr" target="#b9">[10]</ref>.</p><p>We argue that if the costs and the guaranteed level of quality <ref type="bibr" target="#b16">[17]</ref> are explicitly considered, many projects relying on data processing pipelines 
can be turned into a much more controllable investment and validation process, and their overall feasibility can be significantly improved, because many business projects are strongly and directly affected by the cost of creating and maintaining the underlying Web data processing pipelines.</p><p>Moreover, we believe that by posing the same type of queries to human and automatic responders, they become interchangeable enough to motivate the study of new deployment methodologies for Web data processing pipelines. The goal of such methodologies is to progressively lower the crowdsourcing costs by means of machine-learning techniques, while keeping the output quality level under control since the early stages of the deployed pipelines. Indeed, many development projects often experience unpredictable and erratic time-to-market (TTM) and return-on-investment (ROI) because, especially in the early stages, they adopt ML algorithms but lack the amount and quality of training data, and the validation, needed to guarantee the desired output quality. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">NOAH SYSTEM AND PIPELINES</head><p>The Noah system supports the semi-automatic generation of end-to-end Web data processing pipelines over several domains. Figure <ref type="figure" target="#fig_2">3</ref> shows how the system can generate and operate many pipelines at the same time, each having an IKB that is progressively and continuously populated with data coming from the sources of the domain on which it operates. Our system interacts with external systems by means of two major components: the Crawler, which continuously downloads snapshots of the pages of every source with a frequency specified by a cron expression; and the Crowd Manager, which manages the interactions with a crowdsourcing platform.</p><p>During operations, Noah generates pipeline queries for the responders engaged through the crowdsourcing platform. The responders contribute to solving the system tasks needed to set up and maintain new pipelines: for example, tasks are needed to select the initial extraction rules over every domain source, to select and label the source attributes, to find the linkages from source objects to a common mediated domain object, and to match the source attributes across several sources to a mediated domain attribute.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>System Tasks</head><p>The main system tasks that need to be tackled to set up a Noah pipeline are shown in Figure <ref type="figure" target="#fig_3">4</ref>: Page Linkage, Data Extraction, Schema Matching, and Object Linkage.</p><p>Page Linkage aims at obtaining a first approximation of the top-𝑘 page linkages. Two pages are in a linkage if they both publish data related to the same domain object.</p><p>Example 3.1 (Page Linkage). In Figure <ref type="figure" target="#fig_1">2</ref> we can see two possible page linkages at time 𝑡 𝑛 : {(𝑝 1 1 , 𝑝 2 1 ), (𝑝 1 𝑚 , 𝑝 2 4 )}. Their distances, i.e., 0.09 and 0.12, are shown at the top of Figure <ref type="figure" target="#fig_6">5a</ref>.</p><p>Data Extraction aims at finding all the correct extraction rules. It generates all the possible extraction rules and discovers the correct ones by exploiting the redundancy of the data published across several independent sources <ref type="bibr" target="#b3">[4]</ref> when available, while querying the responders <ref type="bibr" target="#b6">[7]</ref> to confirm uncertain hypotheses.</p><p>Schema Matching aims at finding matches between extraction rules by exploiting an instance-based distance measure between source attributes. The instance-based distance between two extraction rules assumes the availability of correct object linkages to align the source objects related to the same domain object, as produced in output by the next system task: the distance is obtained by averaging the distance between the values extracted by the two rules over all linked pages.</p><p>Object Linkage aims at finding linkages between source objects by exploiting a pairwise attribute distance measure between source attributes. 
The pairwise attribute distance between two source objects assumes the availability of correct schema matches across the extraction rules to align the source attributes related to the same domain attribute, as produced in output by the previous system task: the distance is obtained by averaging the distance between the two values over all the matching attributes.</p><p>We name this linkage/matching loop of system tasks the Linkage/Matching Duality; we further discuss it in Section 3.1.</p></div>
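The two dual distances just described can be sketched directly from their definitions: each averages a value-level distance, over linked pages in one case and over matched attributes in the other. The value-level distance itself is left unspecified above, so a normalized similarity from Python's difflib stands in for it here purely as an assumption:

```python
from difflib import SequenceMatcher

def value_distance(a: str, b: str) -> float:
    # Illustrative normalized value distance in [0, 1];
    # the actual measure used by the system may differ.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def instance_based_distance(rule_a, rule_b, linkages):
    # Schema Matching side: compare two extraction rules over the values
    # they extract from pairs of detail pages already known to be linked.
    values = [(rule_a(pa), rule_b(pb)) for pa, pb in linkages]
    values = [(va, vb) for va, vb in values if va is not None and vb is not None]
    return sum(value_distance(va, vb) for va, vb in values) / len(values)

def pairwise_attribute_distance(obj_a, obj_b, matches):
    # Object Linkage side: compare two source objects (dicts rule_id -> value)
    # over the pairs of attributes already known to match.
    ds = [value_distance(obj_a[ra], obj_b[rb])
          for ra, rb in matches if ra in obj_a and rb in obj_b]
    return sum(ds) / len(ds)
```

Note the duality in the signatures: the first function needs fixed linkages, the second needs fixed matches, which is exactly the loop discussed in Section 3.1.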
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Pipeline Queries</head><p>Noah tries to solve every system task necessary to set up and maintain a pipeline by using a human-in-the-loop approach <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b25">26]</ref>: unsupervised algorithms generate the most likely hypotheses based on the available redundancy. These hypotheses are later refuted or validated by means of queries posed to responders, initially only human responders, and later also automatic responders based on ML models trained with the data collected while operating the Noah pipeline (see Section 4).</p><p>An example of the queries posed to the responders for every system task is shown in Figure <ref type="figure" target="#fig_3">4</ref>: Page Linkage, Data Extraction, Schema Matching and Object Linkage.</p><p>Example 3.4 (Schema Matching Query). Figure <ref type="figure" target="#fig_3">4</ref> shows that schema matching tasks can be solved by means of queries confirming or refuting a single match: 'Do "108MP" and "20MP" refer to the same attribute of object "MI 10"?'. The template of the query supporting a schema matching task has been filled with values extracted from two pages of distinct sources, e.g., using the extraction rules (𝑟 1 5 , 𝑟 2 4 ). These are two detail pages considered in a linkage, and "MI 10" is the name associated with the corresponding domain object.</p><p>Example 3.5 (Page Linkage Query). A query such as 'Do these two pages refer to the same object?' posed to human responders in Figure <ref type="figure" target="#fig_3">4</ref> can validate or refute a page linkage (𝑝 1 𝑚 , 𝑝 2 4 ). In order for the query to be as simple as possible <ref type="bibr" target="#b34">[35]</ref>, we can show the user a screenshot of the original pages.</p><p>Example 3.6 (Object Linkage Query). Unlike the case of the page linkage tasks above, here the query is posed directly on source objects with extracted values. 
A query such as 'Do these two objects refer to the same object?' posed to human responders in Figure <ref type="figure" target="#fig_3">4</ref> can validate or refute an object linkage (𝑝 1 1 , 𝑝 2 1 ). To make the query as simple as possible for a human responder, it is shown together with two records whose attributes have already been aligned by leveraging the results of a schema matching task.</p><p>The tremendous success of crowdsourcing <ref type="bibr" target="#b23">[24]</ref> can be partially explained by observing that human supervision can be the essential final ingredient for problems that are really hard to solve through automatic algorithms, but that can be transformed into rather simple questions for human workers. However, it is well known that in practice the availability and the accuracy of crowd workers, especially of unskilled ones, strongly depend on the way the questions are posed and rewarded <ref type="bibr" target="#b34">[35]</ref>. One of Noah's goals is to exploit the progressively built IKB also to make the crowdsourcing queries as simple as possible. For example, a query to check a record linkage exploits the schema matching already computed to make the two records easy to compare visually.</p></div>
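The templated nature of pipeline queries, filled with values drawn from the IKB as in the examples above, can be sketched as follows. The template wording for the schema matching query echoes Example 3.4; the other templates and the function name are our own illustrative assumptions, not the system's actual API:

```python
# Hypothetical pipeline-query templates in the spirit of Figure 4.
QUERY_TEMPLATES = {
    "schema_matching": 'Do "{v1}" and "{v2}" refer to the same attribute of object "{obj}"?',
    "page_linkage":    "Do these two pages refer to the same object?",
    "object_linkage":  "Do these two objects refer to the same object?",
}

def render_query(task: str, **slots) -> str:
    # Fill the task's template with values taken from the IKB
    # (extracted values, object names, ...).
    return QUERY_TEMPLATES[task].format(**slots)
```

Because the same rendered question can be sent either to a crowd worker or to an automatic responder, the template layer is what makes the two kinds of responders interchangeable.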
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Linkage / Matching Duality</head><p>Figure <ref type="figure" target="#fig_3">4</ref> shows that two important integration tasks operated by Noah pipelines, i.e., Schema Matching and Object Linkage, are part of a loop in which each one assumes the availability of the output of the other to solve its own task. Page Linkage is the system task outside the loop, needed for its initial triggering.</p><p>We assume the availability of two normalized distance functions providing a value between 0 and 1 when comparing two rules, and two source objects (records), respectively: the instance-based distance and the pairwise attribute distance. The former compares two rules over the values they extract from a set of detail pages which have been previously aligned, i.e., their linkages are fixed; the latter compares two source objects over the values of their matched source attributes.</p><p>Example 3.7 (Dual Distances). Instance-based distance: let (𝑝 1 𝑚 , 𝑝 2 4 ) and (𝑝 1 1 , 𝑝 2 1 ) be two given correct linkages for the detail pages associated with the IPhone 12 and MI 10 source objects from sources 𝑆 1 and 𝑆 2 , as shown in Figure <ref type="figure" target="#fig_6">5d</ref>. The distance between the rules (𝑟 1 5 , 𝑟 2 4 ) can be computed as follows:</p><formula xml:id="formula_0">𝑑 (𝑟 1 5 , 𝑟 2 4 ) = 𝑑 (𝑟 1 5 (𝑝 1 1 ), 𝑟 2 4 (𝑝 2 1 )) + 𝑑 (𝑟 1 5 (𝑝 1 𝑚 ), 𝑟 2 4 (𝑝 2 4 )) = 𝑑 ('108MP', '108MP') + 𝑑 ('12MP', '14MP') = 2.9. The normalized distance in the range [0, 1] is 0.27.</formula><p>Pairwise attribute distance: let (𝑟 2 2 , 𝑟 1 2 ) and (𝑟 1 1 , 𝑟 2 1 ) be two given correct matches for the Brand and Model attributes (see Figure <ref type="figure" target="#fig_6">5b</ref>). The distance between the two source objects about MI 10 PRO and MI 10T can be computed as follows:</p><formula xml:id="formula_1">𝑑 (𝑜 1 2 , 𝑜 2 2 ) = 𝑑 (𝑟 1 2 (𝑝 1 2 ), 𝑟 2 2 (𝑝 2 2 )) + 𝑑 (𝑟 1 1 (𝑝 1 2 ), 𝑟 2 1 (𝑝 2 2 )) = 𝑑 ('XIAOMI', 'XIAOMI') + 𝑑 ('MI 10 PRO', 'MI 10T') = 3.2. </formula><p>
The normalized distance in the range [0, 1] is 0.27.</p><p>We revisit and propose an extension of two domain properties, called Local Consistency and Separable Domain, underlying the formal approach presented in WEIR <ref type="bibr" target="#b3">[4]</ref> for solving the extraction and matching problems when the page linkage is given as input.</p><p>Our ambition is twofold: on the one side, we aim to extend that approach to cover the whole trio of extraction, matching and linkage problems at the core of Noah pipelines; on the other side, we want to relax the underlying assumptions by means of the feedback provided by human crowd workers, thus making the approach adaptable to domains with more disparate characteristics than those originally covered in the WEIR project. Here we briefly recall the two properties and sketch how we plan to extend them.</p><p>Local Consistency (LC) In a source there cannot be two distinct source attributes that refer to the same domain attribute. The dual property that we additionally assume is that two distinct detail pages from the same source cannot publish data about the same domain object. Separable Domain (SD) In a mapping composed of several extraction rules, each from a distinct source, and associated with the same domain attribute, the instance-based distances between the rules of the mapping are always smaller than their distances to rules associated with a different domain attribute. For computing the instance-based distance, the object linkages are fixed and already known.</p><p>The dual property that we additionally assume is that in a linkage composed of several source objects from distinct sources and related to the same domain object, the pairwise attribute distances are always smaller than the distances to source objects associated with a different domain object. 
For computing the pairwise attribute distance, the source attribute matches are fixed and already known.</p><p>For domains in which such properties hold, the WEIR system is able to match the extraction rules and build their mappings into clusters of source attributes related to the same domain attribute by comparing all the similarity distances, while at the same time separating the correct extraction rules from the noisy ones. The idea is pretty simple and depicted in Figure <ref type="figure" target="#fig_6">5</ref>: SD suggests to sort the set of all possible matches (pairs of extraction rules) by the instance-based distance, leveraging the alignment of the detail pages (see Figure <ref type="figure" target="#fig_6">5c</ref>). Those pairs are then processed in order of increasing distance: each pair of rules is merged into the same mapping as long as the addition of the rules does not lead to a violation of the LC property, i.e., two rules (source attributes) from the same source ending up in the same output mapping (see Figure <ref type="figure" target="#fig_6">5d</ref>). For certain domains, with sufficiently overlapping sources, WEIR can automatically find the correct extraction rules and their matches with the rules of other sources, provided that the correct linkages between detail pages are known.</p><p>The dual algorithm solves the problem of finding correct object linkages provided that correct schema matches between source attributes are given, as depicted in Figure <ref type="figure" target="#fig_6">5</ref>: SD suggests to sort the set of all possible linkages (pairs of source objects) by the pairwise attribute distance (see Figure <ref type="figure" target="#fig_6">5a</ref>). 
Those pairs are then processed in order of increasing distance: each pair of source objects is merged into the same linkage as long as the addition of the objects to an existing linkage does not lead to a violation of the LC property, i.e., two source objects from the same source ending up in the same output linkage (see Figure <ref type="figure" target="#fig_6">5b</ref>). This algorithm exploits the duality of the matching and linkage problems in this setting, and it is at the core of the integration engine of the Noah project. However, differently from WEIR, it does not halt the integration as soon as an LC violation is detected: rather, it generates pipeline queries to confirm the choice, and continues processing all pairs in increasing order of distance, until the distance exceeds a threshold beyond which no further matches/linkages are expected with meaningful distance functions.</p><p>Unfortunately, as also recognized in WEIR <ref type="bibr" target="#b3">[4]</ref>, some domains have sources and attributes with very similar but semantically different values (e.g., the resolution of the front/rear cameras in Figure <ref type="figure" target="#fig_1">2</ref>). This situation easily leads to violations of the LC and SD assumptions, and finding the mappings is a challenging problem for many interesting domains.</p><p>Example 3.8 (Non-separable Domains for Schema Matching). In Figure <ref type="figure" target="#fig_1">2</ref>, sources 𝑆 1 and 𝑆 2 both have extraction rules ((𝑟 1 5 , 𝑟 1 6 ) and (𝑟 2 4 , 𝑟 2 5 ), respectively) with a low distance (Figure <ref type="figure" target="#fig_6">5c</ref>) because camera resolutions (e.g., 1-front and 2-back) are typically within a small range of values expressed in megapixels (MP). 
Figure <ref type="figure" target="#fig_6">5d</ref> shows that the pair of rules (𝑟 1 5 , 𝑟 1 6 ) at distance 0.25 violates the LC and SD assumptions, because their distance is smaller than the distance of (𝑟 1 5 , 𝑟 2 4 ), which is 0.27. Actually, it is well known that the dual Record Linkage problem is even more challenging than Schema Matching itself: the attributes carrying the correct signals for considering two objects equivalent can change from object to object, even within the same source (think of smartphones of different brands with different policies for naming the models and differentiating the features of each model). Assuming that no object in the domain leads to a separability violation is quite unrealistic, besides toy cases.</p><p>Example 3.9 (Non-separable Domains for Object Linkage). In Figure <ref type="figure" target="#fig_6">5b</ref> the linkage (𝑝 1 1 , 𝑝 2 2 ) violating the LC property has a pairwise attribute distance of only 0.09, which is smaller than the distance of the correct linkage (𝑝 1 1 , 𝑝 2 1 ), and therefore the domain is not separable.</p><p>We believe that the violations of the LC and SD assumptions can be manually fixed, and that they help to find the most informative pipeline queries that need to be posed to external responders, i.e., paid crowd workers or suitably trained automatic responders.</p><p>By interleaving the dual linkage/matching algorithms in a loop in which external responders can contribute, as shown in Figure <ref type="figure" target="#fig_3">4</ref>, each execution can contribute to improve the accuracy of the distance function used by the other task, either by improving the linkages used by the instance-based distance, or by improving the matches used by the pairwise attribute distance.</p><p>Our vision is that, with the precious help of crowdsourcing and a loop of interleaved linkage/matching operations, the desired target quality can be reached even in the presence of 
non-separable domains: responders will be engaged to assess the quality of the output and to repair the uncertain choices made by the integration algorithm. The linkages and matches confirmed by human feedback can be frozen and exploited in the following iterations, progressively resolving, and hence removing from the domain, the linkages or matches that made it non-separable.</p></div>
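The LC-constrained greedy loop described above can be sketched in Python. This is an illustrative sketch only, not the Noah implementation: the names `integrate`, `source_of`, and the `threshold` value are our assumptions.

```python
def integrate(pairs, source_of, threshold=0.5):
    """Greedily build linkages from candidate pairs sorted by distance,
    never letting two objects of the same source share a linkage (LC).

    pairs: iterable of (distance, a, b) candidate linkages/matches.
    source_of: maps an object to its source identifier.
    Returns the accepted linkages and the uncertain pairs turned into
    pipeline queries for external responders.
    """
    cluster = {}   # object -> the set of objects it is currently linked to
    queries = []   # LC-violating pairs turned into pipeline queries

    def group(x):
        return cluster.setdefault(x, {x})

    for dist, a, b in sorted(pairs):
        if dist > threshold:
            # beyond this distance no meaningful match/linkage is expected
            break
        ga, gb = group(a), group(b)
        if ga is gb:
            continue
        union = ga | gb
        if len({source_of(x) for x in union}) < len(union):
            # LC violation: two objects of the same source would end up in
            # one linkage. Unlike WEIR, do not halt -- emit a query to
            # confirm the choice, and keep processing the remaining pairs.
            queries.append((dist, a, b))
            continue
        for x in union:
            cluster[x] = union
    linkages = {frozenset(g) for g in cluster.values()}
    return linkages, queries
```

The key design choice the sketch mirrors is that an LC violation produces a query rather than stopping the integration, so the remaining unambiguous pairs are still merged.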
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">RESEARCH DIRECTIONS</head><p>In the early stages of its life, the IKB K of a new Noah pipeline might be scarcely populated. As redundancy builds up over time with the addition of new sources feeding the IKB, the accuracy of the extraction and integration process increases.</p><p>The absence of overlap among the objects and attributes published by a rather limited set of sources could limit the amount of available redundancy. In this situation, Noah would end up generating many queries to support the system tasks while operating the pipeline. As an alternative, Noah supports the incremental addition of a source into an existing pipeline. A new source might contribute to lowering the overall costs if it significantly overlaps with the sources already available for the domain <ref type="bibr" target="#b13">[14]</ref>. Conversely, integrating new sources that publish new objects or new attributes might incur additional costs to reconcile them with the existing IKB.</p><p>We are interested in studying ML techniques that could decrease crowdsourcing costs even in the absence of redundancy. The main research area is the synthesis of automatic responders capable of answering the same type of pipeline queries normally posed to human responders for solving Noah tasks, with the goal of progressively replacing human responders <ref type="bibr" target="#b6">[7]</ref> and scaling the approach up to many thousands of sources.</p><p>Unfortunately, state-of-the-art unsupervised ML techniques <ref type="bibr" target="#b39">[40,</ref><ref type="bibr" target="#b41">42]</ref> can be adapted to provide accurate and reliable answers to those queries only if enough training data have been collected. 
Indeed, fairness, bias, or simply the misuse of machine learning algorithms are well-known problems in the literature <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b31">32]</ref> that affect many development projects, especially in the scenarios most commonly found in practice <ref type="bibr" target="#b37">[38]</ref>: pre-trained ML models and/or enough training data are not available up-front, so the ML models cannot be properly tuned and exhibit erratic and unpredictable performance <ref type="bibr" target="#b40">[41]</ref>.</p><p>Snorkel <ref type="bibr" target="#b35">[36]</ref> is another project exploiting the idea of leveraging human work to train ML algorithms. However, it is based on the idea of engaging skilled workers in every step of the processing pipeline, while Noah aims at engaging non-skilled workers, who can interchangeably be posed queries in the same form as those posed to automatic responders. Several other projects, such as QOCO <ref type="bibr" target="#b1">[2]</ref> and SEER <ref type="bibr" target="#b21">[22]</ref>, have made use of crowdsourcing by mainly focusing on the problem of selecting the correct extraction rules, while Noah applies the same query control methodology to all the tasks in the considered pipelines.</p><p>It is also well known that engaging automatic responders that are not accurate enough might turn out to be more expensive than not using them at all, as additional human workers must be engaged just to offset their wrong answers <ref type="bibr" target="#b6">[7]</ref>. We envision a system in which crowd workers indirectly control the deployment of automatic responders, and the two types of responders are interchangeably engaged. Crowd workers contribute to collecting domain data that are then used to train and evaluate automatic responders before fully deploying them. 
Automatic responders will progressively replace crowd workers to scale the approach and lower the operating costs, but only after enough evidence has been gathered that their accuracy does not compromise the guaranteed quality of the output data. At steady state, crowd workers will be minimally engaged, only to keep monitoring the performance of the automatic responders.</p><p>We have identified several novel research challenges:</p><p>• formalizing and proving the correctness of an algorithm that solves the full trio of extraction, matching, and linkage tasks; • creating and maintaining over time the continuous Web data processing pipelines at low cost, with guaranteed output quality; • designing several independent automatic responders based on ML models that are capable of answering queries normally posed to crowd workers; • effectively measuring the available redundancy in a domain; • estimating, from the characteristics of a domain, the crowdsourcing costs necessary to obtain and maintain the desired output quality.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Web detail pages in the Smartphone domain.</figDesc><graphic coords="1,344.38,363.21,162.32,63.41" type="bitmap" /></figure>
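The break-even argument above, that insufficiently accurate automatic responders can cost more than using crowd workers alone, can be made concrete with a toy cost model. The model, the function name, and the default repair factor below are our illustrative assumptions, not the cost model of [7]:

```python
def deploy_automatic(accuracy, k_repair_queries=3):
    """Return True when an automatic responder is expected to be cheaper
    than routing every query to crowd workers.

    Toy model: a correct automatic answer costs nothing, while each wrong
    answer must be detected and offset by k_repair_queries human queries
    (each costing as much as answering the original query directly).
    """
    expected_human_queries = (1 - accuracy) * k_repair_queries
    # Deploy only below the human baseline of 1 query per question,
    # i.e. when accuracy > 1 - 1/k_repair_queries.
    return expected_human_queries < 1.0
```

Under this sketch, with 3 repair queries per mistake, a responder pays off only above roughly 67% accuracy, which is why crowd workers are kept in the loop to estimate accuracy before full deployment.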
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Running Example -The Smartphones domain includes 2 sources crawled at 𝑛 instants. For each source, 6 correct extraction rules working on several detail pages are given: 𝑟 𝑖 𝑗 (𝑗 = 1, . . . , 6) denotes the 𝑗-th rule working on source 𝑆 𝑖 , each extracting the value of a source attribute from a detail page associated with a source object. For example, 𝑝 1 3 indicates the page about iPhone 11 from source 𝑆 1 and rule 𝑟 1 2 extracts the Model from every page of the same source. At every time 𝑡, the values extracted from the two sources are conveniently depicted as organized in tables: each row of the table is associated with a detail page of the source, and each column is associated with an extraction rule over the same source. The set of domain attributes includes: Model, Brand, Price, Memory, Camera 1, Camera 2. Correct linkages can be represented as pairs of pages about the same domain objects: {(𝑝 1 1 , 𝑝2 1 ), (𝑝 1 𝑚 , 𝑝 2 4 )}. Correct source attribute matches can be represented as pairs of correct extraction rules: {(𝑟 1 1 , 𝑟 2 1 ), (𝑟 1 2 , 𝑟 2 2 ), (𝑟 1 4 , 𝑟 2 3 ), (𝑟 1 5 , 𝑟 2 4 ), (𝑟 1 6 , 𝑟 2 5 )}.</figDesc><graphic coords="2,102.57,83.69,390.15,143.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Overview of Noah System &amp; Pipelines created</figDesc><graphic coords="3,344.38,129.43,162.32,103.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Running example (pipeline with queries): tasks provided by the system and queries generated for hybrid human-machine responders.</figDesc><graphic coords="4,102.57,83.69,390.14,143.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Example 3 . 3 (</head><label>33</label><figDesc>Data Extraction Query).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4</head><label>4</label><figDesc>shows an example of query for Data Extraction tasks. The uncertainty of an extraction rule generated by wrapper inference can be validated by checking the extracted value on a detail page by means of a query such as: "Is '1050$' a Price?", where Price is a candidate label for the extraction rule and '1050$' is the extracted value.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Running example (distance similarity): 5a and 5c show distances in pyramids; 5b and 5d show the relations in the Cartesian plane, where uncertainties are due to the breaking of LC in a non-separable domain.</figDesc><graphic coords="5,53.80,84.89,112.17,98.09" type="bitmap" /></figure>
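The separability condition broken in Examples 3.8 and 3.9 amounts to requiring that every correct match/linkage pair be closer than every incorrect pair, so that a single threshold splits them. A minimal sketch follows; the function name is our assumption, and the distances reuse the running-example values (0.27 for a correct pair, 0.25 for an incorrect one):

```python
def is_separable(correct_distances, incorrect_distances):
    """A domain is separable when some threshold puts every correct
    match/linkage pair below it and every incorrect pair above it."""
    return max(correct_distances) < min(incorrect_distances)

# In the running example, the correct match (r1_5, r2_4) sits at 0.27
# while the incorrect pair (r1_5, r1_6) sits at 0.25, so no threshold
# separates them and the domain is non-separable.
```

When this check fails, no setting of the distance threshold can avoid either a missed match or an LC violation, which is exactly the situation where Noah falls back on pipeline queries.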
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The viability of crowdsourcing for survey research</title>
		<author>
			<persName><forename type="first">Tara</forename><forename type="middle">S</forename><surname>Behrend</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">J</forename><surname>Sharek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><forename type="middle">W</forename><surname>Meade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><forename type="middle">N</forename><surname>Wiebe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Behavior research methods</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">800</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Query-oriented data cleaning with oracles</title>
		<author>
			<persName><forename type="first">Moria</forename><surname>Bergman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tova</forename><surname>Milo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Slava</forename><surname>Novgorodov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wang-Chiew</forename><surname>Tan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</title>
				<meeting>the 2015 ACM SIGMOD International Conference on Management of Data</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1199" to="1214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Open information extraction from question-answer pairs</title>
		<author>
			<persName><forename type="first">Nikita</forename><surname>Bhutani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshihiko</forename><surname>Suhara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wang-Chiew</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alon</forename><surname>Halevy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">V</forename><surname>Jagadish</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1903.00172</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Extraction and integration of partially overlapping web sources</title>
		<author>
			<persName><forename type="first">Mirko</forename><surname>Bronzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Papotti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="805" to="816" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Crowdsourcing for data management</title>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alvaro</forename><forename type="middle">A A</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Norman</forename><forename type="middle">W</forename><surname>Paton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Information Systems</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="41" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">RoadRunner: automatic data extraction from data-intensive web sites</title>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giansalvatore</forename><surname>Mecca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2002 ACM SIGMOD international conference on Management of data</title>
				<meeting>the 2002 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="624" to="624" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Hybrid Crowd-Machine Wrapper Inference</title>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Disheng</forename><surname>Qiu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Knowledge Discovery from Data (TKDD)</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="43" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">An analysis of structured data on the web</title>
		<author>
			<persName><forename type="first">Nilesh</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashwin</forename><surname>Machanavajjhala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bo</forename><surname>Pang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1203.6406</idno>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Human-in-the-Loop Data Analysis: A Personal Perspective</title>
		<author>
			<persName><forename type="first">Anhai</forename><surname>Doan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3209900.3209913</idno>
		<ptr target="https://doi.org/10.1145/3209900.3209913" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA&apos;18)</title>
				<meeting>the Workshop on Human-In-the-Loop Data Analytics (HILDA&apos;18)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Toward a System Building Agenda for Data Integration</title>
		<author>
			<persName><forename type="first">Anhai</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adel</forename><surname>Ardalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><forename type="middle">R</forename><surname>Ballard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjib</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yash</forename><surname>Govind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pradap</forename><surname>Konda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Han</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erik</forename><surname>Paulson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><surname>Suganthan G. C.</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Haojun</forename><surname>Zhang</surname></persName>
		</author>
		<idno>arXiv:cs.DB/1710.00027</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Principles of data integration</title>
		<author>
			<persName><forename type="first">Anhai</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alon</forename><surname>Halevy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zachary</forename><surname>Ives</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A Few Useful Things to Know about Machine Learning</title>
		<author>
			<persName><forename type="first">Pedro</forename><surname>Domingos</surname></persName>
		</author>
		<idno type="DOI">10.1145/2347736.2347755</idno>
		<ptr target="https://doi.org/10.1145/2347736.2347755" />
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="78" to="87" />
			<date type="published" when="2012-10">Oct. 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Knowledge vault: A web-scale approach to probabilistic knowledge fusion</title>
		<author>
			<persName><forename type="first">Xin</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Evgeniy</forename><surname>Gabrilovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geremy</forename><surname>Heitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wilko</forename><surname>Horn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ni</forename><surname>Lao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kevin</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Strohmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shaohua</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="601" to="610" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Less is more: Selecting sources wisely for integration</title>
		<author>
			<persName><forename type="first">Xin</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barna</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Divesh</forename><surname>Srivastava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="37" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Big data integration</title>
		<author>
			<persName><forename type="first">Xin</forename><forename type="middle">Luna</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Divesh</forename><surname>Srivastava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthesis Lectures on Data Management</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="198" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Identifying relations for open information extraction</title>
		<author>
			<persName><forename type="first">Anthony</forename><surname>Fader</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Soderland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oren</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2011 conference on empirical methods in natural language processing</title>
				<meeting>the 2011 conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1535" to="1545" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Foundations of data quality management</title>
		<author>
			<persName><forename type="first">Wenfei</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Floris</forename><surname>Geerts</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthesis Lectures on Data Management</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="217" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">DIADEM: thousands of websites to a single database</title>
		<author>
			<persName><forename type="first">Tim</forename><surname>Furche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georg</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giovanni</forename><surname>Grasso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaonan</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Orsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Schallhart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cheng</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">14</biblScope>
			<biblScope unit="page" from="1845" to="1856" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Corleone: hands-off crowdsourcing for entity matching</title>
		<author>
			<persName><forename type="first">Chaitanya</forename><surname>Gokhale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjib</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anhai</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><forename type="middle">F</forename><surname>Naughton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Narasimhan</forename><surname>Rampalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jude</forename><surname>Shavlik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaojin</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 ACM SIGMOD international conference on Management of data</title>
				<meeting>the 2014 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="601" to="612" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Web-scale information extraction with vertex</title>
		<author>
			<persName><forename type="first">Pankaj</forename><surname>Gulhane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amit</forename><surname>Madaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rupesh</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeyashankher</forename><surname>Ramamirtham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rajeev</forename><surname>Rastogi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sandeep</forename><surname>Satpal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Srinivasan</forename><forename type="middle">H</forename><surname>Sengamedu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashwin</forename><surname>Tengli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Charu</forename><surname>Tiwari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 27th International Conference on Data Engineering. IEEE</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1209" to="1220" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">RED: Redundancy-Driven Data Extraction from Result Pages?</title>
		<author>
			<persName><forename type="first">Jinsong</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Furche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giovanni</forename><surname>Grasso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georg</forename><surname>Gottlob</surname></persName>
		</author>
		<idno type="DOI">10.1145/3308558.3313529</idno>
		<ptr target="https://doi.org/10.1145/3308558.3313529" />
	</analytic>
	<monogr>
		<title level="m">The World Wide Web Conference, WWW 2019</title>
				<editor>
			<persName><forename type="first">Ling</forename><surname>Liu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ryen</forename><forename type="middle">W</forename><surname>White</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Amin</forename><surname>Mantrach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Fabrizio</forename><surname>Silvestri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Julian</forename><forename type="middle">J</forename><surname>Mcauley</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ricardo</forename><surname>Baeza-Yates</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Leila</forename><surname>Zia</surname></persName>
		</editor>
		<meeting><address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019-05-13">May 13-17, 2019</date>
			<biblScope unit="page" from="605" to="615" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Synthesizing extraction rules from user examples with seer</title>
		<author>
			<persName><forename type="first">Maeda</forename><forename type="middle">F</forename><surname>Hanafi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Azza</forename><surname>Abouzied</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laura</forename><surname>Chiticariu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yunyao</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM International Conference on Management of Data</title>
				<meeting>the 2017 ACM International Conference on Management of Data</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1687" to="1690" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">How to measure data quality? A metric-based approach</title>
		<author>
			<persName><forename type="first">Bernd</forename><surname>Heinrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcus</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mathias</forename><surname>Klier</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">The rise of crowdsourcing</title>
		<author>
			<persName><forename type="first">Jeff</forename><surname>Howe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wired magazine</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1" to="4" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Wrapper induction for information extraction</title>
		<author>
			<persName><forename type="first">Nicholas</forename><surname>Kushmerick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">S</forename><surname>Weld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Doorenbos</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
		<respStmt>
			<orgName>University of Washington</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Human-in-the-Loop Data Integration</title>
		<author>
			<persName><forename type="first">Guoliang</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.14778/3137765.3137833</idno>
		<ptr target="https://doi.org/10.14778/3137765.3137833" />
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="2006" to="2017" />
			<date type="published" when="2017-08">August 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Crowdsourced Data Management: Overview and Challenges</title>
		<author>
			<persName><forename type="first">Guoliang</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yudian</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ju</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jiannan</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Reynold</forename><surname>Cheng</surname></persName>
		</author>
		<idno type="DOI">10.1145/3035918.3054776</idno>
		<ptr target="https://doi.org/10.1145/3035918.3054776" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD &apos;17)</title>
				<meeting>the 2017 ACM International Conference on Management of Data (SIGMOD &apos;17)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1711" to="1716" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">Colin</forename><surname>Lockard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xin</forename><forename type="middle">Luna</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arash</forename><surname>Einolghozati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prashant</forename><surname>Shiralkar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.04635</idno>
		<title level="m">Ceres: Distantly supervised relation extraction from the semi-structured web</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Openceres: When open information extraction meets the semi-structured web</title>
		<author>
			<persName><forename type="first">Colin</forename><surname>Lockard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prashant</forename><surname>Shiralkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xin</forename><forename type="middle">Luna</forename><surname>Dong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3047" to="3056" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Crowdsourced data management: Industry and academic perspectives</title>
		<author>
			<persName><forename type="first">Adam</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aditya</forename><surname>Parameswaran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Foundations and Trends in Databases</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="1" to="161" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Open information extraction systems and downstream applications</title>
		<author>
			<persName><surname>Mausam</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the twenty-fifth international joint conference on artificial intelligence</title>
				<meeting>the twenty-fifth international joint conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="4074" to="4077" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">A survey on bias and fairness in machine learning</title>
		<author>
			<persName><forename type="first">Ninareh</forename><surname>Mehrabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fred</forename><surname>Morstatter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nripsuta</forename><surname>Saxena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Lerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aram</forename><surname>Galstyan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.09635</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">Christina</forename><surname>Niklaus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthias</forename><surname>Cetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">André</forename><surname>Freitas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Siegfried</forename><surname>Handschuh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1806.05599</idno>
		<title level="m">A survey on open information extraction</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">A Survey of Approaches to Automatic Schema Matching</title>
		<author>
			<persName><forename type="first">Erhard</forename><surname>Rahm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</author>
		<idno type="DOI">10.1007/s007780100057</idno>
		<ptr target="https://doi.org/10.1007/s007780100057" />
	</analytic>
	<monogr>
		<title level="j">VLDB J</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="334" to="350" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">User Interface Design for Crowdsourcing Systems</title>
		<author>
			<persName><forename type="first">Bahareh</forename><surname>Rahmanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joseph</forename><forename type="middle">G</forename><surname>Davis</surname></persName>
		</author>
		<idno type="DOI">10.1145/2598153.2602248</idno>
		<ptr target="https://doi.org/10.1145/2598153.2602248" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces (AVI &apos;14)</title>
				<meeting>the 2014 International Working Conference on Advanced Visual Interfaces (AVI &apos;14<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="405" to="408" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Snorkel: Fast training set generation for information extraction</title>
		<author>
			<persName><forename type="first">Alexander</forename><forename type="middle">J</forename><surname>Ratner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><forename type="middle">H</forename><surname>Bach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Henry</forename><forename type="middle">R</forename><surname>Ehrenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Ré</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM international conference on management of data</title>
				<meeting>the 2017 ACM international conference on management of data</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1683" to="1686" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Open language learning for information extraction</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Schmitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Soderland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Bart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oren</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</title>
				<meeting>the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="523" to="534" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Democratizing Data Science through Interactive Curation of ML Pipelines</title>
		<author>
			<persName><forename type="first">Zeyuan</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emanuel</forename><surname>Zgraggen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benedetto</forename><surname>Buratti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ferdinand</forename><surname>Kossmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philipp</forename><surname>Eichmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yeounoh</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carsten</forename><surname>Binnig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eli</forename><surname>Upfal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Kraska</surname></persName>
		</author>
		<idno type="DOI">10.1145/3299869.3319863</idno>
		<ptr target="https://doi.org/10.1145/3299869.3319863" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 International Conference on Management of Data (SIGMOD &apos;19)</title>
				<meeting>the 2019 International Conference on Management of Data (SIGMOD &apos;19)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1171" to="1188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Schema-Agnostic Entity Matching using Pre-trained Language Models</title>
		<author>
			<persName><forename type="first">Kai-Sheng</forename><surname>Teong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lay-Ki</forename><surname>Soon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tin Tin</forename><surname>Su</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management</title>
				<meeting>the 29th ACM International Conference on Information &amp; Knowledge Management</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2241" to="2244" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<title level="m" type="main">Learning to learn</title>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Thrun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lorien</forename><surname>Pratt</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Springer Science &amp; Business Media</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Accelerating Human-in-the-Loop Machine Learning: Challenges and Opportunities</title>
		<author>
			<persName><forename type="first">Doris</forename><surname>Xin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Litian</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jialin</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Macke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shuchen</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aditya</forename><surname>Parameswaran</surname></persName>
		</author>
		<idno type="DOI">10.1145/3209889.3209897</idno>
		<ptr target="https://doi.org/10.1145/3209889.3209897" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning (DEEM&apos;18)</title>
				<meeting>the Second Workshop on Data Management for End-To-End Machine Learning (DEEM&apos;18)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Federated machine learning: Concept and applications</title>
		<author>
			<persName><forename type="first">Qiang</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yang</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tianjian</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yongxin</forename><surname>Tong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology (TIST)</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="1" to="19" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
