Noah: Creating Data Integration Pipelines over Continuously
                    Extracted Web Data
                              Valerio Cetorelli                                                            Valter Crescenzi
                           Università Roma Tre                                                           Università Roma Tre
                       valerio.cetorelli@uniroma3.it                                                 valter.crescenzi@uniroma3.it

                               Paolo Merialdo                                                                 Roger Voyat
                            Università Roma Tre                                                          Università Roma Tre
                        paolo.merialdo@uniroma3.it                                                     roger.voyat@uniroma3.it

ABSTRACT                                                                                  In order to contain the crowdsourcing costs, the proposed
We present Noah, an ongoing research project aiming at devel-                          approach leverages two techniques. First, it exploits the inherent
oping a system for semi-automatically creating end-to-end Web                          redundancy of Web sources to automatically find correct domain
data processing pipelines. The pipelines continuously extract and                      information: data published by several independent sources are
integrate information from multiple sites by leveraging the re-                        more likely to be correct and can be easily discerned by noisy
dundancy of the data published on the Web. The system is based                         or non-relevant data [8, 15]. Secondly, it exploits the collected
on a novel hybrid human-machine learning approach in which                             data to continuously train ML models. Those ML models are pro-
the same type of questions can be interchangeably posed both                           gressively introduced in the form of automatic responders that
to human crowd workers and to automatic responders based on                            replace crowd workers [1, 30], and are continuously evaluated
machine learning (ML) models. Since the early stages of pipelines,                     during each step of the data processing pipelines: only respon-
crowd workers are engaged to guarantee the output data quality,                        ders that become sufficiently reliable are fully deployed in the
and to collect training data, that are then used to progressively                      operations of the created pipelines.
train and evaluate automatic responders. The latter are fully de-
ployed into the data processing pipelines to scale the approach
and to contain the crowdsourcing costs later. The combination
of guaranteed quality and progressive reductions of costs of the
pipelines generated by our system can improve the investments
and development processes of many applications that build on
the availability of such data processing pipelines.

                                                                                          Figure 1: Web detail pages in the Smartphone domain.
1    INTRODUCTION AND MOTIVATION
The Web is the largest knowledge base ever built by humans.                                Problem Description. Given a set of sources S = {𝑆 1, 𝑆 2 . . .}
However, most of the data on the Web are not directly available                        from the same domain (e.g., Smartphones); each source 𝑆𝑖 is spec-
to applications, unless complex data extraction and integration                        ified by means of 𝑛𝑖 URLs of detail pages about domain objects
pipelines are set-up. Creating these pipelines to build structured                     (e.g., IPhone 12, Mi 10T). By detail page we mean a page report-
knowledge bases and continuously maintain them in a cost effec-                        ing information about a particular object, the topic entity [29]
tive way is still a challenging problem. Currently, most projects                      of the page, on which it publishes values of several attributes.
fulfill their data processing needs by means of case-by-case solu-                     An example of detail pages from two sources, about the same
tions that cannot be reused across projects.                                           IPhone 12 domain object is shown in Figure 1 where the values of
   This paper presents Noah, a research project that aims at devel-                    several attributes of interest such as Model, Memory, Price are
oping a system for creating and maintaining over time end-to-end                       highlighted.
data processing pipelines for continuously extracting and inte-                            A domain includes a set of objects O = {𝑜 1, 𝑜 2, . . .} and a set
grating Web data. Noah is based on an hybrid human-machine                             of attributes A = {𝐴1, 𝐴2, . . .} which will be populated with data
learning approach, whose goal is to guarantee the quality of pro-                      extracted from the pages of the sources belonging to that domain.
cessed data by leveraging feedbacks provided by human crowd                            New attributes and new objects of a domain can be discovered
workers. Our approach can be classified in the realm of Open                           as new sources are considered part of the domain.
Information Extraction [31], because it aims at extracting and                             Each source publishes detail pages reporting the values of a
integrating information both at the instance (objects) and at                          subset of domain attributes, for a subset of domain objects. We
the schema (attributes) levels into an internal knowledge base                         use the terms source attributes or source objects when we want to
(IKB) that is created, populated and maintained for every domain.                      denote the version of a domain attribute or object as published
Indeed, if new sources are incrementally added to an already                           by a source, i.e., we are referring to the occurrences of attribute
generated pipeline, the system is able to discover new entities                        values about an object as published by a source. It is worth notic-
and new attributes from the aforementioned sources.                                    ing that some domain attribute can be published, possibly with
                                                                                       inconsistencies amongst the provided values, by several sources,
© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceed-   e.g., Model, while other attributes, e.g., ReviewScore or Price,
ings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus)       have values which are inherently source-specific.
on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0)                                                                  In the following, we identify source objects by means of the
                                                                                       URL of the detail page hosting its data, and we identify source
Figure 2: Running Example — The Smartphones domain includes 2 sources crawled at 𝑛 instants. Over each source 6 correct
extraction rules working on several detail pages are given: 𝑟 𝑖𝑗 (𝑗 = 1, . . . , 6) denotes the 𝑗-th rule working on source 𝑆𝑖 , each
extracting the value of a source attribute from a detail page associated with a source object. For example, 𝑝 31 indicates the
page about IPhone 11 from source 𝑆 1 and rule 𝑟 21 extracts the Model from every page of the same source. At every time 𝑡, the
values extracted from the two sources are conveniently depicted as organized in tables: each row of the table is associated
with a detail page of the source, and each column is associated with an extraction rule around the same source. The set
of domain attributes includes: Model, Brand, Price, Memory, Camera 1, Camera 2. Correct linkages can be represented as pairs
of pages about the same domain objects: {(𝑝 11, 𝑝 12 ), (𝑝𝑚                  1 , 𝑝 2 )}. Correct source attribute matches can be represented as pair of
                                                                                     4
correct extraction rules: {(𝑟 1 , 𝑟 1 ), (𝑟 2 , 𝑟 2 ), (𝑟 4 , 𝑟 3 ), (𝑟 5 , 𝑟 42 ), (𝑟 61, 𝑟 52 )}.
                              1     2       1     2       1     2       1


attributes by means of a unique, within the domain, identifier                    The focus of our research project covers the three problems
of the extraction rule that is capable of locating its value from              that we believe are at the core of any Web data integration
the detail page. By extraction rule we mean a function extracting              pipeline: extraction, matching, and linkage. It does not include,
at most one value from a detail page. It does not matter the                   on one hand, the sources discovery problem, and the automatic
formalism, e.g., XPath expressions, in which it will be specified.             synthesis of crawling programs; on the other hand, it does not
    Our goal is that of continuously extracting data of guaranteed             include the data fusion problem.
level of quality from the detail pages composing to sources. The                  Our solution can help several projects that need to set up and
data are reorganized into an Integrated Knowledge Graph (IKB)                  maintain over time Web data processing pipelines, but require a
while minimizing the overall costs. As a measure of data quality,              guaranteed quality of the pipelines’ output data to be business
we will use standard measures such as precision, recall, and 𝐹 -               meaningful.
measure over integrated data [23]. As a measure of the cost, the                  Clearly, the amount of work outsourced to crowd workers
goal is that of minimizing the crowdsourcing costs [5, 27].                    to guarantee the quality level largely depends on the inherent
    In IKB the following information will be available: (linkages              characteristic of the domain: those containing static attributes
and matches) how the source attributes and objects are respec-                 that are largely redundant from source to source can dramatically
tively mapped to the domain attributes and objects; (values prove-             simplify domain data detection, extraction and schema matching;
nance) the source attribute values for every object in the domain.             an attribute working as a soft identifier across several sources can
    The problem we want to solve is that of continuously creating              contribute significantly to reduce the cost of the record linkage
K 𝑡 , that is an IKB at every time 𝑡 in which the snapshots of the             task for a domain (i.e, books’ ISBN).
detail pages from every source in a domain D are gathered. We                     Unfortunately, it turns out that many interesting domains (e.g.,
illustrate the problem definition by means of a running example                job postings, real estates, . . . ) do not exhibit such redundancy
shown in Figure 2.                                                             and the type of redundancy that the system has to exploit is
                                                                               at an intensional level, i.e., type and format of values, range of
                                                                               values, labels of extracted data. Generally speaking, separating
2    SCOPE, OPPORTUNITIES, CHALLENGES                                          domain data from other information become largely dependent
                                                                               on the context in which the attributes are proposed, and on
Building and maintaining effective data processing pipelines over
                                                                               the availability of human feedback to check the correctness of
Web data is a challenging problem for several reasons. First, Web
                                                                               proposed hypotheses.
sources are autonomous and remote: they can unpredictably
change and therefore break all the extraction rules created on
previous versions of the same source to extract data. Second, the              Redundancy as OpenIE Enabler
set up of an integration pipeline requires to solve many inter-                The redundancy plays a fundamental role in our system to keep
related tasks, each of which has motivated flurry of research                  the crowdsourcing costs at reasonable levels. Whenever redun-
works, including: sources discovery, data extraction, schema match-            dancy of data across sources is properly detected and exploited,
ing, record linkage, data fusion, data labeling, and data cleaning.            domain data can be discerned by other noisy or out-of-domain in-
Each of these problems has been extensively studied over the                   formation. For example, WEIR [4] assumes that linkages between
last decades, with tens, if not hundreds in some cases, of well-               collection of pages from two sources are already known as part
recognized research works [6, 13, 19, 34, 39].                                 of the input, and then it exploits the redundancy of distinct and
independent sources that publish information about the same              adopted ML algorithms but lack the amount and quality of train-
objects and attributes to automatically find correct extraction          ing data, and the validation, needed to guarantee the desired
rules and schema matches.                                                output quality.
   Noah aims at escalating to the largest possible extent the
use of redundancy for extracting and integrating Web data as
pioneered by WEIR. It will exploit at least the following forms
for redundancy:
    Intensional several sources publish the same domain at-
       tributes
    Extensional several sources publish information about the
       same domain objects
    Temporal a source publish data about the same domain
       objects and attributes over time
    Intra-source a source can publish data about the same ob-
       jects in pages of distinct type, e.g., a result page containing    Figure 3: Overview of Noah System & Pipelines created
       snippet of records with most relevant attributes plus link
       to detail pages containing all attributes [21]
At the same time, and with the help of human feedback, Noah              3 NOAH SYSTEM AND PIPELINES
aims at overcoming WEIR’s limitations by relaxing its rather             The Noah system supports the semi-automatic generation of
strict underlying assumptions on the input domain: WEIR re-              end-to-end Web data processing pipelines over several domains.
quires that enough intensional and extensional redundancy is             Figure 3 shows how the system can generate and operate many
available to discern all domain data from all other information.         pipelines at the same time, each having an IKB that is progres-
   WEIR and Noah falls in the realm of the OpenIE approaches [3,         sively and continuously populated with data coming from the
4, 16, 29, 33, 37]: unlike the ClosedIE approaches [18, 20, 25, 28]      sources of the domain on which it operates. Our system will in-
where the managed knowledge base does not grow in terms of               teract with external systems by means of two major components:
subjects and predicates but only in terms of values, new schema          the Crawler, that continuously downloads snapshot of pages from
information, e.g., new domain attributes, can be progressively           every source with a frequency specified by a cron expression;
discovered while populating the knowledge base with entities             and the Crowd Manager, that manages the interactions with a
and values of schema already known.                                      crowdsourcing platform.
   There are two main differences between Noah and other Ope-               During operations, Noah will generate pipeline queries for
nIE [29] systems: first, we do not require a pre-populated Knowl-        the responders engaged by the crowdsourcing platform. The
edge Base, as we start from an empty IKB and we populate it as           responders will contribute to solve the system tasks needed to set
new sources over the domain; second, we aim at continuously              up and maintain new pipelines: for example, tasks are needed to
extracting and integrating data [11], as we believe that the tempo-      select initial extraction rules over every domain source, select and
ral setting is important both for business reasons (many projects        label the source attributes, finding the linkages between source
need continuous stream of data rather than snapshots), and for           objects to a common mediated domain object, and matching the
taking into the main problem definition the maintenance costs of         source attributes across several sources to a mediated domain
the generated pipelines over time, costs that are largely neglected      attribute.
in many research proposals [29].
   Despite many of the problems that need to be tackled to cre-          System Tasks
ate our pipelines have already been extensively covered in the           The main system tasks that need to be tackled to set up a Noah
research literature, we believe that semi-automatizing the cre-          pipeline are shown in Figure 4: Page Linkage, Data Extraction,
ation of Web data processing pipelines can be still considered a         Schema Matching, and Object Linkage.
relevant problem [10].                                                      Page Linkage aims at obtaining a first approximate top-𝑘 page
   We argue that if the costs and the guaranteed level of qual-          linkages. Two pages have a linkage if they both publish data
ity [17] are explicitly considered, many projects relying on data        related to the same domain object.
processing pipelines can be re-conducted into a much more con-
trollable investment and validation process, and their overall               Example 3.1 (Page Linkage). In Figure 2 we can see two possi-
feasibility can be significantly improved because many business          ble page linkages at time 𝑡𝑛 : {(𝑝 11, 𝑝 12 ), (𝑝𝑚
                                                                                                                          1 , 𝑝 2 )}. Their distances,
                                                                                                                                4
projects are strongly and directly affected by the cost of creating      i.e., 0.09 and 0.12, are shown at the top of Figure 5a.
and maintaining the underlying Web data processing pipelines.               Data Extraction aims at finding all the correct extraction rules.
   Moreover, we believe that by posing to human and automatic            It generates all the possible extraction rules and discover the
responders the same type of queries, they become interchange-            correct ones by exploiting the redundancy of published data
able enough to motivate the study of new deployment methodolo-           across several independent sources [4] when available, while
gies for Web data processing pipelines. The goal of such method-         querying the responders [7] to confirm uncertain hypotheses.
ologies is to progressively lowering the crowdsourcing costs by             Schema Matching aims at finding matches between extraction
means of machine-learning techniques while keeping under con-            rules by exploiting an instance-based distance measure between
trol the output quality level since the early stages of the deployed     source objects. The instance-based distance between two extrac-
pipelines. Indeed, many development projects often experience            tion rules assumes the availability of correct object linkages to
unpredictable and erratic time-to-market (TTM) and return-on-            align source objects related to the same domain object, as pro-
investment (ROI) because, especially in the early stages, they           duced in output by the next system task: the distance is obtained
Figure 4: Running example (Pipeline example with queries): tasks provided by system and query generated for hybrid
human-machine responders.


by averaging the distance between extracted values over all the                   with values extracted from two pages of distinct sources, e.g.,
aligned detail pages.                                                             using extraction rules (𝑟 51, 𝑟 42 ). These are two detail pages con-
                                                                                  sidered in a linkage, and "MI 10" is the name associated with the
   Example 3.2 (Schema Matching). Consider source 𝑆 1 and 𝑆 2
                                                                                  corresponding domain object.
at time 𝑡𝑛 and the set of page linkages {(𝑝 11, 𝑝 12 ), (𝑝𝑚       1 , 𝑝 2 )} in
                                                                         4
Figure 2: possible matches are {(𝑟 1 , 𝑟 1 ), (𝑟 4 , 𝑟 3 ), (𝑟 4 , 𝑟 32 )}. The
                                     1   2       1     2       1                     Example 3.5 (Page Linkage Query). A query such as ’Do these
pairwise attribute distances, i.e., 0.19 and 0.22, are shown at                   two pages refer to the same object?’ posed to human responders in
                                                                                  Figure 4 can validate or refute a page linkage (𝑝𝑚1 , 𝑝 2 ). In order
the top of Figure 5c.                                                                                                                     4
                                                                                  for the query to be as simple as possible [35], we can show the
   Object Linkage aims at finding linkages between source objects                 user a screenshot of the original pages.
by exploiting a pairwise attribute distance measure between
source attributes. The pairwise attribute distance between two                       Example 3.6 (Object Linkage Query). Unlike the case of page
source objects assumes the availability of correct schema matches                 linkage tasks above, here the query is posed directly on source
across the extraction rules to align source attributes related to                 objects with extracted values. A query such as ’Do these 2 objects
the same domain attribute, as produced in output by the previous                  refer to the same?’ posed to human responders in Figure 4 can
system task: the distance is obtained by averaging the distance                   validate or refute an object linkage (𝑝 11, 𝑝 12 ). To make the query as
between the two values over all matching attributes.                              simple as possible for an human responder, it is shown together
   We name the linkage/matching loop of system tasks Link-                        with two records whose attributes have been already aligned by
age/Matching Duality; we further discuss it in Section 3.1.                       leveraging the results of a schema matching task.
                                                                                     The tremendous success of crowdsourcing [24] can be partially
Pipeline Queries                                                                  explained by saying that human supervision can represent the
For every system task necessary to set up and maintain a pipeline,                essential final ingredient to unmask those problems really hard to
Noah tries to solve it by using a human-in-the-loop approach [9,                  solve through automatic algorithms but that can be transformed
26]: unsupervised algorithms will generate most-likely hypothe-                   into rather simple questions for human workers. However, it is
sis based on the available redundancy. These hypothesis are later                 well known that in practice, the availability and the accuracy of
confuted or validated by means of queries posed to responders,                    crowd workers, especially of unskilled ones, is strongly depen-
initially only human responders, and later, also by using auto-                   dent on the way the questions are posed and rewarded [35]. One
matic responders based on ML models that have been trained                        of the Noah goal is that of exploiting IKB, which is progressively
with the data collected while operating the Noah pipeline (see                    built, also to make the crowdsourcing queries as simple as pos-
Section 4).                                                                       sible. For example, a query to check a record linkage exploits
   An example of the queries posed to the responders for every                    the schema matching already computed to make the two records
system task is shown in Figure 4: Page Linkage, Data Extraction,                  easy to be visually compared.
Schema Matching and Object Linkage.
                                                                                  3.1    Linkage / Matching Duality
   Example 3.3 (Data Extraction Query). Figure 4 shows an ex-
                                                                                  Figure 4 shows that two important integration tasks operated by
ample of query for Data Extraction tasks. The uncertainty of an
                                                                                  Noah pipelines, i.e., Schema Matching and Object Linkage, are
extraction rule generated by wrapper inference can be validated
                                                                                  part of a loop in which each one assumes the availability of the
by checking the extracted value on a detail page by means of a
                                                                                  output of the other to solve its own task. Page Linkage is the
query such as: "Is ’1050$’ a Price?", where Price is a candidate
                                                                                  system task outside the loop needed for its initial triggering.
label for the extraction rule and ’1050 $’ is the extracted value.
                                                                                     We assume available two normalized distance functions pro-
   Example 3.4 (Schema Matching Query). Figure 4 shows that                       viding a value between 0 and 1 when comparing two rules, and
schema matching tasks can be solved by means of queries con-                      two source objects (records), respectively: the instance-based dis-
firming or refuting a single match: ’Do "108MP" and "20MP"                        tance and the pairwise attribute distance. The former compares
refer to the same attribute of object "MI 10"?’. The template of                  two rules over the values they extract from a set of detail pages
the query to support a schema matching task has been filled up                    which have been previously aligned, i.e., their linkages are fixed.
     (a) Linkages Distances                 (b) Page Linkages over 2 Attributes                (c) Matches Distances           (d) Matches over 2 Linkages

Figure 5: Running example (Distance Similarity): 5a and 5c show distances in Pyramids; 5b and 5d expose relations in
Cartesian Plane where ’Uncertainties’ are due to the breaking of LC with Non-separable Domain


The latter compares two source objects over the values of some                                 with source objects associated with a different domain
of their attributes which have been previously aligned, i.e., their                            object. For computing the pairwise attribute distance, the
matches are fixed.                                                                             source attribute matches are fixed and already known.
    Example 3.7 (Normalized Distance Functions). Instance-based                             For domains in which such properties hold, the WEIR system
distance: let (𝑝𝑚          1 , 𝑝 2 ) and (𝑝 1 , 𝑝 2 ) be two given correct linkages
                                 4           1 1                                        is able to match the extraction rules and build their mappings into
for the detail pages associated with IPhone 12 and MI 10 source                         cluster of source attributes related to the same domain attribute
object from source 𝑆 1 and 𝑆 2 as shown in Figure 5d. The distance                      by comparing all the similarity distances, while at the same time,
between the rules (𝑟 51, 𝑟 42 ) can be computed as follows: 𝑑 (𝑟 51, 𝑟 42 ) =           it can separate the correct extraction rules from noisy ones. The
𝑑 (𝑟 51 (𝑝 11 ), 𝑟 42 (𝑝 12 )) + 𝑑 (𝑟 51 (𝑝𝑚
                                           1 ), 𝑟 2 (𝑝 2 )) = 𝑑 ( ‘108MP’, ‘108MP’) +
                                                  4 4                                   idea is pretty simple and depicted in Figure 5: DS suggests to
𝑑 ( ‘12MP’, ‘14MP’) = 2.9. The normalized distance in the range [0, 1]                  sort the set of all possible matches (pair of extraction rules) by
is 0.27.                                                                                an instance-based distance leveraging the alignment of detail
    Pairwise attribute distance: let (𝑟 22, 𝑟 21 ) and (𝑟 11, 𝑟 12 ) be two given       pages (see Figure 5c). Those pairs are then processed in order of
correct matches for Brand and Model attributes (see Figure 5b).                         increasing distances: every pair of rules are merged in the same
The distance between the two source objects about MI 10 PRO and                         mapping as long as the addition of the rules will not lead to a
MI 10T can be computed as follows: 𝑑 (𝑜 21, 𝑜 22 ) = 𝑑 (𝑟 21 (𝑝 21 ), 𝑟 22 (𝑝 22 ))+    violation of the LC property, i.e., two rules (source attributes)
𝑑 (𝑟 11 (𝑝 21 ), 𝑟 12 (𝑝 22 )) = 𝑑 ( ‘XIAOMI’, ‘XIAOMI’) +𝑑 ( ‘MI 10 PRO’, ‘MI 10T’)    from the same source would end up being present in the same
= 3.2. The normalized distance in the range [0, 1] is 0.27.                             output mapping (see Figure 5d). For certain domains, with suf-
   We revisit and propose an extension of two domain properties,                        ficiently overlapping sources, WEIR can automatically find the
called Local Consistency and Separable Domain, underlying the                           correct extraction rules and their matching with rules over other
formal approach presented in WEIR [4] for solving the extraction                        sources provided that the correct linkages between detail pages
and matching problem when the page linkage is given as input.                           are known.
   Our ambition is twofold: on the one side, we aim to extend                               The dual algorithm will solve the problem of finding correct
that approach to cover the whole trio of extraction, matching                           object linkages provided that correct schema matches between
and linkage problems at the core of Noah pipelines; on the other                        source attributes are given as depicted in Figure 5: DS suggests
hand, we want to relax the underlying assumptions by mean of                            to sort the set of all possible linkages (pair of source objects) by a
the feedback provided by human crowd workers, so making the                             pairwise attribute distance (see Figure 5a). Those pairs are then
approach adaptable to domains with more disparate character-                            processed in order of increasing distances: every pair of source
istics that those originally covered in the WEIR project. Here                          objects are merged in the same linkage as long as the addition
we briefly recall the two properties and sketch how we plan to                          of the objects processed into an existing linkage will not lead
extend them.                                                                            to a violation of the LC property, i.e., two source objects from
     Local Consistency (LC) In a source there cannot be two                             the same source would end up being present in the same output
        distinct source attributes that refer to the same domain                        linkage (see Figure 5b).
        attribute. The dual property that we additionally assume is                         This algorithm exploits the duality of the matching and linkage
        that two distinct detail pages from the same source cannot                      problems, in this setting, and it is at the core of integration engine
        publish data about the same domain object.                                      for the Noah project. However, differently from WEIR, it does not
     Separable Domain (SD) In a mapping composed of several                             halt the integration as soon as a LC violation is detected: rather,
        extraction rules, each from a distinct source, and associ-                      it generates pipeline queries to confirm the choice, and continue
        ated with the same domain attribute, the instance-based                         the processing of all pairs in increasing order of distances, until
        distances between the rules of the mapping are always                           it is below a threshold over which no further matches/linkages
        smaller than distances with rules associated with a differ-                     are expected with meaningful distance functions.
        ent domain attribute. For computing the instance-based                              Unfortunately, as also recognized in WEIR [4], some domains
        distance, the object linkages are fixed and already known.                      have sources and attributes with very similar but semantically
        The dual property that we additionally assume is that in a                      different values (e.g., the resolution of the front/rear cameras in
        linkage composed of several source objects from distinct                        Figure 2). This situation easily lead to violation of the LC and SD
        sources and related to the same domain object, the pair-                        assumptions, and finding the mappings is a challenging problem
        wise attribute distances are always smaller than distances                      for many interesting domains.
    Example 3.8 (Non-separable Domains for Schema Matching). In              new objects or new attributes, additional costs might be incurred
Figure 2, source S1 and 𝑆 2 both have extraction rules ((𝑟 51, 𝑟 61 ),       to support the integration with existing IKB.
and (𝑟 42, 𝑟 52 ), respectively) with a low distance (Figure 5c) because        We are interested to study ML techniques that could decrease
camera resolutions (e.g., 1-front and 2-back) are typically within           crowdsourcing costs even in absence of redundancy. The main
a small range of values expressed in megapixel (MP). In Figure 5d            research area is that of synthesizing automatic responders ca-
it is shown that the pair of rules (𝑟 51, 𝑟 61 ) at distance 0.25 violates   pable of answering the same type of pipeline queries that are
the LC and DS assumptions because their distance is smaller than             normally posed to human responders for solving Noah tasks,
the distance of (𝑟 51, 𝑟 42 ) that is 0.27.                                  with the goal of progressively replacing human responders [7]
                                                                             and scaling the approach up to many thousands of sources.
   Actually, it is well known that the Record Linkage dual prob-
                                                                                Unfortunately, state-of-the-art ML unsupervised techniques [40,
lem, is even much more challenging than the Schema Matching
                                                                             42] can be adapted to provide accurate and reliable answers to
itself: the attributes containing the correct signals for considering
                                                                             those queries only if enough training data have been collected.
two objects equivalent can change from object to object even
                                                                             Indeed, fairness and bias, or simply misuse of machine learning
within the same source (think at smartphones of different brands
                                                                             algorithms, is a well-known problem in literature [12, 32] that
with different policies for naming the models and differentiating
                                                                             affects many development projects, especially in the scenarios
the features of each model). Assuming that every object in the do-
                                                                             which are most commonly found in practice [38]: pre-trained ML
main does not lead to a separability violation is quite unrealistic,
                                                                             models and/or enough training data are not available up-front, so
beside toy cases.
                                                                             that the ML models cannot be properly tuned and exhibit erratic
     Example 3.9 (Non-separable Domains for Object Linkage). In              and unpredictable performance [41].
Figure 5b the linkage (𝑝 11, 𝑝 12 ) is uncertain due to the presence            Snorkel [36] is another project exploiting the idea of leverag-
of 𝑝 22 . The two values (’MI 10’ vs ’MI 10T’) extracted by rule             ing human work to train ML algorithms. However, it is based on
𝑟 12 from pages 𝑝 12 and 𝑝 22 differ by a single letter: the wrong           the idea of engaging skilled workers in every step of the process-
linkage (𝑝 12, 𝑝 22 ) violating the LC property has a pairwise attribute     ing pipeline, while Noah aims at engaging non skilled workers
distance of only 0, 09 which is smaller than the distance of a               to whom can be interchangeably posed queries in the same form
the correct linkage (𝑝 11, 𝑝 12 ), and therefore the domain is not           as those posed to automatic responders. Several other projects
separable.                                                                   such as qodco [2] and SEER [22] have made use of crowdsourc-
                                                                             ing by mainly focusing on the problem of selecting the correct
   We believe that the violations of LC and DS assumptions can               extraction rules, while Noah applies the same query control
be manually fixed and that they help to find the most informative            methodology for all the tasks in the considered pipelines.
pipeline queries that need to be posed to external responders, i.e.,            It is also well known that by using automatic responders not
paid crowd workers, or suitably trained automatic responders.                accurate enough, it might turn out to be more expensive engaging
   By interleaving the dual linkage/matching algorithms in a loop            them than not using them at all, as additional human workers
in which external responders can contribute, as shown in Figure 4,           should be engaged only to offset their wrong answers [7].
each execution can contribute to improve the accuracy of the                    We envision a system in which crowd workers are used for
distance function used by the other task, either by improving the            indirectly controlling the deployment of automatic responders,
linkages used by the instance-based distance, or improving the               and the two types of responders are interchangeably engaged.
matches used by the pairwise attribute distance.                             Crowdsourcing workers contribute to collect domain data that
   Our vision is that with the precious help of crowdsourcing and            are then used to train and evaluate automatic responders, before
a loop of interleaving linkage/matching operations, the desired              fully deploying them. Automatic responders will progressively
target quality can be reached even in presence of non-separable              replace crowd workers to scale the approach and to lower the
domains: responders will be engaged to assess the quality of                 operating costs, but only after enough evidence that their accu-
the output, and to repair the uncertain choices made by the                  racy does not compromise the overall guaranteed output quality
integration algorithm. The linkages and matches confirmed by                 data. At regime, crowd workers will be minimally used only to
human feedback can be frozen and exploited in the following                  keep monitoring the performance of automatic responders.
iterations, somehow progressively solving and hence removing
from the domain the linkages or matches that made the domain                 We have identified several novel research challenges:
inseparable.

4    RESEARCH DIRECTIONS                                                         • formalizing and proving the correctness of an algorithm
                                                                                   that solves the full trio of extraction, matching and linkage
In the early stages of its life, the IKB K of a new Noah pipeline
                                                                                   tasks;
might be scarcely populated. As redundancy builds up over time
                                                                                 • creating and maintaining over time the continuous Web
with the addition of new sources to feed up the IKB, the accuracy
                                                                                   data processing pipelines at low costs, with guaranteed
of the extraction and integration process increases.
                                                                                   output quality;
   The absence of overlapping between objects and attributes
                                                                                 • designing several independent automatic responders based
published by a rather limited set of sources could limit the amount
                                                                                   on ML models that are capable of answering queries nor-
of available redundancy. In this situation, for operating the pipeline,
                                                                                   mally posed to crowd workers;
Noah would end up generating a lot of queries supporting the
                                                                                 • effectively measuring the available redundancy in a do-
system tasks. As an alternative solution, Noah supports the in-
                                                                                   main;
cremental addition of a source into an existing pipeline. A new
                                                                                 • estimating from the characteristics of a domain the crowd-
source might contribute to lower the overall costs if it signif-
                                                                                   sourcing costs necessary to obtain and maintain the de-
icantly overlaps with the sources already available for the do-
                                                                                   sired output quality.
main [14]. On the contrary, to integrate new sources publishing
REFERENCES                                                                             [22] Maeda F Hanafi, Azza Abouzied, Laura Chiticariu, and Yunyao Li. 2017. Syn-
 [1] Tara S Behrend, David J Sharek, Adam W Meade, and Eric N Wiebe. 2011. The              thesizing extraction rules from user examples with seer. In Proceedings of the
     viability of crowdsourcing for survey research. Behavior research methods 43,          2017 ACM International Conference on Management of Data. 1687–1690.
     3 (2011), 800.                                                                    [23] Bernd Heinrich, Marcus Kaiser, and Mathias Klier. 2007. How to measure data
 [2] Moria Bergman, Tova Milo, Slava Novgorodov, and Wang-Chiew Tan. 2015.                  quality? A metric-based approach. (2007).
     Query-oriented data cleaning with oracles. In Proceedings of the 2015 ACM         [24] Jeff Howe. 2006. The rise of crowdsourcing. Wired magazine 14, 6 (2006), 1–4.
     SIGMOD International Conference on Management of Data. 1199–1214.                 [25] Nicholas Kushmerick, Daniel S Weld, and Robert Doorenbos. 1997. Wrapper
 [3] Nikita Bhutani, Yoshihiko Suhara, Wang-Chiew Tan, Alon Halevy, and                     induction for information extraction. University of Washington Washington.
     Hosagrahar Visvesvaraya Jagadish. 2019. Open information extraction from          [26] Guoliang Li. 2017. Human-in-the-Loop Data Integration. Proc. VLDB Endow.
     question-answer pairs. arXiv preprint arXiv:1903.00172 (2019).                         10, 12 (Aug. 2017), 2006–2017. https://doi.org/10.14778/3137765.3137833
 [4] Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, and Paolo Papotti. 2013. Ex-      [27] Guoliang Li, Yudian Zheng, Ju Fan, Jiannan Wang, and Reynold Cheng. 2017.
     traction and integration of partially overlapping web sources. Proceedings of          Crowdsourced Data Management: Overview and Challenges. In Proceedings
     the VLDB Endowment 6, 10 (2013), 805–816.                                              of the 2017 ACM International Conference on Management of Data (SIGMOD
 [5] Valter Crescenzi, Alvaro AA Fernandes, Paolo Merialdo, and Norman W Paton.             ’17). Association for Computing Machinery, New York, NY, USA, 1711–1716.
     2017. Crowdsourcing for data management. Knowledge and Information                     https://doi.org/10.1145/3035918.3054776
                                                                                       [28] Colin Lockard, Xin Luna Dong, Arash Einolghozati, and Prashant Shiralkar.
     Systems 53, 1 (2017), 1–41.
                                                                                            2018. Ceres: Distantly supervised relation extraction from the semi-structured
 [6] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. 2002. RoadRunner:
                                                                                            web. arXiv preprint arXiv:1804.04635 (2018).
     automatic data extraction from data-intensive web sites. In Proceedings of the
                                                                                       [29] Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. Openceres: When
     2002 ACM SIGMOD international conference on Management of data. 624–624.
                                                                                            open information extraction meets the semi-structured web. In Proceedings
 [7] Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. 2019. Hybrid Crowd-
                                                                                            of the 2019 Conference of the North American Chapter of the Association for
     Machine Wrapper Inference. ACM Transactions on Knowledge Discovery from
                                                                                            Computational Linguistics: Human Language Technologies, Volume 1 (Long and
     Data (TKDD) 13, 5 (2019), 1–43.
                                                                                            Short Papers). 3047–3056.
 [8] Nilesh Dalvi, Ashwin Machanavajjhala, and Bo Pang. 2012. An analysis of
                                                                                       [30] Adam Marcus and Aditya Parameswaran. 2015. Crowdsourced data man-
     structured data on the web. arXiv preprint arXiv:1203.6406 (2012).
                                                                                            agement: Industry and academic perspectives. Foundations and Trends in
 [9] AnHai Doan. 2018. Human-in-the-Loop Data Analysis: A Personal Perspec-
                                                                                            Databases 6, 1-2 (2015), 1–161.
     tive. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics
                                                                                       [31] Mausam Mausam. 2016. Open information extraction systems and downstream
     (HILDA’18). Association for Computing Machinery, New York, NY, USA, Arti-
                                                                                            applications. In Proceedings of the twenty-fifth international joint conference on
     cle 1, 6 pages. https://doi.org/10.1145/3209900.3209913
                                                                                            artificial intelligence. 4074–4077.
[10] AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Sanjib Das, Yash Govind,
                                                                                       [32] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and
     Pradap Konda, Han Li, Erik Paulson, Paul Suganthan G. C., and Haojun
                                                                                            Aram Galstyan. 2019. A survey on bias and fairness in machine learning.
     Zhang. 2017. Toward a System Building Agenda for Data Integration.
                                                                                            arXiv preprint arXiv:1908.09635 (2019).
     arXiv:cs.DB/1710.00027
                                                                                       [33] Christina Niklaus, Matthias Cetto, André Freitas, and Siegfried Handschuh.
[11] AnHai Doan, Alon Halevy, and Zachary Ives. 2012. Principles of data integra-
                                                                                            2018. A survey on open information extraction. arXiv preprint arXiv:1806.05599
     tion. Elsevier.
                                                                                            (2018).
[12] Pedro Domingos. 2012. A Few Useful Things to Know about Machine Learning.
                                                                                       [34] Erhard Rahm and Philip Bernstein. 2001. A Survey of Approaches to Automatic
     Commun. ACM 55, 10 (Oct. 2012), 78–87. https://doi.org/10.1145/2347736.
                                                                                            Schema Matching. VLDB J. 10 (12 2001), 334–350. https://doi.org/10.1007/
     2347755
                                                                                            s007780100057
[13] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin
                                                                                       [35] Bahareh Rahmanian and Joseph G. Davis. 2014. User Interface Design for
     Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge
                                                                                            Crowdsourcing Systems. In Proceedings of the 2014 International Working
     vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings
                                                                                            Conference on Advanced Visual Interfaces (AVI ’14). Association for Computing
     of the 20th ACM SIGKDD international conference on Knowledge discovery and
                                                                                            Machinery, New York, NY, USA, 405–408. https://doi.org/10.1145/2598153.
     data mining. 601–610.
                                                                                            2602248
[14] Xin Dong, Barna Saha, and Divesh Srivastava. 2012. Less is more: Selecting
                                                                                       [36] Alexander J Ratner, Stephen H Bach, Henry R Ehrenberg, and Chris Ré. 2017.
     sources wisely for integration. Proceedings of the VLDB Endowment 6, 37–48.
                                                                                            Snorkel: Fast training set generation for information extraction. In Proceedings
[15] Xin Luna Dong and Divesh Srivastava. 2015. Big data integration. Synthesis
                                                                                            of the 2017 ACM international conference on management of data. 1683–1686.
     Lectures on Data Management 7, 1 (2015), 1–198.
                                                                                       [37] Michael Schmitz, Stephen Soderland, Robert Bart, Oren Etzioni, et al. 2012.
[16] Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying rela-
                                                                                            Open language learning for information extraction. In Proceedings of the 2012
     tions for open information extraction. In Proceedings of the 2011 conference on
                                                                                            Joint Conference on Empirical Methods in Natural Language Processing and
     empirical methods in natural language processing. 1535–1545.
                                                                                            Computational Natural Language Learning. 523–534.
[17] Wenfei Fan and Floris Geerts. 2012. Foundations of data quality management.
                                                                                       [38] Zeyuan Shang, Emanuel Zgraggen, Benedetto Buratti, Ferdinand Kossmann,
     Synthesis Lectures on Data Management 4, 5 (2012), 1–217.
                                                                                            Philipp Eichmann, Yeounoh Chung, Carsten Binnig, Eli Upfal, and Tim Kraska.
[18] Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi,
                                                                                            2019. Democratizing Data Science through Interactive Curation of ML
     Christian Schallhart, and Cheng Wang. 2014. DIADEM: thousands of websites
                                                                                            Pipelines. In Proceedings of the 2019 International Conference on Management
     to a single database. Proceedings of the VLDB Endowment 7, 14 (2014), 1845–
                                                                                            of Data (SIGMOD ’19). Association for Computing Machinery, New York, NY,
     1856.
                                                                                            USA, 1171–1188. https://doi.org/10.1145/3299869.3319863
[19] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F Naughton, Narasimhan
                                                                                       [39] Kai-Sheng Teong, Lay-Ki Soon, and Tin Tin Su. 2020. Schema-Agnostic Entity
     Rampalli, Jude Shavlik, and Xiaojin Zhu. 2014. Corleone: hands-off crowdsourc-
                                                                                            Matching using Pre-trained Language Models. In Proceedings of the 29th ACM
     ing for entity matching. In Proceedings of the 2014 ACM SIGMOD international
                                                                                            International Conference on Information & Knowledge Management. 2241–2244.
     conference on Management of data. 601–612.
                                                                                       [40] Sebastian Thrun and Lorien Pratt. 2012. Learning to learn. Springer Science &
[20] Pankaj Gulhane, Amit Madaan, Rupesh Mehta, Jeyashankher Ramamirtham,
                                                                                            Business Media.
     Rajeev Rastogi, Sandeep Satpal, Srinivasan H Sengamedu, Ashwin Tengli, and
                                                                                       [41] Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, and Aditya
     Charu Tiwari. 2011. Web-scale information extraction with vertex. In 2011
                                                                                            Parameswaran. 2018. Accelerating Human-in-the-Loop Machine Learning:
     IEEE 27th International Conference on Data Engineering. IEEE, 1209–1220.
                                                                                            Challenges and Opportunities. In Proceedings of the Second Workshop on Data
[21] Jinsong Guo, Valter Crescenzi, Tim Furche, Giovanni Grasso, and Georg Gott-
                                                                                            Management for End-To-End Machine Learning (DEEM’18). Association for
     lob. 2019. RED: Redundancy-Driven Data Extraction from Result Pages?. In
                                                                                            Computing Machinery, New York, NY, USA, Article 9, 4 pages. https://doi.
     The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May
                                                                                            org/10.1145/3209889.3209897
     13-17, 2019, Ling Liu, Ryen W. White, Amin Mantrach, Fabrizio Silvestri, Ju-
                                                                                       [42] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated
     lian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 605–615.
                                                                                            machine learning: Concept and applications. ACM Transactions on Intelligent
     https://doi.org/10.1145/3308558.3313529
                                                                                            Systems and Technology (TIST) 10, 2 (2019), 1–19.