=Paper=
{{Paper
|id=Vol-2841/PIE+Q_3
|storemode=property
|title=NOAH: Creating Data Integration Pipelines over Continuously Extracted Web Data
|pdfUrl=https://ceur-ws.org/Vol-2841/PIE+Q_3.pdf
|volume=Vol-2841
|authors=Valerio Cetorelli,Valter Crescenzi,Paolo Merialdo,Roger Voyat
|dblpUrl=https://dblp.org/rec/conf/edbt/CetorelliCMV21
}}
==NOAH: Creating Data Integration Pipelines over Continuously Extracted Web Data==
Valerio Cetorelli, Valter Crescenzi, Paolo Merialdo, Roger Voyat
Università Roma Tre
valerio.cetorelli@uniroma3.it, valter.crescenzi@uniroma3.it, paolo.merialdo@uniroma3.it, roger.voyat@uniroma3.it
ABSTRACT
We present Noah, an ongoing research project aiming at developing a system for semi-automatically creating end-to-end Web data processing pipelines. The pipelines continuously extract and integrate information from multiple sites by leveraging the redundancy of the data published on the Web. The system is based on a novel hybrid human-machine learning approach in which the same type of questions can be interchangeably posed both to human crowd workers and to automatic responders based on machine learning (ML) models. Since the early stages of the pipelines, crowd workers are engaged to guarantee the output data quality and to collect training data, which are then used to progressively train and evaluate automatic responders. The latter are later fully deployed into the data processing pipelines to scale the approach and to contain the crowdsourcing costs. The combination of guaranteed quality and progressive cost reduction in the pipelines generated by our system can improve the investments and development processes of the many applications that build on the availability of such data processing pipelines.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION AND MOTIVATION
The Web is the largest knowledge base ever built by humans. However, most of the data on the Web are not directly available to applications unless complex data extraction and integration pipelines are set up. Creating these pipelines to build structured knowledge bases, and continuously maintaining them in a cost-effective way, is still a challenging problem. Currently, most projects fulfill their data processing needs by means of case-by-case solutions that cannot be reused across projects.

This paper presents Noah, a research project that aims at developing a system for creating and maintaining over time end-to-end data processing pipelines for continuously extracting and integrating Web data. Noah is based on a hybrid human-machine learning approach whose goal is to guarantee the quality of the processed data by leveraging the feedback provided by human crowd workers. Our approach can be classified in the realm of Open Information Extraction [31], because it aims at extracting and integrating information both at the instance (objects) and at the schema (attributes) level into an internal knowledge base (IKB) that is created, populated and maintained for every domain. Indeed, if new sources are incrementally added to an already generated pipeline, the system is able to discover new entities and new attributes from those sources.

In order to contain the crowdsourcing costs, the proposed approach leverages two techniques. First, it exploits the inherent redundancy of Web sources to automatically find correct domain information: data published by several independent sources are more likely to be correct and can be easily discerned from noisy or non-relevant data [8, 15]. Second, it exploits the collected data to continuously train ML models. Those ML models are progressively introduced in the form of automatic responders that replace crowd workers [1, 30], and are continuously evaluated during each step of the data processing pipelines: only responders that become sufficiently reliable are fully deployed in the operations of the created pipelines.

Figure 1: Web detail pages in the Smartphone domain.

Problem Description. Given a set of sources S = {s_1, s_2, . . .} from the same domain (e.g., Smartphones), each source s_i is specified by means of n_i URLs of detail pages about domain objects (e.g., IPhone 12, Mi 10T). By detail page we mean a page reporting information about a particular object, the topic entity [29] of the page, on which it publishes the values of several attributes. An example of detail pages from two sources about the same IPhone 12 domain object is shown in Figure 1, where the values of several attributes of interest, such as Model, Memory, and Price, are highlighted.

A domain includes a set of objects O = {o_1, o_2, . . .} and a set of attributes A = {A_1, A_2, . . .}, which will be populated with data extracted from the pages of the sources belonging to that domain. New attributes and new objects of a domain can be discovered as new sources are considered part of the domain.

Each source publishes detail pages reporting the values of a subset of the domain attributes, for a subset of the domain objects. We use the terms source attributes or source objects when we want to denote the version of a domain attribute or object as published by a source, i.e., when we are referring to the occurrences of attribute values about an object as published by a source. It is worth noticing that some domain attributes can be published, possibly with inconsistencies amongst the provided values, by several sources, e.g., Model, while other attributes, e.g., ReviewScore or Price, have values which are inherently source-specific.

In the following, we identify source objects by means of the URL of the detail page hosting their data, and we identify source
Figure 2: Running Example – The Smartphones domain includes 2 sources crawled at n instants. Over each source, 6 correct extraction rules working on several detail pages are given: r_i^j (i = 1, . . . , 6) denotes the i-th rule working on source s_j, each extracting the value of a source attribute from a detail page associated with a source object. For example, p_3^1 indicates the page about IPhone 11 from source s_1, and rule r_2^1 extracts the Model from every page of the same source. At every time t, the values extracted from the two sources are conveniently depicted as organized in tables: each row of the table is associated with a detail page of the source, and each column is associated with an extraction rule over the same source. The set of domain attributes includes: Model, Brand, Price, Memory, Camera 1, Camera 2. Correct linkages can be represented as pairs of pages about the same domain objects: {(p_1^1, p_1^2), (p_4^1, p_4^2)}. Correct source attribute matches can be represented as pairs of correct extraction rules: {(r_1^1, r_1^2), (r_2^1, r_2^2), (r_4^1, r_3^2), (r_5^1, r_4^2), (r_6^1, r_5^2)}.
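The extraction rules r_i^j of the running example are, in practice, page-to-value functions. A minimal sketch of this notion, with hypothetical page markup and paths (Python's xml.etree standing in for full XPath; none of these snippets come from the example's actual sources):

```python
# A sketch of an "extraction rule": a function extracting at most one value
# from a detail page. Pages and paths below are hypothetical.
import xml.etree.ElementTree as ET

def make_rule(path):
    """Build an extraction rule from a path expression: page -> value or None."""
    def rule(page_xml):
        root = ET.fromstring(page_xml)
        node = root.find(path)           # at most one node: first match or None
        return node.text if node is not None else None
    return rule

# Two detail pages of a hypothetical source s1 (one per source object).
p1 = "<page><h1>MI 10</h1><span class='price'>499 $</span></page>"
p2 = "<page><h1>IPhone 12</h1><span class='price'>839 $</span></page>"

r1 = make_rule(".//h1")                       # extracts the Model attribute
r2 = make_rule(".//span[@class='price']")     # extracts the Price attribute

# Applying every rule to every page of the source yields the tables depicted
# in Figure 2: rows = detail pages, columns = extraction rules.
table = [[r(p) for r in (r1, r2)] for p in (p1, p2)]
print(table)  # [['MI 10', '499 $'], ['IPhone 12', '839 $']]
```

A rule returning None on a page simply means that the source attribute is not published there.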
attributes by means of a unique, within-the-domain identifier of the extraction rule that is capable of locating their value in the detail page. By extraction rule we mean a function extracting at most one value from a detail page. The formalism in which it is specified, e.g., XPath expressions, does not matter.

Our goal is that of continuously extracting data with a guaranteed level of quality from the detail pages composing the sources. The data are reorganized into an Integrated Knowledge Base (IKB) while minimizing the overall costs. As a measure of data quality, we will use standard measures such as precision, recall, and F-measure over the integrated data [23]. As a measure of cost, the goal is that of minimizing the crowdsourcing costs [5, 27].

In the IKB the following information will be available: (linkages and matches) how the source attributes and objects are respectively mapped to the domain attributes and objects; (value provenance) the source attribute values for every object in the domain.

The problem we want to solve is that of continuously creating K_t, that is, an IKB at every time t in which the snapshots of the detail pages from every source in a domain D are gathered. We illustrate the problem definition by means of the running example shown in Figure 2.

2 SCOPE, OPPORTUNITIES, CHALLENGES
Building and maintaining effective data processing pipelines over Web data is a challenging problem for several reasons. First, Web sources are autonomous and remote: they can unpredictably change, and thereby break all the extraction rules created on previous versions of the same source. Second, the set up of an integration pipeline requires solving many inter-related tasks, each of which has motivated a flurry of research works, including: source discovery, data extraction, schema matching, record linkage, data fusion, data labeling, and data cleaning. Each of these problems has been extensively studied over the last decades, with tens, if not hundreds in some cases, of well-recognized research works [6, 13, 19, 34, 39].

The focus of our research project covers the three problems that we believe are at the core of any Web data integration pipeline: extraction, matching, and linkage. It does not include, on the one hand, the source discovery problem and the automatic synthesis of crawling programs; on the other hand, it does not include the data fusion problem.

Our solution can help the several projects that need to set up and maintain over time Web data processing pipelines, but require a guaranteed quality of the pipelines' output data to be business meaningful.

Clearly, the amount of work outsourced to crowd workers to guarantee the quality level largely depends on the inherent characteristics of the domain: domains containing static attributes that are largely redundant from source to source can dramatically simplify domain data detection, extraction and schema matching; an attribute working as a soft identifier across several sources can contribute significantly to reduce the cost of the record linkage task for a domain (e.g., books' ISBN).

Unfortunately, it turns out that many interesting domains (e.g., job postings, real estate, . . . ) do not exhibit such redundancy, and the type of redundancy that the system has to exploit is at an intensional level, i.e., the type and format of the values, the range of the values, the labels of the extracted data. Generally speaking, separating domain data from other information becomes largely dependent on the context in which the attributes are proposed, and on the availability of human feedback to check the correctness of the proposed hypotheses.

Redundancy as OpenIE Enabler
Redundancy plays a fundamental role in our system to keep the crowdsourcing costs at reasonable levels. Whenever the redundancy of data across sources is properly detected and exploited, domain data can be discerned from other noisy or out-of-domain information. For example, WEIR [4] assumes that linkages between the collections of pages from two sources are already known as part of the input, and then it exploits the redundancy of distinct and
independent sources that publish information about the same objects and attributes to automatically find correct extraction rules and schema matches.

Noah aims at escalating to the largest possible extent the use of redundancy for extracting and integrating Web data, as pioneered by WEIR. It will exploit at least the following forms of redundancy:

Intensional: several sources publish the same domain attributes.
Extensional: several sources publish information about the same domain objects.
Temporal: a source publishes data about the same domain objects and attributes over time.
Intra-source: a source can publish data about the same objects in pages of distinct types, e.g., a result page containing snippets of records with the most relevant attributes, plus links to detail pages containing all the attributes [21].

At the same time, and with the help of human feedback, Noah aims at overcoming WEIR's limitations by relaxing its rather strict underlying assumptions on the input domain: WEIR requires that enough intensional and extensional redundancy is available to discern all domain data from all other information.

WEIR and Noah fall in the realm of the OpenIE approaches [3, 4, 16, 29, 33, 37]: unlike the ClosedIE approaches [18, 20, 25, 28], where the managed knowledge base does not grow in terms of subjects and predicates but only in terms of values, new schema information, e.g., new domain attributes, can be progressively discovered while populating the knowledge base with entities and values of the already known schema.

There are two main differences between Noah and other OpenIE [29] systems: first, we do not require a pre-populated knowledge base, as we start from an empty IKB and populate it as new sources of the domain are added; second, we aim at continuously extracting and integrating data [11], as we believe that the temporal setting is important both for business reasons (many projects need a continuous stream of data rather than snapshots) and for bringing into the main problem definition the maintenance costs of the generated pipelines over time, costs that are largely neglected in many research proposals [29].

Although many of the problems that need to be tackled to create our pipelines have already been extensively covered in the research literature, we believe that semi-automatizing the creation of Web data processing pipelines can still be considered a relevant problem [10].

We argue that if the costs and the guaranteed level of quality [17] are explicitly considered, many projects relying on data processing pipelines can be re-conducted into a much more controllable investment and validation process, and their overall feasibility can be significantly improved, because many business projects are strongly and directly affected by the cost of creating and maintaining the underlying Web data processing pipelines.

Moreover, we believe that by posing the same type of queries to human and automatic responders, they become interchangeable enough to motivate the study of new deployment methodologies for Web data processing pipelines. The goal of such methodologies is to progressively lower the crowdsourcing costs by means of machine-learning techniques, while keeping the output quality level under control since the early stages of the deployed pipelines. Indeed, many development projects often experience unpredictable and erratic time-to-market (TTM) and return-on-investment (ROI) because, especially in the early stages, they adopt ML algorithms but lack the amount and quality of training data, and the validation, needed to guarantee the desired output quality.

Figure 3: Overview of Noah System & Pipelines created

3 NOAH SYSTEM AND PIPELINES
The Noah system supports the semi-automatic generation of end-to-end Web data processing pipelines over several domains. Figure 3 shows how the system can generate and operate many pipelines at the same time, each having an IKB that is progressively and continuously populated with data coming from the sources of the domain on which it operates. Our system will interact with external systems by means of two major components: the Crawler, which continuously downloads snapshots of the pages of every source with a frequency specified by a cron expression; and the Crowd Manager, which manages the interactions with a crowdsourcing platform.

During operations, Noah will generate pipeline queries for the responders engaged through the crowdsourcing platform. The responders will contribute to solving the system tasks needed to set up and maintain new pipelines: for example, tasks are needed to select the initial extraction rules over every domain source, to select and label the source attributes, to find the linkages from source objects to a common mediated domain object, and to match the source attributes across several sources to a mediated domain attribute.

System Tasks
The main system tasks that need to be tackled to set up a Noah pipeline are shown in Figure 4: Page Linkage, Data Extraction, Schema Matching, and Object Linkage.

Page Linkage aims at obtaining a first approximate set of top-k page linkages. Two pages have a linkage if they both publish data related to the same domain object.

Example 3.1 (Page Linkage). In Figure 2 we can see two possible page linkages at time t_n: {(p_1^1, p_1^2), (p_4^1, p_4^2)}. Their distances, i.e., 0.09 and 0.12, are shown at the top of Figure 5a.

Data Extraction aims at finding all the correct extraction rules. It generates all the possible extraction rules and discovers the correct ones by exploiting the redundancy of the data published across several independent sources [4] when available, while querying the responders [7] to confirm uncertain hypotheses.

Schema Matching aims at finding matches between extraction rules by exploiting an instance-based distance measure between source objects. The instance-based distance between two extraction rules assumes the availability of correct object linkages to align source objects related to the same domain object, as produced in output by the next system task: the distance is obtained
Figure 4: Running example (pipeline with queries): tasks provided by the system and the queries generated for the hybrid human-machine responders.
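The queries of Figure 4 are designed so that either a crowd worker or an automatic responder can answer them. A possible sketch of this routing, with hypothetical class names and a confidence threshold of our own choosing (not the paper's actual interface):

```python
# Sketch of interchangeable human/automatic responders for pipeline queries.
# All names, fields, and the thresholding policy are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Query:
    kind: str      # "page_linkage", "data_extraction", "schema_matching", "object_linkage"
    text: str      # e.g. "Is '1050 $' a Price?"
    payload: dict  # values or screenshots shown to the responder

class AutomaticResponder:
    """Wraps a trained ML model; fully deployed only once judged reliable enough."""
    def __init__(self, model: Callable[[Query], float], threshold: float = 0.9):
        self.model, self.threshold = model, threshold

    def answer(self, q: Query) -> Optional[bool]:
        score = self.model(q)              # model confidence that the answer is "yes"
        if score >= self.threshold:
            return True
        if score <= 1 - self.threshold:
            return False
        return None                        # abstain: escalate to the crowd

def route(q: Query, automatic: AutomaticResponder,
          crowd: Callable[[Query], bool]) -> bool:
    ans = automatic.answer(q)
    return ans if ans is not None else crowd(q)   # fall back to a human worker

# Toy model that is confident only on data-extraction queries.
toy_model = lambda q: 0.95 if q.kind == "data_extraction" else 0.5
resp = AutomaticResponder(toy_model)
q = Query("data_extraction", "Is '1050 $' a Price?", {})
print(route(q, resp, crowd=lambda q: False))  # True (answered by the model)
```

Because both responder types consume the same Query object, the system can swap one for the other as the models become reliable.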
by averaging the distance between the extracted values over all the aligned detail pages.

Example 3.2 (Schema Matching). Consider sources s_1 and s_2 at time t_n and the set of page linkages {(p_1^1, p_1^2), (p_4^1, p_4^2)} in Figure 2: possible matches are {(r_1^1, r_1^2), (r_4^1, r_3^2), (r_5^1, r_4^2)}. The pairwise attribute distances, i.e., 0.19 and 0.22, are shown at the top of Figure 5c.

Object Linkage aims at finding linkages between source objects by exploiting a pairwise attribute distance measure between source attributes. The pairwise attribute distance between two source objects assumes the availability of correct schema matches across the extraction rules to align source attributes related to the same domain attribute, as produced in output by the previous system task: the distance is obtained by averaging the distance between the two values over all the matching attributes.

We name the linkage/matching loop of system tasks the Linkage/Matching Duality; we further discuss it in Section 3.1.

Pipeline Queries
For every system task necessary to set up and maintain a pipeline, Noah tries to solve it by using a human-in-the-loop approach [9, 26]: unsupervised algorithms generate the most likely hypotheses based on the available redundancy. These hypotheses are later refuted or validated by means of queries posed to responders: initially only human responders, and later also automatic responders based on ML models that have been trained with the data collected while operating the Noah pipeline (see Section 4).

An example of the queries posed to the responders for every system task is shown in Figure 4: Page Linkage, Data Extraction, Schema Matching and Object Linkage.

Example 3.3 (Data Extraction Query). Figure 4 shows an example of a query for Data Extraction tasks. The uncertainty of an extraction rule generated by wrapper inference can be validated by checking the extracted value on a detail page by means of a query such as: "Is '1050 $' a Price?", where Price is a candidate label for the extraction rule and '1050 $' is the extracted value.

Example 3.4 (Schema Matching Query). Figure 4 shows that schema matching tasks can be solved by means of queries confirming or refuting a single match: "Do '108MP' and '20MP' refer to the same attribute of object 'MI 10'?". The template of the query to support a schema matching task has been filled with values extracted from two pages of distinct sources, e.g., using the extraction rules (r_5^1, r_4^2). These are two detail pages considered in a linkage, and "MI 10" is the name associated with the corresponding domain object.

Example 3.5 (Page Linkage Query). A query such as "Do these two pages refer to the same object?" posed to human responders in Figure 4 can validate or refute a page linkage (p_4^1, p_4^2). In order for the query to be as simple as possible [35], we can show the user a screenshot of the original pages.

Example 3.6 (Object Linkage Query). Unlike the case of the page linkage tasks above, here the query is posed directly on source objects with extracted values. A query such as "Do these 2 objects refer to the same?" posed to human responders in Figure 4 can validate or refute an object linkage (p_1^1, p_1^2). To make the query as simple as possible for a human responder, it is shown together with two records whose attributes have already been aligned by leveraging the results of a schema matching task.

The tremendous success of crowdsourcing [24] can be partially explained by saying that human supervision can represent the essential final ingredient to unmask those problems that are really hard to solve through automatic algorithms, but that can be transformed into rather simple questions for human workers. However, it is well known that in practice the availability and the accuracy of crowd workers, especially of unskilled ones, strongly depend on the way the questions are posed and rewarded [35]. One of Noah's goals is that of exploiting the progressively built IKB also to make the crowdsourcing queries as simple as possible. For example, a query to check a record linkage exploits the schema matching already computed to make the two records easy to compare visually.

3.1 Linkage / Matching Duality
Figure 4 shows that two important integration tasks operated by Noah pipelines, i.e., Schema Matching and Object Linkage, are part of a loop in which each one assumes the availability of the output of the other to solve its own task. Page Linkage is the system task outside the loop, needed for its initial triggering.

We assume available two normalized distance functions providing a value between 0 and 1 when comparing two rules and two source objects (records), respectively: the instance-based distance and the pairwise attribute distance. The former compares two rules over the values they extract from a set of detail pages which have been previously aligned, i.e., their linkages are fixed.
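Both normalized distances reduce to averaging a value-level distance over previously aligned pairs. A sketch, assuming a normalized edit-style distance between values (the paper does not commit to a specific value-level metric, so this choice is illustrative):

```python
# Sketch of the two normalized distance functions, using a normalized string
# distance as the assumed value-level metric. Names and normalization are ours.
from difflib import SequenceMatcher

def value_dist(a: str, b: str) -> float:
    """Normalized string distance in [0, 1]."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def instance_based_distance(rule1, rule2, linkages):
    """Compare two extraction rules over pages aligned by fixed linkages."""
    dists = [value_dist(rule1(p1), rule2(p2)) for p1, p2 in linkages]
    return sum(dists) / len(dists)

def pairwise_attribute_distance(obj1, obj2, matches):
    """Compare two source objects (records) over attributes aligned by fixed matches."""
    dists = [value_dist(obj1[a1], obj2[a2]) for a1, a2 in matches]
    return sum(dists) / len(dists)

# Usage with toy records: the fixed matches decide which attributes are compared.
o1 = {"brand": "XIAOMI", "model": "MI 10 PRO"}
o2 = {"brand": "XIAOMI", "model": "MI 10T"}
d = pairwise_attribute_distance(o1, o2, [("brand", "brand"), ("model", "model")])
print(round(d, 2))  # small value: the two records are close, but not identical
```

Note the duality: each function takes the other task's output (linkages or matches) as a fixed input.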
Figure 5: Running example (distance similarity): (a) linkage distances; (b) page linkages over 2 attributes; (c) match distances; (d) matches over 2 linkages. Subfigures 5a and 5c show distances as pyramids; 5b and 5d expose the relations in the Cartesian plane, where the "uncertainties" are due to the breaking of LC in a non-separable domain.
The latter compares two source objects over the values of some of their attributes which have been previously aligned, i.e., their matches are fixed.

Example 3.7 (Normalized Distance Functions). Instance-based distance: let (p_4^1, p_4^2) and (p_1^1, p_1^2) be two given correct linkages for the detail pages associated with the IPhone 12 and MI 10 source objects from sources s_1 and s_2, as shown in Figure 5d. The distance between the rules (r_5^1, r_4^2) can be computed as follows: d(r_5^1, r_4^2) = d(r_5^1(p_1^1), r_4^2(p_1^2)) + d(r_5^1(p_4^1), r_4^2(p_4^2)) = d("108MP", "108MP") + d("12MP", "14MP") = 2.9. The normalized distance in the range [0, 1] is 0.27.
Pairwise attribute distance: let (r_2^2, r_2^1) and (r_1^1, r_1^2) be two given correct matches for the Brand and Model attributes (see Figure 5b). The distance between the two source objects about MI 10 PRO and MI 10T can be computed as follows: d(p_2^1, p_2^2) = d(r_2^1(p_2^1), r_2^2(p_2^2)) + d(r_1^1(p_2^1), r_1^2(p_2^2)) = d("XIAOMI", "XIAOMI") + d("MI 10 PRO", "MI 10T") = 3.2. The normalized distance in the range [0, 1] is 0.27.

We revisit and propose an extension of two domain properties, called Local Consistency and Separable Domain, underlying the formal approach presented in WEIR [4] for solving the extraction and matching problems when the page linkage is given as input. Our ambition is twofold: on the one hand, we aim to extend that approach to cover the whole trio of extraction, matching and linkage problems at the core of Noah pipelines; on the other hand, we want to relax the underlying assumptions by means of the feedback provided by human crowd workers, so making the approach adaptable to domains with more disparate characteristics than those originally covered in the WEIR project. Here we briefly recall the two properties and sketch how we plan to extend them.

Local Consistency (LC): In a source there cannot be two distinct source attributes that refer to the same domain attribute. The dual property that we additionally assume is that two distinct detail pages from the same source cannot publish data about the same domain object.

Separable Domain (SD): In a mapping composed of several extraction rules, each from a distinct source and associated with the same domain attribute, the instance-based distances between the rules of the mapping are always smaller than the distances with rules associated with a different domain attribute. For computing the instance-based distance, the object linkages are fixed and already known. The dual property that we additionally assume is that in a linkage composed of several source objects from distinct sources and related to the same domain object, the pairwise attribute distances are always smaller than the distances with source objects associated with a different domain object. For computing the pairwise attribute distance, the source attribute matches are fixed and already known.

For domains in which such properties hold, the WEIR system is able to match the extraction rules and build their mappings into clusters of source attributes related to the same domain attribute by comparing all the similarity distances while, at the same time, separating the correct extraction rules from the noisy ones. The idea is pretty simple and depicted in Figure 5: SD suggests to sort the set of all possible matches (pairs of extraction rules) by the instance-based distance, leveraging the alignment of detail pages (see Figure 5c). Those pairs are then processed in order of increasing distance: every pair of rules is merged into the same mapping as long as the addition of the rules does not lead to a violation of the LC property, i.e., two rules (source attributes) from the same source ending up in the same output mapping (see Figure 5d). For certain domains, with sufficiently overlapping sources, WEIR can automatically find the correct extraction rules and their matching with the rules of other sources, provided that the correct linkages between detail pages are known.

The dual algorithm solves the problem of finding correct object linkages, provided that correct schema matches between source attributes are given, as depicted in Figure 5: SD suggests to sort the set of all possible linkages (pairs of source objects) by the pairwise attribute distance (see Figure 5a). Those pairs are then processed in order of increasing distance: every pair of source objects is merged into the same linkage as long as the addition of the processed objects to an existing linkage does not lead to a violation of the LC property, i.e., two source objects from the same source ending up in the same output linkage (see Figure 5b).

This algorithm exploits the duality of the matching and linkage problems in this setting, and it is at the core of the integration engine of the Noah project. However, differently from WEIR, it does not halt the integration as soon as an LC violation is detected: rather, it generates pipeline queries to confirm the choice, and continues the processing of all pairs in increasing order of distance, up to a threshold over which no further matches/linkages are expected with meaningful distance functions.

Unfortunately, as also recognized in WEIR [4], some domains have sources and attributes with very similar but semantically different values (e.g., the resolutions of the front/rear cameras in Figure 2). This situation easily leads to violations of the LC and SD assumptions, and finding the mappings is a challenging problem for many interesting domains.
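The greedy merge loop shared by the two dual algorithms, including the query-on-violation behavior that distinguishes Noah from WEIR, can be sketched as follows; the clustering representation and all names are ours, not the actual implementation:

```python
# Sketch of the greedy, LC-checked merge loop: candidate pairs are processed by
# increasing distance and merged into the same cluster (mapping or linkage),
# unless the merge would put two items from the same source into one cluster;
# such LC violations are turned into pipeline queries instead of halting.
def merge_by_distance(pairs, threshold):
    """pairs: list of (distance, item_a, item_b); items are (source, id) tuples."""
    cluster_of = {}                      # item -> cluster (a shared set of items)
    uncertain = []                       # LC violations to pose as pipeline queries

    def cluster(x):
        return cluster_of.setdefault(x, {x})

    for dist, a, b in sorted(pairs):
        if dist > threshold:             # no meaningful merges beyond this point
            break
        ca, cb = cluster(a), cluster(b)
        if ca is cb:
            continue
        if {src for src, _ in ca} & {src for src, _ in cb}:
            uncertain.append((dist, a, b))   # LC violation: ask a responder
            continue
        ca |= cb                         # merge the two clusters in place
        for item in cb:
            cluster_of[item] = ca
    seen, clusters = set(), []
    for c in cluster_of.values():        # deduplicate the shared cluster sets
        if id(c) not in seen:
            seen.add(id(c))
            clusters.append(c)
    return clusters, uncertain

# Toy run on the schema-matching side: rules labelled (source, rule_id).
pairs = [
    (0.19, ("s1", "r1"), ("s2", "r1")),
    (0.25, ("s1", "r5"), ("s1", "r6")),   # same source: triggers a query
    (0.27, ("s1", "r5"), ("s2", "r4")),
]
clusters, queries = merge_by_distance(pairs, threshold=0.5)
print(len(clusters), queries)
```

Running the same loop with source objects instead of rules, and the pairwise attribute distance instead of the instance-based one, gives the dual linkage algorithm.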
Example 3.8 (Non-separable Domains for Schema Matching). In new objects or new attributes, additional costs might be incurred
Figure 2, source S1 and π 2 both have extraction rules ((π 51, π 61 ), to support the integration with existing IKB.
and (π 42, π 52 ), respectively) with a low distance (Figure 5c) because We are interested to study ML techniques that could decrease
camera resolutions (e.g., 1-front and 2-back) are typically within crowdsourcing costs even in absence of redundancy. The main
a small range of values expressed in megapixel (MP). In Figure 5d research area is that of synthesizing automatic responders ca-
it is shown that the pair of rules (π 51, π 61 ) at distance 0.25 violates pable of answering the same type of pipeline queries that are
the LC and DS assumptions because their distance is smaller than normally posed to human responders for solving Noah tasks,
the distance of (π 51, π 42 ) that is 0.27. with the goal of progressively replacing human responders [7]
and scaling the approach up to many thousands of sources.
Actually, it is well known that the Record Linkage dual prob-
Unfortunately, state-of-the-art ML unsupervised techniques [40,
lem, is even much more challenging than the Schema Matching
42] can be adapted to provide accurate and reliable answers to
itself: the attributes containing the correct signals for considering
those queries only if enough training data have been collected.
two objects equivalent can change from object to object even
Indeed, fairness and bias, or simply misuse of machine learning
within the same source (think at smartphones of different brands
with different policies for naming the models and differentiating the features of each model). Assuming that no object in the domain leads to a separability violation is quite unrealistic, besides toy cases.

Example 3.9 (Non-separable Domains for Object Linkage). In Figure 5b the linkage (p11, p12) is uncertain due to the presence of p22. The two values ("MI 10" vs. "MI 10T") extracted by rule r12 from pages p12 and p22 differ by a single letter: the wrong linkage (p12, p22), which violates the LC property, has a pairwise attribute distance of only 0.09, smaller than the distance of the correct linkage (p11, p12), and therefore the domain is not separable.

We believe that violations of the LC and DS assumptions can be manually fixed, and that they help to identify the most informative pipeline queries to pose to external responders, i.e., paid crowd workers or suitably trained automatic responders. By interleaving the dual linkage/matching algorithms in a loop in which external responders can contribute, as shown in Figure 4, each execution can improve the accuracy of the distance function used by the other task, either by improving the linkages used by the instance-based distance or by improving the matches used by the pairwise attribute distance.

Our vision is that, with the help of crowdsourcing and a loop of interleaved linkage/matching operations, the desired target quality can be reached even in the presence of non-separable domains: responders are engaged to assess the quality of the output and to repair the uncertain choices made by the integration algorithm. The linkages and matches confirmed by human feedback can be frozen and exploited in the following iterations, progressively resolving, and hence removing from the domain, the linkages or matches that made it inseparable.

4 RESEARCH DIRECTIONS

In the early stages of its life, the IKB K of a new Noah pipeline might be scarcely populated. As redundancy builds up over time with the addition of new sources feeding the IKB, the accuracy of the extraction and integration process increases.

The absence of overlap between the objects and attributes published by a rather limited set of sources could limit the amount of available redundancy. In this situation, to operate the pipeline, Noah would end up generating many queries to support the system tasks. As an alternative, Noah supports the incremental addition of a source into an existing pipeline. A new source might contribute to lowering the overall costs if it significantly overlaps with the sources already available for the domain [14]. On the contrary, to integrate new sources publishing

algorithms, is a well-known problem in the literature [12, 32] that affects many development projects, especially in the scenarios most commonly found in practice [38]: pre-trained ML models and/or enough training data are not available up-front, so the ML models cannot be properly tuned and exhibit erratic and unpredictable performance [41].

Snorkel [36] is another project exploiting the idea of leveraging human work to train ML algorithms. However, it is based on the idea of engaging skilled workers in every step of the processing pipeline, while Noah aims at engaging non-skilled workers, to whom queries can be posed in the same form as those posed to automatic responders. Several other projects, such as QOCO [2] and SEER [22], have used crowdsourcing mainly to select the correct extraction rules, while Noah applies the same query control methodology to all the tasks in the considered pipelines.

It is also well known that engaging automatic responders that are not accurate enough may turn out to be more expensive than not using them at all, as additional human workers must be engaged just to offset their wrong answers [7].

We envision a system in which crowd workers indirectly control the deployment of automatic responders, and the two types of responders are interchangeably engaged. Crowd workers contribute to collecting domain data that are then used to train and evaluate automatic responders before fully deploying them. Automatic responders will progressively replace crowd workers to scale the approach and lower the operating costs, but only after enough evidence that their accuracy does not compromise the guaranteed quality of the output data. At regime, crowd workers will be engaged minimally, only to keep monitoring the performance of the automatic responders.

We have identified several novel research challenges:
• formalizing and proving the correctness of an algorithm that solves the full trio of extraction, matching, and linkage tasks;
• creating and maintaining over time continuous Web data processing pipelines at low cost, with guaranteed output quality;
• designing several independent automatic responders, based on ML models, that are capable of answering queries normally posed to crowd workers;
• effectively measuring the available redundancy in a domain;
• estimating, from the characteristics of a domain, the crowdsourcing costs necessary to obtain and maintain the desired output quality.
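As a concrete illustration of this repair step, the sketch below poses uncertain linkages and matches to an external responder (a crowd worker or a trained ML model), freezes the confirmed pairs for reuse, and removes the rejected ones; all names and the responder interface are placeholders, not Noah's actual API.

```python
# Hypothetical sketch (placeholder names, not Noah's actual API) of the
# repair step in the Figure 4 loop: confirmed pairs are frozen for the
# next linkage/matching round, rejected pairs are removed from the domain.

def repair(uncertain_decisions, responder):
    """uncertain_decisions: iterable of (kind, pair), kind in {'linkage', 'match'}.
    responder: callable (kind, pair) -> bool, True when the pair is confirmed."""
    frozen = {"linkage": set(), "match": set()}
    removed = {"linkage": set(), "match": set()}
    for kind, pair in uncertain_decisions:
        if responder(kind, pair):
            frozen[kind].add(pair)    # reused by later iterations of the loop
        else:
            removed[kind].add(pair)   # a pair that made the domain inseparable
    return frozen, removed

# Toy run with the two candidate linkages of Example 3.9: the responder
# confirms (p11, p12) and rejects (p12, p22).
truth = {("p11", "p12"): True, ("p12", "p22"): False}
frozen, removed = repair(
    [("linkage", ("p11", "p12")), ("linkage", ("p12", "p22"))],
    lambda kind, pair: truth[pair],
)
print(frozen["linkage"], removed["linkage"])
```

In the envisioned system this step would sit inside the linkage/matching loop, with the frozen answers refining the distance functions of the next iteration.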
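The accuracy-gated deployment of automatic responders outlined above can be made concrete as follows; the class name, the 0.95 threshold, and the agreement-based accuracy estimate are illustrative assumptions, not Noah's actual design.

```python
# Illustrative sketch (assumed names and threshold, not Noah's design):
# an automatic responder answers queries on its own only after its
# agreement with crowd answers provides enough evidence of accuracy;
# until then, every query goes to the crowd and doubles as evaluation data.

class GatedResponder:
    def __init__(self, model, threshold=0.95, min_trials=20):
        self.model = model            # any callable: query -> answer
        self.threshold = threshold    # accuracy required for full deployment
        self.min_trials = min_trials  # evidence required before trusting it
        self.hits = 0                 # agreements with crowd answers
        self.trials = 0               # monitored queries seen so far

    def accuracy(self):
        return self.hits / self.trials if self.trials else 0.0

    def deployed(self):
        return self.trials >= self.min_trials and self.accuracy() >= self.threshold

    def answer(self, query, ask_crowd):
        if self.deployed():
            return self.model(query)           # responder fully deployed
        crowd_answer = ask_crowd(query)        # crowd still in charge
        self.trials += 1
        if self.model(query) == crowd_answer:  # collect evaluation evidence
            self.hits += 1
        return crowd_answer

# A model that always agrees with the crowd gets deployed after 20 queries:
g = GatedResponder(model=lambda q: "yes")
for i in range(20):
    g.answer(f"query-{i}", ask_crowd=lambda q: "yes")
print(g.deployed())  # True
```

A deployed responder would still be spot-checked against occasional crowd answers, matching the monitoring role crowd workers keep at regime; the sketch omits that for brevity.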
REFERENCES
[1] Tara S Behrend, David J Sharek, Adam W Meade, and Eric N Wiebe. 2011. The viability of crowdsourcing for survey research. Behavior Research Methods 43, 3 (2011), 800.
[2] Moria Bergman, Tova Milo, Slava Novgorodov, and Wang-Chiew Tan. 2015. Query-oriented data cleaning with oracles. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1199–1214.
[3] Nikita Bhutani, Yoshihiko Suhara, Wang-Chiew Tan, Alon Halevy, and Hosagrahar Visvesvaraya Jagadish. 2019. Open information extraction from question-answer pairs. arXiv preprint arXiv:1903.00172 (2019).
[4] Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, and Paolo Papotti. 2013. Extraction and integration of partially overlapping web sources. Proceedings of the VLDB Endowment 6, 10 (2013), 805–816.
[5] Valter Crescenzi, Alvaro AA Fernandes, Paolo Merialdo, and Norman W Paton. 2017. Crowdsourcing for data management. Knowledge and Information Systems 53, 1 (2017), 1–41.
[6] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. 2002. RoadRunner: automatic data extraction from data-intensive web sites. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 624–624.
[7] Valter Crescenzi, Paolo Merialdo, and Disheng Qiu. 2019. Hybrid Crowd-Machine Wrapper Inference. ACM Transactions on Knowledge Discovery from Data (TKDD) 13, 5 (2019), 1–43.
[8] Nilesh Dalvi, Ashwin Machanavajjhala, and Bo Pang. 2012. An analysis of structured data on the web. arXiv preprint arXiv:1203.6406 (2012).
[9] AnHai Doan. 2018. Human-in-the-Loop Data Analysis: A Personal Perspective. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA '18). Association for Computing Machinery, New York, NY, USA, Article 1, 6 pages. https://doi.org/10.1145/3209900.3209913
[10] AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Sanjib Das, Yash Govind, Pradap Konda, Han Li, Erik Paulson, Paul Suganthan G. C., and Haojun Zhang. 2017. Toward a System Building Agenda for Data Integration. arXiv:cs.DB/1710.00027
[11] AnHai Doan, Alon Halevy, and Zachary Ives. 2012. Principles of Data Integration. Elsevier.
[12] Pedro Domingos. 2012. A Few Useful Things to Know about Machine Learning. Commun. ACM 55, 10 (Oct. 2012), 78–87. https://doi.org/10.1145/2347736.2347755
[13] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 601–610.
[14] Xin Dong, Barna Saha, and Divesh Srivastava. 2012. Less is more: Selecting sources wisely for integration. Proceedings of the VLDB Endowment 6, 37–48.
[15] Xin Luna Dong and Divesh Srivastava. 2015. Big data integration. Synthesis Lectures on Data Management 7, 1 (2015), 1–198.
[16] Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. 1535–1545.
[17] Wenfei Fan and Floris Geerts. 2012. Foundations of data quality management. Synthesis Lectures on Data Management 4, 5 (2012), 1–217.
[18] Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, and Cheng Wang. 2014. DIADEM: thousands of websites to a single database. Proceedings of the VLDB Endowment 7, 14 (2014), 1845–1856.
[19] Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F Naughton, Narasimhan Rampalli, Jude Shavlik, and Xiaojin Zhu. 2014. Corleone: hands-off crowdsourcing for entity matching. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. 601–612.
[20] Pankaj Gulhane, Amit Madaan, Rupesh Mehta, Jeyashankher Ramamirtham, Rajeev Rastogi, Sandeep Satpal, Srinivasan H Sengamedu, Ashwin Tengli, and Charu Tiwari. 2011. Web-scale information extraction with Vertex. In 2011 IEEE 27th International Conference on Data Engineering. IEEE, 1209–1220.
[21] Jinsong Guo, Valter Crescenzi, Tim Furche, Giovanni Grasso, and Georg Gottlob. 2019. RED: Redundancy-Driven Data Extraction from Result Pages?. In The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, Ling Liu, Ryen W. White, Amin Mantrach, Fabrizio Silvestri, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia (Eds.). ACM, 605–615. https://doi.org/10.1145/3308558.3313529
[22] Maeda F Hanafi, Azza Abouzied, Laura Chiticariu, and Yunyao Li. 2017. Synthesizing extraction rules from user examples with SEER. In Proceedings of the 2017 ACM International Conference on Management of Data. 1687–1690.
[23] Bernd Heinrich, Marcus Kaiser, and Mathias Klier. 2007. How to measure data quality? A metric-based approach. (2007).
[24] Jeff Howe. 2006. The rise of crowdsourcing. Wired Magazine 14, 6 (2006), 1–4.
[25] Nicholas Kushmerick, Daniel S Weld, and Robert Doorenbos. 1997. Wrapper induction for information extraction. University of Washington.
[26] Guoliang Li. 2017. Human-in-the-Loop Data Integration. Proc. VLDB Endow. 10, 12 (Aug. 2017), 2006–2017. https://doi.org/10.14778/3137765.3137833
[27] Guoliang Li, Yudian Zheng, Ju Fan, Jiannan Wang, and Reynold Cheng. 2017. Crowdsourced Data Management: Overview and Challenges. In Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD '17). Association for Computing Machinery, New York, NY, USA, 1711–1716. https://doi.org/10.1145/3035918.3054776
[28] Colin Lockard, Xin Luna Dong, Arash Einolghozati, and Prashant Shiralkar. 2018. Ceres: Distantly supervised relation extraction from the semi-structured web. arXiv preprint arXiv:1804.04635 (2018).
[29] Colin Lockard, Prashant Shiralkar, and Xin Luna Dong. 2019. OpenCeres: When open information extraction meets the semi-structured web. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 3047–3056.
[30] Adam Marcus and Aditya Parameswaran. 2015. Crowdsourced data management: Industry and academic perspectives. Foundations and Trends in Databases 6, 1-2 (2015), 1–161.
[31] Mausam. 2016. Open information extraction systems and downstream applications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence. 4074–4077.
[32] Ninareh Mehrabi, Fred Morstatter, Nripsuta Saxena, Kristina Lerman, and Aram Galstyan. 2019. A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635 (2019).
[33] Christina Niklaus, Matthias Cetto, André Freitas, and Siegfried Handschuh. 2018. A survey on open information extraction. arXiv preprint arXiv:1806.05599 (2018).
[34] Erhard Rahm and Philip Bernstein. 2001. A Survey of Approaches to Automatic Schema Matching. VLDB J. 10 (12 2001), 334–350. https://doi.org/10.1007/s007780100057
[35] Bahareh Rahmanian and Joseph G. Davis. 2014. User Interface Design for Crowdsourcing Systems. In Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces (AVI '14). Association for Computing Machinery, New York, NY, USA, 405–408. https://doi.org/10.1145/2598153.2602248
[36] Alexander J Ratner, Stephen H Bach, Henry R Ehrenberg, and Chris Ré. 2017. Snorkel: Fast training set generation for information extraction. In Proceedings of the 2017 ACM International Conference on Management of Data. 1683–1686.
[37] Michael Schmitz, Stephen Soderland, Robert Bart, Oren Etzioni, et al. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 523–534.
[38] Zeyuan Shang, Emanuel Zgraggen, Benedetto Buratti, Ferdinand Kossmann, Philipp Eichmann, Yeounoh Chung, Carsten Binnig, Eli Upfal, and Tim Kraska. 2019. Democratizing Data Science through Interactive Curation of ML Pipelines. In Proceedings of the 2019 International Conference on Management of Data (SIGMOD '19). Association for Computing Machinery, New York, NY, USA, 1171–1188. https://doi.org/10.1145/3299869.3319863
[39] Kai-Sheng Teong, Lay-Ki Soon, and Tin Tin Su. 2020. Schema-Agnostic Entity Matching using Pre-trained Language Models. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2241–2244.
[40] Sebastian Thrun and Lorien Pratt. 2012. Learning to Learn. Springer Science & Business Media.
[41] Doris Xin, Litian Ma, Jialin Liu, Stephen Macke, Shuchen Song, and Aditya Parameswaran. 2018. Accelerating Human-in-the-Loop Machine Learning: Challenges and Opportunities. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning (DEEM '18). Association for Computing Machinery, New York, NY, USA, Article 9, 4 pages. https://doi.org/10.1145/3209889.3209897
[42] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–19.