<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Noah: Creating Data Integration Pipelines over Continuously Extracted Web Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Valerio</forename><surname>Cetorelli</surname></persName>
							<email>valerio.cetorelli@uniroma3.it</email>
							<affiliation key="aff0">
								<orgName type="institution">Università Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
							<email>valter.crescenzi@uniroma3.it</email>
							<affiliation key="aff1">
								<orgName type="institution">Università Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
							<email>paolo.merialdo@uniroma3.it</email>
							<affiliation key="aff2">
								<orgName type="institution">Università Roma Tre</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Roger</forename><surname>Voyat</surname></persName>
							<email>roger.voyat@uniroma3.it</email>
							<affiliation key="aff3">
								<orgName type="institution">Università Roma Tre</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Noah: Creating Data Integration Pipelines over Continuously Extracted Web Data</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D23CC8A54D94908B6324C78DE58ABD45</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T01:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We present Noah, an ongoing research project aiming at developing a system for semi-automatically creating end-to-end Web data processing pipelines. The pipelines continuously extract and integrate information from multiple sites by leveraging the redundancy of the data published on the Web. The system is based on a novel hybrid human-machine learning approach in which the same type of questions can be interchangeably posed both to human crowd workers and to automatic responders based on machine learning (ML) models. Since the early stages of a pipeline, crowd workers are engaged to guarantee the quality of the output data and to collect training data, which are then used to progressively train and evaluate the automatic responders. The latter are later fully deployed into the data processing pipelines to scale the approach and to contain the crowdsourcing costs. The combination of guaranteed quality and progressively decreasing costs in the pipelines generated by our system can improve the investment and development processes of the many applications that build on the availability of such data processing pipelines.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION AND MOTIVATION</head><p>The Web is the largest knowledge base ever built by humans. However, most of the data on the Web are not directly available to applications, unless complex data extraction and integration pipelines are set up. Creating these pipelines to build structured knowledge bases, and continuously maintaining them in a cost-effective way, is still a challenging problem. Currently, most projects fulfill their data processing needs by means of case-by-case solutions that cannot be reused across projects.</p><p>This paper presents Noah, a research project that aims at developing a system for creating, and maintaining over time, end-to-end data processing pipelines for continuously extracting and integrating Web data. Noah is based on a hybrid human-machine learning approach, whose goal is to guarantee the quality of the processed data by leveraging feedback provided by human crowd workers. Our approach can be classified in the realm of Open Information Extraction <ref type="bibr" target="#b30">[31]</ref>, because it aims at extracting and integrating information both at the instance (objects) and at the schema (attributes) level into an internal knowledge base (IKB) that is created, populated and maintained for every domain. Indeed, if new sources are incrementally added to an already generated pipeline, the system is able to discover new entities and new attributes from those sources.</p><p>In order to contain the crowdsourcing costs, the proposed approach leverages two techniques. First, it exploits the inherent redundancy of Web sources to automatically find correct domain information: data published by several independent sources are more likely to be correct and can easily be distinguished from noisy or non-relevant data <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b14">15]</ref>. Second, it exploits the collected data to continuously train ML models. 
Those ML models are progressively introduced in the form of automatic responders that replace crowd workers <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b29">30]</ref>, and they are continuously evaluated during each step of the data processing pipelines: only responders that become sufficiently reliable are fully deployed in the operations of the created pipelines. Problem Description. We are given a set of sources S = {𝑆 1 , 𝑆 2 , . . .} from the same domain (e.g., Smartphones); each source 𝑆 𝑖 is specified by means of 𝑛 𝑖 URLs of detail pages about domain objects (e.g., IPhone 12, Mi 10T). By detail page we mean a page reporting information about one particular object, the topic entity <ref type="bibr" target="#b28">[29]</ref> of the page, for which it publishes the values of several attributes. An example of detail pages from two sources, both about the same IPhone 12 domain object, is shown in Figure <ref type="figure" target="#fig_0">1</ref>, where the values of several attributes of interest, such as Model, Memory, and Price, are highlighted.</p><p>A domain includes a set of objects O = {𝑜 1 , 𝑜 2 , . . .} and a set of attributes A = {𝐴 1 , 𝐴 2 , . . .}, which are populated with data extracted from the pages of the sources belonging to that domain. New attributes and new objects of a domain can be discovered as new sources are considered part of the domain.</p><p>Each source publishes detail pages reporting the values of a subset of the domain attributes, for a subset of the domain objects. We use the terms source attributes and source objects to denote the version of a domain attribute or object as published by a source, i.e., the occurrences of attribute values about an object as published by that source. 
It is worth noticing that some domain attributes can be published, possibly with inconsistencies amongst the provided values, by several sources, e.g., Model, while other attributes, e.g., ReviewScore or Price, have values which are inherently source-specific.</p><p>In the following, we identify a source object by means of the URL of the detail page hosting its data, and we identify a source attribute by means of a domain-unique identifier of the extraction rule that is capable of locating its value in the detail page. By extraction rule we mean a function extracting at most one value from a detail page; the formalism in which it is specified, e.g., XPath expressions, does not matter.</p><p>Our goal is that of continuously extracting data with a guaranteed level of quality from the detail pages composing the sources. The data are reorganized into the IKB while minimizing the overall costs. As a measure of data quality, we will use standard measures such as precision, recall, and 𝐹-measure over the integrated data <ref type="bibr" target="#b22">[23]</ref>. As a measure of cost, the goal is that of minimizing the crowdsourcing costs <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b26">27]</ref>.</p><p>In the IKB the following information will be available: (linkages and matches) how the source attributes and objects are respectively mapped to the domain attributes and objects; (value provenance) the source attribute values for every object in the domain.</p><p>The problem we want to solve is that of continuously creating K 𝑡 , the IKB at every time 𝑡 at which the snapshots of the detail pages from every source of a domain D are gathered. We illustrate the problem definition by means of a running example shown in Figure <ref type="figure" target="#fig_1">2</ref>.</p></div>
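The notions introduced above (sources, detail pages identified by URL, extraction rules identified by a domain-unique id, and an IKB holding linkages, matches, and value provenance) can be sketched as plain data structures. The following is a minimal illustrative sketch under our own naming assumptions, not the system's actual code:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class ExtractionRule:
    rule_id: str  # unique identifier within the domain
    extract: Callable[[str], Optional[str]]  # extracts at most one value per page

@dataclass
class IKB:
    # linkages: source-object URL -> domain object id
    linkages: dict = field(default_factory=dict)
    # matches: extraction-rule id -> domain attribute id
    matches: dict = field(default_factory=dict)
    # provenance: (domain object, domain attribute) -> {rule_id: extracted value}
    provenance: dict = field(default_factory=dict)

    def record(self, url: str, rule: ExtractionRule, page_html: str) -> None:
        # Store the value extracted by a rule from a detail page, keeping
        # track of which source attribute (rule) it came from.
        value = rule.extract(page_html)
        obj = self.linkages.get(url)
        attr = self.matches.get(rule.rule_id)
        if value is not None and obj is not None and attr is not None:
            self.provenance.setdefault((obj, attr), {})[rule.rule_id] = value
```

The dictionaries make the three kinds of IKB information from the problem description explicit; a real implementation would of course persist and version them per snapshot time 𝑡.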
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">SCOPE, OPPORTUNITIES, CHALLENGES</head><p>Building and maintaining effective data processing pipelines over Web data is a challenging problem for several reasons. First, Web sources are autonomous and remote: they can change unpredictably and thereby break all the extraction rules created on previous versions of the same source. Second, setting up an integration pipeline requires solving many interrelated tasks, each of which has motivated a flurry of research works, including: source discovery, data extraction, schema matching, record linkage, data fusion, data labeling, and data cleaning. Each of these problems has been extensively studied over the last decades, with tens, if not hundreds in some cases, of well-recognized research works <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b38">39]</ref>.</p><p>The focus of our research project covers the three problems that we believe are at the core of any Web data integration pipeline: extraction, matching, and linkage. 
It does not include, on the one hand, the source discovery problem and the automatic synthesis of crawling programs, nor, on the other hand, the data fusion problem.</p><p>Our solution can help the several projects that need to set up, and maintain over time, Web data processing pipelines, but that require a guaranteed quality of the pipelines' output data to be business meaningful.</p><p>Clearly, the amount of work outsourced to crowd workers to guarantee the quality level largely depends on the inherent characteristics of the domain: domains containing static attributes that are largely redundant from source to source can dramatically simplify domain data detection, extraction, and schema matching; an attribute working as a soft identifier across several sources (e.g., books' ISBN) can contribute significantly to reducing the cost of the record linkage task for a domain.</p><p>Unfortunately, it turns out that many interesting domains (e.g., job postings, real estate, . . . ) do not exhibit such redundancy, and the type of redundancy that the system has to exploit is at an intensional level, i.e., type and format of values, range of values, labels of extracted data. Generally speaking, separating domain data from other information becomes largely dependent on the context in which the attributes are proposed, and on the availability of human feedback to check the correctness of the proposed hypotheses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Redundancy as OpenIE Enabler</head><p>Redundancy plays a fundamental role in our system to keep the crowdsourcing costs at reasonable levels. Whenever the redundancy of data across sources is properly detected and exploited, domain data can be distinguished from other noisy or out-of-domain information. For example, WEIR <ref type="bibr" target="#b3">[4]</ref> assumes that the linkages between the collections of pages of two sources are already known as part of the input, and then exploits the redundancy of distinct and independent sources that publish information about the same objects and attributes to automatically find correct extraction rules and schema matches.</p><p>Noah aims at exploiting redundancy for extracting and integrating Web data, as pioneered by WEIR, to the largest possible extent. It will exploit at least the following forms of redundancy: Intensional: several sources publish the same domain attributes; Extensional: several sources publish information about the same domain objects; Temporal: a source publishes data about the same domain objects and attributes over time; Intra-source: a source can publish data about the same objects in pages of distinct types, e.g., a result page containing snippets of records with the most relevant attributes plus links to detail pages containing all attributes <ref type="bibr" target="#b20">[21]</ref>.</p><p>At the same time, and with the help of human feedback, Noah aims at overcoming WEIR's limitations by relaxing its rather strict underlying assumptions on the input domain: WEIR requires that enough intensional and extensional redundancy is available to discern all domain data from all other information. 
WEIR and Noah fall in the realm of the OpenIE approaches <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b28">29,</ref><ref type="bibr" target="#b32">33,</ref><ref type="bibr" target="#b36">37]</ref>: unlike in the ClosedIE approaches <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b24">25,</ref><ref type="bibr" target="#b27">28]</ref>, where the managed knowledge base does not grow in terms of subjects and predicates but only in terms of values, new schema information, e.g., new domain attributes, can be progressively discovered while populating the knowledge base with entities and values of the already known schema.</p><p>There are two main differences between Noah and other OpenIE <ref type="bibr" target="#b28">[29]</ref> systems: first, we do not require a pre-populated knowledge base, as we start from an empty IKB and populate it as new sources are added to the domain; second, we aim at continuously extracting and integrating data <ref type="bibr" target="#b10">[11]</ref>, as we believe that the temporal setting is important both for business reasons (many projects need a continuous stream of data rather than snapshots), and for including in the main problem definition the maintenance costs of the generated pipelines over time, costs that are largely neglected in many research proposals <ref type="bibr" target="#b28">[29]</ref>.</p><p>Although many of the problems that need to be tackled to create our pipelines have already been extensively covered in the research literature, we believe that semi-automating the creation of Web data processing pipelines can still be considered a relevant problem <ref type="bibr" target="#b9">[10]</ref>.</p><p>We argue that if the costs and the guaranteed level of quality <ref type="bibr" target="#b16">[17]</ref> are explicitly considered, many projects relying on data processing pipelines 
can be turned into a much more controllable investment and validation process, and their overall feasibility can be significantly improved, because many business projects are strongly and directly affected by the cost of creating and maintaining the underlying Web data processing pipelines.</p><p>Moreover, we believe that by posing the same type of queries to human and automatic responders, they become interchangeable enough to motivate the study of new deployment methodologies for Web data processing pipelines. The goal of such methodologies is to progressively lower the crowdsourcing costs by means of machine-learning techniques, while keeping the output quality level under control since the early stages of the deployed pipelines. Indeed, many development projects often experience unpredictable and erratic time-to-market (TTM) and return-on-investment (ROI) because, especially in the early stages, they adopt ML algorithms but lack the amount and quality of training data, and the validation, needed to guarantee the desired output quality. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">NOAH SYSTEM AND PIPELINES</head><p>The Noah system supports the semi-automatic generation of end-to-end Web data processing pipelines over several domains. Figure <ref type="figure" target="#fig_2">3</ref> shows how the system can generate and operate many pipelines at the same time, each having an IKB that is progressively and continuously populated with data coming from the sources of the domain on which it operates. Our system interacts with external systems by means of two major components: the Crawler, which continuously downloads snapshots of the pages of every source with a frequency specified by a cron expression; and the Crowd Manager, which manages the interactions with a crowdsourcing platform.</p><p>During operations, Noah generates pipeline queries for the responders engaged through the crowdsourcing platform. The responders contribute to solving the system tasks needed to set up and maintain new pipelines: for example, tasks are needed to select the initial extraction rules over every domain source, to select and label the source attributes, to find the linkages from source objects to a common mediated domain object, and to match the source attributes across several sources to a mediated domain attribute.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>System Tasks</head><p>The main system tasks that need to be tackled to set up a Noah pipeline are shown in Figure <ref type="figure" target="#fig_3">4</ref>: Page Linkage, Data Extraction, Schema Matching, and Object Linkage.</p><p>Page Linkage aims at obtaining a first approximation of the top-𝑘 page linkages. Two pages are in a linkage if they both publish data related to the same domain object.</p><p>Example 3.1 (Page Linkage). In Figure <ref type="figure" target="#fig_1">2</ref> we can see two possible page linkages at time 𝑡 𝑛 : {(𝑝 1 1 , 𝑝 2 1 ), (𝑝 1 𝑚 , 𝑝 2 4 )}. Their distances, i.e., 0.09 and 0.12, are shown at the top of Figure <ref type="figure" target="#fig_6">5a</ref>.</p><p>Data Extraction aims at finding all the correct extraction rules. It generates all the possible extraction rules and discovers the correct ones by exploiting the redundancy of the data published across several independent sources <ref type="bibr" target="#b3">[4]</ref> when available, while querying the responders <ref type="bibr" target="#b6">[7]</ref> to confirm uncertain hypotheses.</p><p>Schema Matching aims at finding matches between extraction rules by exploiting an instance-based distance measure between source attributes. The instance-based distance between two extraction rules assumes the availability of correct object linkages to align the source objects related to the same domain object, as produced in output by the next system task: the distance is obtained by averaging the distance between the values extracted by the two rules over all linked pages.</p><p>Object Linkage aims at finding linkages between source objects by exploiting a pairwise attribute distance measure between source attributes. 
The pairwise attribute distance between two source objects assumes the availability of correct schema matches across the extraction rules to align the source attributes related to the same domain attribute, as produced in output by the previous system task: the distance is obtained by averaging the distance between the two values over all the matching attributes.</p><p>We name this linkage/matching loop of system tasks the Linkage/Matching Duality; we further discuss it in Section 3.1.</p></div>
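The two dual distances just described can be sketched directly from their definitions: each averages a value-level distance, over linked pages in one case and over matched attributes in the other. The value-level distance itself is left unspecified above, so a normalized similarity from Python's difflib stands in for it here purely as an assumption:

```python
from difflib import SequenceMatcher

def value_distance(a: str, b: str) -> float:
    # Illustrative normalized value distance in [0, 1];
    # the actual measure used by the system may differ.
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def instance_based_distance(rule_a, rule_b, linkages):
    # Schema Matching side: compare two extraction rules over the values
    # they extract from pairs of detail pages already known to be linked.
    values = [(rule_a(pa), rule_b(pb)) for pa, pb in linkages]
    values = [(va, vb) for va, vb in values if va is not None and vb is not None]
    return sum(value_distance(va, vb) for va, vb in values) / len(values)

def pairwise_attribute_distance(obj_a, obj_b, matches):
    # Object Linkage side: compare two source objects (dicts rule_id -> value)
    # over the pairs of attributes already known to match.
    ds = [value_distance(obj_a[ra], obj_b[rb])
          for ra, rb in matches if ra in obj_a and rb in obj_b]
    return sum(ds) / len(ds)
```

Note the duality in the signatures: the first function needs fixed linkages, the second needs fixed matches, which is exactly the loop discussed in Section 3.1.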
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Pipeline Queries</head><p>Noah tries to solve every system task necessary to set up and maintain a pipeline by using a human-in-the-loop approach <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b25">26]</ref>: unsupervised algorithms generate the most likely hypotheses based on the available redundancy. These hypotheses are later refuted or validated by means of queries posed to responders, initially only human responders, and later also automatic responders based on ML models trained with the data collected while operating the Noah pipeline (see Section 4).</p><p>An example of the queries posed to the responders for every system task is shown in Figure <ref type="figure" target="#fig_3">4</ref>: Page Linkage, Data Extraction, Schema Matching and Object Linkage.</p><p>Example 3.4 (Schema Matching Query). Figure <ref type="figure" target="#fig_3">4</ref> shows that schema matching tasks can be solved by means of queries confirming or refuting a single match: 'Do "108MP" and "20MP" refer to the same attribute of object "MI 10"?'. The template of the query supporting a schema matching task has been filled with values extracted from two pages of distinct sources, e.g., using the extraction rules (𝑟 1 5 , 𝑟 2 4 ). These are two detail pages considered in a linkage, and "MI 10" is the name associated with the corresponding domain object.</p><p>Example 3.5 (Page Linkage Query). A query such as 'Do these two pages refer to the same object?' posed to human responders in Figure <ref type="figure" target="#fig_3">4</ref> can validate or refute a page linkage (𝑝 1 𝑚 , 𝑝 2 4 ). In order for the query to be as simple as possible <ref type="bibr" target="#b34">[35]</ref>, we can show the user a screenshot of the original pages.</p><p>Example 3.6 (Object Linkage Query). Unlike the case of the page linkage tasks above, here the query is posed directly on source objects with extracted values. 
A query such as 'Do these two objects refer to the same object?' posed to human responders in Figure <ref type="figure" target="#fig_3">4</ref> can validate or refute an object linkage (𝑝 1 1 , 𝑝 2 1 ). To make the query as simple as possible for a human responder, it is shown together with two records whose attributes have already been aligned by leveraging the results of a schema matching task.</p><p>The tremendous success of crowdsourcing <ref type="bibr" target="#b23">[24]</ref> can be partially explained by observing that human supervision can be the essential final ingredient for problems that are really hard to solve through automatic algorithms, but that can be transformed into rather simple questions for human workers. However, it is well known that in practice the availability and the accuracy of crowd workers, especially of unskilled ones, strongly depend on the way the questions are posed and rewarded <ref type="bibr" target="#b34">[35]</ref>. One of Noah's goals is to exploit the progressively built IKB also to make the crowdsourcing queries as simple as possible. For example, a query to check a record linkage exploits the schema matching already computed to make the two records easy to compare visually.</p></div>
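The templated nature of pipeline queries, filled with values drawn from the IKB as in the examples above, can be sketched as follows. The template wording for the schema matching query echoes Example 3.4; the other templates and the function name are our own illustrative assumptions, not the system's actual API:

```python
# Hypothetical pipeline-query templates in the spirit of Figure 4.
QUERY_TEMPLATES = {
    "schema_matching": 'Do "{v1}" and "{v2}" refer to the same attribute of object "{obj}"?',
    "page_linkage":    "Do these two pages refer to the same object?",
    "object_linkage":  "Do these two objects refer to the same object?",
}

def render_query(task: str, **slots) -> str:
    # Fill the task's template with values taken from the IKB
    # (extracted values, object names, ...).
    return QUERY_TEMPLATES[task].format(**slots)
```

Because the same rendered question can be sent either to a crowd worker or to an automatic responder, the template layer is what makes the two kinds of responders interchangeable.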
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Linkage / Matching Duality</head><p>Figure <ref type="figure" target="#fig_3">4</ref> shows that two important integration tasks operated by Noah pipelines, i.e., Schema Matching and Object Linkage, are part of a loop in which each one assumes the availability of the output of the other to solve its own task. Page Linkage is the system task outside the loop, needed for its initial triggering.</p><p>We assume the availability of two normalized distance functions providing a value between 0 and 1 when comparing two rules, and two source objects (records), respectively: the instance-based distance and the pairwise attribute distance. The former compares two rules over the values they extract from a set of detail pages which have been previously aligned, i.e., their linkages are fixed; the latter compares two source objects over the values of their matched source attributes.</p><p>Example 3.7 (Dual Distances). Instance-based distance: let (𝑝 1 𝑚 , 𝑝 2 4 ) and (𝑝 1 1 , 𝑝 2 1 ) be two given correct linkages for the detail pages associated with the IPhone 12 and MI 10 source objects from sources 𝑆 1 and 𝑆 2 , as shown in Figure <ref type="figure" target="#fig_6">5d</ref>. The distance between the rules (𝑟 1 5 , 𝑟 2 4 ) can be computed as follows:</p><formula xml:id="formula_0">𝑑 (𝑟 1 5 , 𝑟 2 4 ) = 𝑑 (𝑟 1 5 (𝑝 1 1 ), 𝑟 2 4 (𝑝 2 1 )) + 𝑑 (𝑟 1 5 (𝑝 1 𝑚 ), 𝑟 2 4 (𝑝 2 4 )) = 𝑑 ('108MP', '108MP') + 𝑑 ('12MP', '14MP') = 2.9. The normalized distance in the range [0, 1] is 0.27.</formula><p>Pairwise attribute distance: let (𝑟 2 2 , 𝑟 1 2 ) and (𝑟 1 1 , 𝑟 2 1 ) be two given correct matches for the Brand and Model attributes (see Figure <ref type="figure" target="#fig_6">5b</ref>). The distance between the two source objects about MI 10 PRO and MI 10T can be computed as follows:</p><formula xml:id="formula_1">𝑑 (𝑜 1 2 , 𝑜 2 2 ) = 𝑑 (𝑟 1 2 (𝑝 1 2 ), 𝑟 2 2 (𝑝 2 2 )) + 𝑑 (𝑟 1 1 (𝑝 1 2 ), 𝑟 2 1 (𝑝 2 2 )) = 𝑑 ('XIAOMI', 'XIAOMI') + 𝑑 ('MI 10 PRO', 'MI 10T') = 3.2. </formula><p>
The normalized distance in the range [0, 1] is 0.27.</p><p>We revisit and propose an extension of two domain properties, called Local Consistency and Separable Domain, underlying the formal approach presented in WEIR <ref type="bibr" target="#b3">[4]</ref> for solving the extraction and matching problems when the page linkage is given as input.</p><p>Our ambition is twofold: on the one side, we aim to extend that approach to cover the whole trio of extraction, matching and linkage problems at the core of Noah pipelines; on the other side, we want to relax the underlying assumptions by means of the feedback provided by human crowd workers, thus making the approach adaptable to domains with more disparate characteristics than those originally covered in the WEIR project. Here we briefly recall the two properties and sketch how we plan to extend them.</p><p>Local Consistency (LC) In a source there cannot be two distinct source attributes that refer to the same domain attribute. The dual property that we additionally assume is that two distinct detail pages from the same source cannot publish data about the same domain object. Separable Domain (SD) In a mapping composed of several extraction rules, each from a distinct source, and associated with the same domain attribute, the instance-based distances between the rules of the mapping are always smaller than their distances to rules associated with a different domain attribute. For computing the instance-based distance, the object linkages are fixed and already known.</p><p>The dual property that we additionally assume is that in a linkage composed of several source objects from distinct sources and related to the same domain object, the pairwise attribute distances are always smaller than the distances to source objects associated with a different domain object. 
For computing the pairwise attribute distance, the source attribute matches are fixed and already known.</p><p>For domains in which such properties hold, the WEIR system is able to match the extraction rules and build their mappings into clusters of source attributes related to the same domain attribute by comparing all the similarity distances, while at the same time separating the correct extraction rules from the noisy ones. The idea is pretty simple and depicted in Figure <ref type="figure" target="#fig_6">5</ref>: SD suggests to sort the set of all possible matches (pairs of extraction rules) by the instance-based distance, leveraging the alignment of the detail pages (see Figure <ref type="figure" target="#fig_6">5c</ref>). Those pairs are then processed in order of increasing distance: each pair of rules is merged into the same mapping as long as the addition of the rules does not lead to a violation of the LC property, i.e., two rules (source attributes) from the same source ending up in the same output mapping (see Figure <ref type="figure" target="#fig_6">5d</ref>). For certain domains, with sufficiently overlapping sources, WEIR can automatically find the correct extraction rules and their matches with the rules of other sources, provided that the correct linkages between detail pages are known.</p><p>The dual algorithm solves the problem of finding correct object linkages provided that correct schema matches between source attributes are given, as depicted in Figure <ref type="figure" target="#fig_6">5</ref>: SD suggests to sort the set of all possible linkages (pairs of source objects) by the pairwise attribute distance (see Figure <ref type="figure" target="#fig_6">5a</ref>). 
Those pairs are then processed in order of increasing distance: each pair of source objects is merged into the same linkage as long as the addition of the objects to an existing linkage does not lead to a violation of the LC property, i.e., two source objects from the same source ending up in the same output linkage (see Figure <ref type="figure" target="#fig_6">5b</ref>). This algorithm exploits the duality of the matching and linkage problems in this setting, and it is at the core of the integration engine of the Noah project. However, differently from WEIR, it does not halt the integration as soon as an LC violation is detected: rather, it generates pipeline queries to confirm the choice, and continues processing all pairs in increasing order of distance, until the distance exceeds a threshold beyond which no further matches/linkages are expected with meaningful distance functions.</p><p>Unfortunately, as also recognized in WEIR <ref type="bibr" target="#b3">[4]</ref>, some domains have sources and attributes with very similar but semantically different values (e.g., the resolution of the front/rear cameras in Figure <ref type="figure" target="#fig_1">2</ref>). This situation easily leads to violations of the LC and SD assumptions, and finding the mappings is a challenging problem for many interesting domains.</p><p>Example 3.8 (Non-separable Domains for Schema Matching). In Figure <ref type="figure" target="#fig_1">2</ref>, sources 𝑆 1 and 𝑆 2 both have extraction rules ((𝑟 1 5 , 𝑟 1 6 ) and (𝑟 2 4 , 𝑟 2 5 ), respectively) with a low distance (Figure <ref type="figure" target="#fig_6">5c</ref>) because camera resolutions (e.g., 1-front and 2-back) are typically within a small range of values expressed in megapixels (MP). 
Figure <ref type="figure" target="#fig_6">5d</ref> shows that the pair of rules (𝑟 1 5 , 𝑟 1 6 ) at distance 0.25 violates the LC and SD assumptions, because their distance is smaller than the distance of (𝑟 1 5 , 𝑟 2 4 ), which is 0.27. Actually, it is well known that the dual Record Linkage problem is even more challenging than Schema Matching itself: the attributes carrying the correct signals for considering two objects equivalent can change from object to object, even within the same source (think of smartphones of different brands with different policies for naming the models and differentiating the features of each model). Assuming that no object in the domain leads to a separability violation is quite unrealistic, besides toy cases.</p><p>Example 3.9 (Non-separable Domains for Object Linkage). In Figure <ref type="figure" target="#fig_6">5b</ref> the linkage (𝑝 1 1 , 𝑝 2 2 ) violating the LC property has a pairwise attribute distance of only 0.09, which is smaller than the distance of the correct linkage (𝑝 1 1 , 𝑝 2 1 ), and therefore the domain is not separable.</p><p>We believe that the violations of the LC and SD assumptions can be manually fixed, and that they help to find the most informative pipeline queries that need to be posed to external responders, i.e., paid crowd workers or suitably trained automatic responders.</p><p>By interleaving the dual linkage/matching algorithms in a loop in which external responders can contribute, as shown in Figure <ref type="figure" target="#fig_3">4</ref>, each execution can contribute to improve the accuracy of the distance function used by the other task, either by improving the linkages used by the instance-based distance, or by improving the matches used by the pairwise attribute distance.</p><p>Our vision is that, with the precious help of crowdsourcing and a loop of interleaved linkage/matching operations, the desired target quality can be reached even in the presence of 
non-separable domains: responders will be engaged to assess the quality of the output and to repair the uncertain choices made by the integration algorithm. The linkages and matches confirmed by human feedback can be frozen and exploited in the following iterations, progressively resolving, and hence removing from the domain, the linkages or matches that made it non-separable.</p></div>
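The LC-constrained greedy loop described above can be sketched in Python. This is an illustrative sketch only, not the Noah implementation: the names `integrate`, `source_of`, and the `threshold` value are our assumptions.

```python
def integrate(pairs, source_of, threshold=0.5):
    """Greedily build linkages from candidate pairs sorted by distance,
    never letting two objects of the same source share a linkage (LC).

    pairs: iterable of (distance, a, b) candidate linkages/matches.
    source_of: maps an object to its source identifier.
    Returns the accepted linkages and the uncertain pairs turned into
    pipeline queries for external responders.
    """
    cluster = {}   # object -> the set of objects it is currently linked to
    queries = []   # LC-violating pairs turned into pipeline queries

    def group(x):
        return cluster.setdefault(x, {x})

    for dist, a, b in sorted(pairs):
        if dist > threshold:
            # beyond this distance no meaningful match/linkage is expected
            break
        ga, gb = group(a), group(b)
        if ga is gb:
            continue
        union = ga | gb
        if len({source_of(x) for x in union}) < len(union):
            # LC violation: two objects of the same source would end up in
            # one linkage. Unlike WEIR, do not halt -- emit a query to
            # confirm the choice, and keep processing the remaining pairs.
            queries.append((dist, a, b))
            continue
        for x in union:
            cluster[x] = union
    linkages = {frozenset(g) for g in cluster.values()}
    return linkages, queries
```

The key design choice the sketch mirrors is that an LC violation produces a query rather than stopping the integration, so the remaining unambiguous pairs are still merged.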
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">RESEARCH DIRECTIONS</head><p>In the early stages of its life, the IKB K of a new Noah pipeline might be scarcely populated. As redundancy builds up over time with the addition of new sources feeding the IKB, the accuracy of the extraction and integration process increases.</p><p>The absence of overlap among the objects and attributes published by a rather limited set of sources could limit the amount of available redundancy. In this situation, Noah would end up generating many queries to support the system tasks while operating the pipeline. As an alternative, Noah supports the incremental addition of a source into an existing pipeline. A new source might contribute to lowering the overall costs if it significantly overlaps with the sources already available for the domain <ref type="bibr" target="#b13">[14]</ref>. Conversely, integrating new sources that publish new objects or new attributes might incur additional costs to reconcile them with the existing IKB.</p><p>We are interested in studying ML techniques that could decrease crowdsourcing costs even in the absence of redundancy. The main research area is the synthesis of automatic responders capable of answering the same type of pipeline queries normally posed to human responders for solving Noah tasks, with the goal of progressively replacing human responders <ref type="bibr" target="#b6">[7]</ref> and scaling the approach up to many thousands of sources.</p><p>Unfortunately, state-of-the-art unsupervised ML techniques <ref type="bibr" target="#b39">[40,</ref><ref type="bibr" target="#b41">42]</ref> can be adapted to provide accurate and reliable answers to those queries only if enough training data have been collected. 
Indeed, fairness, bias, or simply the misuse of machine learning algorithms are well-known problems in the literature <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b31">32]</ref> that affect many development projects, especially in the scenarios most commonly found in practice <ref type="bibr" target="#b37">[38]</ref>: pre-trained ML models and/or enough training data are not available up-front, so the ML models cannot be properly tuned and exhibit erratic and unpredictable performance <ref type="bibr" target="#b40">[41]</ref>.</p><p>Snorkel <ref type="bibr" target="#b35">[36]</ref> is another project exploiting the idea of leveraging human work to train ML algorithms. However, it is based on the idea of engaging skilled workers in every step of the processing pipeline, while Noah aims at engaging non-skilled workers, who can interchangeably be posed queries in the same form as those posed to automatic responders. Several other projects, such as QOCO <ref type="bibr" target="#b1">[2]</ref> and SEER <ref type="bibr" target="#b21">[22]</ref>, have made use of crowdsourcing by mainly focusing on the problem of selecting the correct extraction rules, while Noah applies the same query control methodology to all the tasks in the considered pipelines.</p><p>It is also well known that engaging automatic responders that are not accurate enough might turn out to be more expensive than not using them at all, as additional human workers must be engaged just to offset their wrong answers <ref type="bibr" target="#b6">[7]</ref>. We envision a system in which crowd workers indirectly control the deployment of automatic responders, and the two types of responders are interchangeably engaged. Crowd workers contribute to collecting domain data that are then used to train and evaluate automatic responders before fully deploying them. 
Automatic responders will progressively replace crowd workers to scale the approach and lower the operating costs, but only after enough evidence has been gathered that their accuracy does not compromise the guaranteed quality of the output data. At steady state, crowd workers will be minimally engaged, only to keep monitoring the performance of the automatic responders.</p><p>We have identified several novel research challenges:</p><p>• formalizing and proving the correctness of an algorithm that solves the full trio of extraction, matching, and linkage tasks; • creating and maintaining over time the continuous Web data processing pipelines at low cost, with guaranteed output quality; • designing several independent automatic responders based on ML models that are capable of answering queries normally posed to crowd workers; • effectively measuring the available redundancy in a domain; • estimating, from the characteristics of a domain, the crowdsourcing costs necessary to obtain and maintain the desired output quality.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Web detail pages in the Smartphone domain.</figDesc><graphic coords="1,344.38,363.21,162.32,63.41" type="bitmap" /></figure>
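The break-even argument above, that insufficiently accurate automatic responders can cost more than using crowd workers alone, can be made concrete with a toy cost model. The model, the function name, and the default repair factor below are our illustrative assumptions, not the cost model of [7]:

```python
def deploy_automatic(accuracy, k_repair_queries=3):
    """Return True when an automatic responder is expected to be cheaper
    than routing every query to crowd workers.

    Toy model: a correct automatic answer costs nothing, while each wrong
    answer must be detected and offset by k_repair_queries human queries
    (each costing as much as answering the original query directly).
    """
    expected_human_queries = (1 - accuracy) * k_repair_queries
    # Deploy only below the human baseline of 1 query per question,
    # i.e. when accuracy > 1 - 1/k_repair_queries.
    return expected_human_queries < 1.0
```

Under this sketch, with 3 repair queries per mistake, a responder pays off only above roughly 67% accuracy, which is why crowd workers are kept in the loop to estimate accuracy before full deployment.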
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Running Example -The Smartphones domain includes 2 sources crawled at 𝑛 instants. For each source, 6 correct extraction rules working on several detail pages are given: 𝑟 𝑖 𝑗 (𝑗 = 1, . . . , 6) denotes the 𝑗-th rule working on source 𝑆 𝑖 , each extracting the value of a source attribute from a detail page associated with a source object. For example, 𝑝 1 3 indicates the page about iPhone 11 from source 𝑆 1 and rule 𝑟 1 2 extracts the Model from every page of the same source. At every time 𝑡, the values extracted from the two sources are conveniently depicted as organized in tables: each row of the table is associated with a detail page of the source, and each column is associated with an extraction rule over the same source. The set of domain attributes includes: Model, Brand, Price, Memory, Camera 1, Camera 2. Correct linkages can be represented as pairs of pages about the same domain objects: {(𝑝 1 1 , 𝑝2 1 ), (𝑝 1 𝑚 , 𝑝 2 4 )}. Correct source attribute matches can be represented as pairs of correct extraction rules: {(𝑟 1 1 , 𝑟 2 1 ), (𝑟 1 2 , 𝑟 2 2 ), (𝑟 1 4 , 𝑟 2 3 ), (𝑟 1 5 , 𝑟 2 4 ), (𝑟 1 6 , 𝑟 2 5 )}.</figDesc><graphic coords="2,102.57,83.69,390.15,143.13" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Overview of Noah System &amp; Pipelines created</figDesc><graphic coords="3,344.38,129.43,162.32,103.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Running example (pipeline with queries): tasks provided by the system and queries generated for hybrid human-machine responders.</figDesc><graphic coords="4,102.57,83.69,390.14,143.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Example 3 . 3 (</head><label>33</label><figDesc>Data Extraction Query).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4</head><label>4</label><figDesc>shows an example of query for Data Extraction tasks. The uncertainty of an extraction rule generated by wrapper inference can be validated by checking the extracted value on a detail page by means of a query such as: "Is '1050$' a Price?", where Price is a candidate label for the extraction rule and '1050$' is the extracted value.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Running example (distance similarity): 5a and 5c show distances in pyramids; 5b and 5d show the relations in the Cartesian plane, where uncertainties are due to the breaking of LC in a non-separable domain.</figDesc><graphic coords="5,53.80,84.89,112.17,98.09" type="bitmap" /></figure>
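The separability condition broken in Examples 3.8 and 3.9 amounts to requiring that every correct match/linkage pair be closer than every incorrect pair, so that a single threshold splits them. A minimal sketch follows; the function name is our assumption, and the distances reuse the running-example values (0.27 for a correct pair, 0.25 for an incorrect one):

```python
def is_separable(correct_distances, incorrect_distances):
    """A domain is separable when some threshold puts every correct
    match/linkage pair below it and every incorrect pair above it."""
    return max(correct_distances) < min(incorrect_distances)

# In the running example, the correct match (r1_5, r2_4) sits at 0.27
# while the incorrect pair (r1_5, r1_6) sits at 0.25, so no threshold
# separates them and the domain is non-separable.
```

When this check fails, no setting of the distance threshold can avoid either a missed match or an LC violation, which is exactly the situation where Noah falls back on pipeline queries.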
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">The viability of crowdsourcing for survey research</title>
		<author>
			<persName><forename type="first">Tara</forename><forename type="middle">S</forename><surname>Behrend</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">J</forename><surname>Sharek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><forename type="middle">W</forename><surname>Meade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><forename type="middle">N</forename><surname>Wiebe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Behavior research methods</title>
		<imprint>
			<biblScope unit="volume">43</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">800</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Query-oriented data cleaning with oracles</title>
		<author>
			<persName><forename type="first">Moria</forename><surname>Bergman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tova</forename><surname>Milo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Slava</forename><surname>Novgorodov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wang-Chiew</forename><surname>Tan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data</title>
				<meeting>the 2015 ACM SIGMOD International Conference on Management of Data</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1199" to="1214" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Open information extraction from question-answer pairs</title>
		<author>
			<persName><forename type="first">Nikita</forename><surname>Bhutani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yoshihiko</forename><surname>Suhara</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wang-Chiew</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alon</forename><surname>Halevy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">V</forename><surname>Jagadish</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1903.00172</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Extraction and integration of partially overlapping web sources</title>
		<author>
			<persName><forename type="first">Mirko</forename><surname>Bronzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Papotti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="805" to="816" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Crowdsourcing for data management</title>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alvaro</forename><forename type="middle">A A</forename><surname>Fernandes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Norman</forename><forename type="middle">W</forename><surname>Paton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Knowledge and Information Systems</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="41" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">RoadRunner: automatic data extraction from data-intensive web sites</title>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giansalvatore</forename><surname>Mecca</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2002 ACM SIGMOD international conference on Management of data</title>
				<meeting>the 2002 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="624" to="624" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Hybrid Crowd-Machine Wrapper Inference</title>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Merialdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Disheng</forename><surname>Qiu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Knowledge Discovery from Data (TKDD)</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="43" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">An analysis of structured data on the web</title>
		<author>
			<persName><forename type="first">Nilesh</forename><surname>Dalvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashwin</forename><surname>Machanavajjhala</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bo</forename><surname>Pang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1203.6406</idno>
		<imprint>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Human-in-the-Loop Data Analysis: A Personal Perspective</title>
		<author>
			<persName><forename type="first">Anhai</forename><surname>Doan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3209900.3209913</idno>
		<ptr target="https://doi.org/10.1145/3209900.3209913" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA&apos;18)</title>
				<meeting>the Workshop on Human-In-the-Loop Data Analytics (HILDA&apos;18)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Toward a System Building Agenda for Data Integration</title>
		<author>
			<persName><forename type="first">Anhai</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adel</forename><surname>Ardalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><forename type="middle">R</forename><surname>Ballard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjib</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yash</forename><surname>Govind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pradap</forename><surname>Konda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Han</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erik</forename><surname>Paulson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><surname>Suganthan G. C.</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Haojun</forename><surname>Zhang</surname></persName>
		</author>
		<idno>arXiv:cs.DB/1710.00027</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Principles of data integration</title>
		<author>
			<persName><forename type="first">Anhai</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alon</forename><surname>Halevy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zachary</forename><surname>Ives</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A Few Useful Things to Know about Machine Learning</title>
		<author>
			<persName><forename type="first">Pedro</forename><surname>Domingos</surname></persName>
		</author>
		<idno type="DOI">10.1145/2347736.2347755</idno>
		<ptr target="https://doi.org/10.1145/2347736.2347755" />
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="78" to="87" />
			<date type="published" when="2012-10">Oct. 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Knowledge vault: A web-scale approach to probabilistic knowledge fusion</title>
		<author>
			<persName><forename type="first">Xin</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Evgeniy</forename><surname>Gabrilovich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geremy</forename><surname>Heitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wilko</forename><surname>Horn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ni</forename><surname>Lao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kevin</forename><surname>Murphy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Strohmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shaohua</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</title>
				<meeting>the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="601" to="610" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Less is more: Selecting sources wisely for integration</title>
		<author>
			<persName><forename type="first">Xin</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Barna</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Divesh</forename><surname>Srivastava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the VLDB Endowment</title>
				<meeting>the VLDB Endowment</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="37" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Big data integration</title>
		<author>
			<persName><forename type="first">Xin</forename><forename type="middle">Luna</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Divesh</forename><surname>Srivastava</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthesis Lectures on Data Management</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="198" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Identifying relations for open information extraction</title>
		<author>
			<persName><forename type="first">Anthony</forename><surname>Fader</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Soderland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oren</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2011 conference on empirical methods in natural language processing</title>
				<meeting>the 2011 conference on empirical methods in natural language processing</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1535" to="1545" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Foundations of data quality management</title>
		<author>
			<persName><forename type="first">Wenfei</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Floris</forename><surname>Geerts</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Synthesis Lectures on Data Management</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="1" to="217" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">DIADEM: thousands of websites to a single database</title>
		<author>
			<persName><forename type="first">Tim</forename><surname>Furche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georg</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giovanni</forename><surname>Grasso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaonan</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giorgio</forename><surname>Orsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Schallhart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cheng</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">14</biblScope>
			<biblScope unit="page" from="1845" to="1856" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Corleone: hands-off crowdsourcing for entity matching</title>
		<author>
			<persName><forename type="first">Chaitanya</forename><surname>Gokhale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjib</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anhai</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><forename type="middle">F</forename><surname>Naughton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Narasimhan</forename><surname>Rampalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jude</forename><surname>Shavlik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaojin</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 ACM SIGMOD international conference on Management of data</title>
				<meeting>the 2014 ACM SIGMOD international conference on Management of data</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="601" to="612" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Web-scale information extraction with vertex</title>
		<author>
			<persName><forename type="first">Pankaj</forename><surname>Gulhane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amit</forename><surname>Madaan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rupesh</forename><surname>Mehta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeyashankher</forename><surname>Ramamirtham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rajeev</forename><surname>Rastogi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sandeep</forename><surname>Satpal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Srinivasan</forename><forename type="middle">H</forename><surname>Sengamedu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ashwin</forename><surname>Tengli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Charu</forename><surname>Tiwari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 27th International Conference on Data Engineering. IEEE</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1209" to="1220" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">RED: Redundancy-Driven Data Extraction from Result Pages?</title>
		<author>
			<persName><forename type="first">Jinsong</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Valter</forename><surname>Crescenzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Furche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Giovanni</forename><surname>Grasso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georg</forename><surname>Gottlob</surname></persName>
		</author>
		<idno type="DOI">10.1145/3308558.3313529</idno>
		<ptr target="https://doi.org/10.1145/3308558.3313529" />
	</analytic>
	<monogr>
		<title level="m">The World Wide Web Conference, WWW 2019</title>
				<editor>
			<persName><forename type="first">Ling</forename><surname>Liu</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ryen</forename><forename type="middle">W</forename><surname>White</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Amin</forename><surname>Mantrach</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Fabrizio</forename><surname>Silvestri</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Julian</forename><forename type="middle">J</forename><surname>Mcauley</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Ricardo</forename><surname>Baeza-Yates</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Leila</forename><surname>Zia</surname></persName>
		</editor>
		<meeting><address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2019-05-13">May 13-17, 2019</date>
			<biblScope unit="page" from="605" to="615" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Synthesizing extraction rules from user examples with seer</title>
		<author>
			<persName><forename type="first">Maeda</forename><forename type="middle">F</forename><surname>Hanafi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Azza</forename><surname>Abouzied</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laura</forename><surname>Chiticariu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yunyao</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM International Conference on Management of Data</title>
				<meeting>the 2017 ACM International Conference on Management of Data</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1687" to="1690" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">How to measure data quality? A metric-based approach</title>
		<author>
			<persName><forename type="first">Bernd</forename><surname>Heinrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcus</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mathias</forename><surname>Klier</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">The rise of crowdsourcing</title>
		<author>
			<persName><forename type="first">Jeff</forename><surname>Howe</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Wired magazine</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1" to="4" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Wrapper induction for information extraction</title>
		<author>
			<persName><forename type="first">Nicholas</forename><surname>Kushmerick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">S</forename><surname>Weld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Doorenbos</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1997">1997</date>
		</imprint>
		<respStmt>
			<orgName>University of Washington</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Human-in-the-Loop Data Integration</title>
		<author>
			<persName><forename type="first">Guoliang</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.14778/3137765.3137833</idno>
		<ptr target="https://doi.org/10.14778/3137765.3137833" />
	</analytic>
	<monogr>
		<title level="j">Proc. VLDB Endow</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="2006" to="2017" />
			<date type="published" when="2017-08">August 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Crowdsourced Data Management: Overview and Challenges</title>
		<author>
			<persName><forename type="first">Guoliang</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yudian</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ju</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jiannan</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Reynold</forename><surname>Cheng</surname></persName>
		</author>
		<idno type="DOI">10.1145/3035918.3054776</idno>
		<ptr target="https://doi.org/10.1145/3035918.3054776" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM International Conference on Management of Data (SIGMOD &apos;17)</title>
				<meeting>the 2017 ACM International Conference on Management of Data (SIGMOD &apos;17)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1711" to="1716" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">Colin</forename><surname>Lockard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xin</forename><forename type="middle">Luna</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Arash</forename><surname>Einolghozati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prashant</forename><surname>Shiralkar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1804.04635</idno>
		<title level="m">Ceres: Distantly supervised relation extraction from the semi-structured web</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Openceres: When open information extraction meets the semi-structured web</title>
		<author>
			<persName><forename type="first">Colin</forename><surname>Lockard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Prashant</forename><surname>Shiralkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xin</forename><forename type="middle">Luna</forename><surname>Dong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3047" to="3056" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Crowdsourced data management: Industry and academic perspectives</title>
		<author>
			<persName><forename type="first">Adam</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aditya</forename><surname>Parameswaran</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Foundations and Trends in Databases</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="1" to="161" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Open information extraction systems and downstream applications</title>
		<author>
			<persName><surname>Mausam</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the twenty-fifth international joint conference on artificial intelligence</title>
				<meeting>the twenty-fifth international joint conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="4074" to="4077" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">A survey on bias and fairness in machine learning</title>
		<author>
			<persName><forename type="first">Ninareh</forename><surname>Mehrabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fred</forename><surname>Morstatter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nripsuta</forename><surname>Saxena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kristina</forename><surname>Lerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aram</forename><surname>Galstyan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.09635</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">Christina</forename><surname>Niklaus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthias</forename><surname>Cetto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">André</forename><surname>Freitas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Siegfried</forename><surname>Handschuh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1806.05599</idno>
		<title level="m">A survey on open information extraction</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">A Survey of Approaches to Automatic Schema Matching</title>
		<author>
			<persName><forename type="first">Erhard</forename><surname>Rahm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">A</forename><surname>Bernstein</surname></persName>
		</author>
		<idno type="DOI">10.1007/s007780100057</idno>
		<ptr target="https://doi.org/10.1007/s007780100057" />
	</analytic>
	<monogr>
		<title level="j">VLDB J</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="334" to="350" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">User Interface Design for Crowdsourcing Systems</title>
		<author>
			<persName><forename type="first">Bahareh</forename><surname>Rahmanian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Joseph</forename><forename type="middle">G</forename><surname>Davis</surname></persName>
		</author>
		<idno type="DOI">10.1145/2598153.2602248</idno>
		<ptr target="https://doi.org/10.1145/2598153.2602248" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces (AVI &apos;14)</title>
				<meeting>the 2014 International Working Conference on Advanced Visual Interfaces (AVI &apos;14<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="405" to="408" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Snorkel: Fast training set generation for information extraction</title>
		<author>
			<persName><forename type="first">Alexander</forename><forename type="middle">J</forename><surname>Ratner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><forename type="middle">H</forename><surname>Bach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Henry</forename><forename type="middle">R</forename><surname>Ehrenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher</forename><surname>Ré</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 ACM international conference on management of data</title>
				<meeting>the 2017 ACM international conference on management of data</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1683" to="1686" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Open language learning for information extraction</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Schmitz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Soderland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Bart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oren</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</title>
				<meeting>the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="523" to="534" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Democratizing Data Science through Interactive Curation of ML Pipelines</title>
		<author>
			<persName><forename type="first">Zeyuan</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Emanuel</forename><surname>Zgraggen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Benedetto</forename><surname>Buratti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ferdinand</forename><surname>Kossmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philipp</forename><surname>Eichmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yeounoh</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carsten</forename><surname>Binnig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eli</forename><surname>Upfal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tim</forename><surname>Kraska</surname></persName>
		</author>
		<idno type="DOI">10.1145/3299869.3319863</idno>
		<ptr target="https://doi.org/10.1145/3299869.3319863" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 International Conference on Management of Data (SIGMOD &apos;19)</title>
				<meeting>the 2019 International Conference on Management of Data (SIGMOD &apos;19)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1171" to="1188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Schema-Agnostic Entity Matching using Pre-trained Language Models</title>
		<author>
			<persName><forename type="first">Kai-Sheng</forename><surname>Teong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lay-Ki</forename><surname>Soon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tin Tin</forename><surname>Su</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 29th ACM International Conference on Information &amp; Knowledge Management</title>
				<meeting>the 29th ACM International Conference on Information &amp; Knowledge Management</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2241" to="2244" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<monogr>
		<title level="m" type="main">Learning to learn</title>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Thrun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lorien</forename><surname>Pratt</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2012">2012</date>
			<publisher>Springer Science &amp; Business Media</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">Accelerating Human-in-the-Loop Machine Learning: Challenges and Opportunities</title>
		<author>
			<persName><forename type="first">Doris</forename><surname>Xin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Litian</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jialin</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stephen</forename><surname>Macke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shuchen</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aditya</forename><surname>Parameswaran</surname></persName>
		</author>
		<idno type="DOI">10.1145/3209889.3209897</idno>
		<ptr target="https://doi.org/10.1145/3209889.3209897" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning (DEEM&apos;18)</title>
				<meeting>the Second Workshop on Data Management for End-To-End Machine Learning (DEEM&apos;18)<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">Federated machine learning: Concept and applications</title>
		<author>
			<persName><forename type="first">Qiang</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yang</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tianjian</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yongxin</forename><surname>Tong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology (TIST)</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="1" to="19" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
