<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bridging the Gap between Buyers and Sellers in Data Marketplaces with Personalized Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Donatella Firmani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jerin George Mathew</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donatello Santoro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Simonini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Zecchini</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Basilicata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Sharing, discovering, and integrating data is a crucial task that poses many challenges and open research directions. Data owners need to know what data consumers want, and data consumers need to find datasets that are satisfactory for their tasks. Several data market platforms, or data marketplaces (DMs), have been used so far to facilitate data transactions between data owners and customers. However, current DMs are mostly shop windows, where customers have to rely on metadata that owners manually curate to discover useful datasets, and there is no automated mechanism for owners to determine if their data could be merged with other datasets to satisfy customers' desiderata. The availability of novel artificial intelligence techniques for data management has sparked a renewed interest in proposing new DMs that stray from this conventional paradigm and overcome its limitations. This paper envisions a conceptual framework called DataStreet where DMs can create personalized datasets by combining available datasets and presenting summarized statistics to help users make informed decisions. In our framework, owners share some of their data with a trusted DM, and customers provide a dataset template to fuel content-based (rather than metadata-based) search queries. Upon each query, the DM creates a preview of the personalized dataset through a flexible use of dataset discovery, integration, and value measurement, while ensuring owners' fair treatment and preserving privacy. The previewed datasets might not be pre-defined in the DM and are only materialized upon a successful transaction.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Market</kwd>
        <kwd>Data Integration</kwd>
        <kwd>Fairness</kwd>
        <kwd>Privacy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Data marketplaces (DMs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are digital platforms where data buyers and sellers can interact to
exchange data. These platforms have become increasingly popular in recent years, driven by
practitioners' growing demand for huge amounts of data—e.g., to train their machine
learning (ML) models. Conventional DMs act as storefronts for data vendors, with buyers
relying only on the provided metadata, documentation, and off-the-shelf samples to determine
whether the data is useful for them. This model, which puts solely on the buyer the burden of
determining the usefulness of the data and defining how to combine multiple datasets, presents
significant shortcomings.
      </p>
      <p>First, the discovery of valuable data is based only on metadata, documentation, and
occasionally small data samples. As the content of data sources is undisclosed and not indexed,
buyers cannot search by content and thus cannot determine if the data on sale aligns with their
proprietary data. Secondly, the potential of the data is limited, as a single data source might
not meet the buyer’s needs—while multiple data sources from different sellers might succeed
instead. Nonetheless, there is no mechanism for sellers to securely share a portion of the data
to determine beforehand if it suits the buyer’s needs. Finally, potential buyers cannot test the
data before buying it, making it impossible for them to explore or manipulate the data on sale
to select the best data source for their needs (e.g., by computing a feature correlation matrix to
determine if it would be an appropriate training set for their model).</p>
      <p>Challenges. An enduring challenge in the DM field is the definition of a horizontal framework
for implementing the operations described above to support the matching between the supply
of sellers and the demand of buyers. Recently, a few works addressed this issue by proposing new
models that deviate from the conventional DM-as-a-shop-window paradigm to reduce the time
and effort required to search and prepare datasets for both buyers and sellers, streamlining
the process of price negotiation, promoting trust in data management, and facilitating data
trading across sectors and countries.</p>
      <p>Our approach. Following the aforementioned line of research that strays from the usual
DM-as-a-shop-window paradigm, we propose a novel conceptual framework called DataStreet,
which aims at matching buyers’ needs with sellers’ offers. DataStreet envisions
a new DM model where sellers share their data with a trusted DM, with the assurance that
it will only be disclosed after an agreement with the buyer. Buyers provide the DM with
a template of the dataset they wish to purchase and can specify quality metrics the dataset
should meet. The DM then creates personalized datasets by combining sellers’ datasets and
presents summarized statistics to help buyers select the dataset that best suits their needs.
The concept of personalized datasets is crucial to our system and entails an integrated view
over various datasets in the DM that fit with the buyer’s query. These personalized datasets are
only materialized upon a completed transaction, resulting in no additional storage overhead.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>We can identify three related lines of work that our contribution builds upon and extends.</p>
      <p>
        Conventional approaches to DMs. The concept of data vendors and public data has been
present since the beginning of the web; however, in recent years there has been a remarkable
surge in the volume of generated and processed data. The practice of trading data has long
caught the interest of both economists and the database community, with discussions on this
topic going on for decades [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As a result, several proposals for data market platforms gained
significant popularity in enabling interactions between data buyers and sellers. Some of these
proposals can be regarded as barter data markets, in which online providers offer services (such
as email, web searches, and social networks) to users in exchange for their data. On the other
hand, broker data markets gather anonymized customer data and sell it to marketers, allowing
for more relevant ads to be delivered to consumers with better measurement (e.g., Acxiom1,
Experian2). In both cases, the market controller has a dominant position and typically retains
most of the exchange benefits. Lately, online DMs have emerged [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with the aim of enhancing
the value of data for a wider range of participants. These platforms are primarily designed
to connect data sellers and buyers, serving as intermediaries to facilitate data exchanges and
transactions. This two-sided nature of the platforms allows organizations with a high demand
for data to extract value from the available datasets that are offered for purchase.
      </p>
      <p>
        Limitations of traditional approaches to DMs. One of the main shortcomings of current
DMs (e.g., AWS Marketplace3, Dawex4, Snowflake5, etc.) is that they act as a storefront, where
sellers can display their data and buyers can browse a centralized collection to find appealing
datasets. Under this conventional model, the responsibility of integrating and matching relevant
data purchased from multiple providers to meet the desired target is on the buyer. In other
words, suppliers offer consumers “raw data”, leaving the task of effectively converting this raw
data into valuable insights to the consumers, who might not dispose of the required expertise
or resources. We argue that this imbalance between providers and customers is one of the
primary factors restricting DM adoption. It is also important to note that several data markets
are focused on specific industries (such as martech, automotive, and energy) and data types (like
spatio-temporal data or data sourced from IoT sensors). These niche DMs are more difficult to
locate and access compared to general-purpose DMs. In most situations, buyers might have to
interact with multiple platforms to fulfill their requirements [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Furthermore, every transaction
requires a one-on-one negotiation, resulting in increased final costs and, more significantly,
making it challenging for the buyer to determine the quality of the final integrated dataset.
      </p>
      <p>
        Recent approaches to improve DMs. Recent vision papers such as [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] recognize that
moving away from the DM-as-a-shop-window paradigm is a crucial step towards the next
generation of DMs. These papers emphasize the importance of data sharing, discovery, and
integration, and propose an intermediary framework between sellers and buyers known as
arbiter or broker. However, we note that these works have limited support in terms of (i)
real-time performance, (ii) revenue sharing among sellers, (iii) reducing the effort on the consumer
side, and (iv) enforcing responsibility by design. The DataStreet framework envisions the use
of recent advancements in data management based on the application of artificial intelligence
tools, such as [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ], to enable the aforementioned advancements while also addressing the
current limitations of these vision papers.
      </p>
      <sec id="sec-2-1">
        <title>Footnotes</title>
        <p>1: https://www.acxiom.com; 2: https://www.experian.in; 3: https://aws.amazon.com/data-exchange; 4: https://www.dawex.com; 5: https://www.snowflake.com/en/data-cloud/marketplace</p>
        <p>[Figure 1: Overview of the DataStreet workflow, in which a buyer submits a query to and negotiates with the DataStreet broker.]</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The DataStreet Framework</title>
      <p>For the sake of clarity, the discussion mainly focuses on structured tabular data, where the
terms attributes and features are used interchangeably to refer to table columns, while the terms
records and instances are used to refer to table rows. However, the proposed framework can also
be adapted to semi-structured data with minor modifications. Our DM paradigm, illustrated in
Figure 1, comprises four primary steps.</p>
      <p>The first one involves sellers sharing with the broker details of their dataset features along
with a selection or projection of instances. If sellers lack confidence in sharing certain parts of
the dataset, those parts are handled separately using privacy-preserving mechanisms. The second step
involves buyers submitting to the broker a query that includes a view of the desired dataset
along with a combination of attribute names and preferences, which can be expressed both in
structured form and in natural language—e.g., a query could be the following:</p>
      <sec id="sec-3-1">
        <title>Target Schema: City, Country, Salary, Age and MortgageApproval.</title>
        <p>Preferences: recent data, with at least half from USA and ≥ 20% from New York.
Buyers can also search for instances, e.g., a given city, country, salary/age range, and mortgage
status, by including them in the query. The third step involves the broker returning a preview
of different datasets that match the query. Each preview contains a sample of the dataset’s
instances, depending on the disclosure preferences of the sellers. The datasets are ranked based
on a utility function that ensures fair treatment of sellers and provides adequate economic
opportunities. The fourth and final step involves each dataset having an estimated value used
to determine the price. If one of the datasets satisfies the buyer, the broker proposes a fair share
to all sellers: a transaction is completed if they are satisfied with the share.</p>
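<p>To make the second step concrete, the query above could be represented as a structured object carrying the target schema, the soft preferences, and any instance-level filters. The following is a minimal sketch; the <code>BuyerQuery</code> class and its field names are hypothetical illustrations, not part of a DataStreet specification.</p>

```python
from dataclasses import dataclass, field

@dataclass
class BuyerQuery:
    """Hypothetical representation of a DataStreet buyer query."""
    target_schema: list                                    # desired attribute names
    preferences: dict = field(default_factory=dict)        # soft constraints
    instance_filters: dict = field(default_factory=dict)   # instance-level constraints

# The running example: target schema plus freshness and distribution preferences.
query = BuyerQuery(
    target_schema=["City", "Country", "Salary", "Age", "MortgageApproval"],
    preferences={"freshness": "recent",
                 "min_share": {"Country": {"USA": 0.5}, "City": {"New York": 0.2}}},
    instance_filters={"Age": (25, 60)},
)
```

<p>Natural-language preferences such as “recent data” would first be parsed into entries of this kind before the broker can act on them.</p>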
        <p>It should be noted that the datasets pre-defined in the DM might not immediately match
the buyer’s preferences, while multiple datasets might need to be integrated to accommodate
them. Thus, in the DataStreet framework, the broker identifies all relevant datasets owned
by various sellers and combines them into a preview that aligns with the buyer’s preferences.
These merged datasets are referred to as personalized datasets and are only created upon the
successful completion of a transaction, requiring no additional storage overhead. To illustrate
the framework’s architecture, we present a diagram in Figure 2, along with a group of sample
datasets that will be used as a reference throughout the discussion.</p>
        <p>Figure 2 depicts a scenario where a buyer B submits a query to build an ML classifier that
predicts whether a mortgage request is accepted or rejected. The label or class for each instance
is represented by the MortgageApproval attribute. Additionally, the figure displays three
datasets, D1, D2, and D3, belonging to three sellers, S1, S2, and S3, respectively. Each dataset
has distinct attributes, with D1 containing information about City, Income, DateOfBirth, and
SSN, D2 presenting the AnonymizedSSN, BankAccount, and MortgageApproval attributes, and
D3 consisting of the SSN and BankAccount features.</p>
        <p>Our conceptual framework consists of three main blocks, dubbed (1) Personalized Data
Discovery, (2) Personalized Data Integration, and (3) Negotiation. The first module will discover
the datasets D1 and D2 as relevant for the target schema. Features do not need to appear exactly
as specified. For instance, Salary might appear under the name Income and in place of Age we
might have DateOfBirth, from which we can infer the Age column. The second module will
merge and integrate the datasets with the help of other datasets in the DM. For example, since
the SSN attribute of D1 cannot be directly linked to the AnonymizedSSN attribute of D2, we
first need to link D2 to D3 via the BankAccount attribute in order to get an SSN-MortgageApproval
association. Both the BankAccount and the SSN attributes might contain errors. This module
will also identify a sample of instances that can be previewed and rank the resulting personalized
datasets with respect to potential other datasets. Finally, the third module will be responsible
for assessing the value of the personalized dataset and sharing the revenue among the sellers.
Value assessment will be based on the quality of the personalized dataset – including matching
accuracy and seller reputation – and possibly on the results of testing a model trained on the
dataset, if the buyer specifies so. Revenue sharing will include all the sellers contributing to
the dataset, both with actual rows and columns (i.e., S1 and S2) and with auxiliary information
(e.g., S3 – in fact, none of the data owned by S3 appears in the personalized dataset). While in
current DMs a buyer has to search the catalogs of existing datasets and the effort of discovery
is split between the buyer (who has to assess the relevance of each dataset) and the sellers (who
have to make their datasets discoverable), DMs equipped with our framework will have a major
advantage in solving the problem of matching offer and demand with low user effort, without
compromising on privacy and fairness. We will now discuss the modules in more detail.</p>
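<p>The linkage chain just described (D2 to D3 on BankAccount, then D3 to D1 on SSN) can be sketched with ordinary joins. The toy data below is illustrative and assumes, for simplicity, that the identifiers are available in clear; in practice the privacy-preserving techniques discussed in Section 3.2 would be applied instead.</p>

```python
import pandas as pd

# Toy versions of the three sellers' datasets from Figure 2.
D1 = pd.DataFrame({"SSN": ["A1", "B2"], "City": ["New York", "Rome"],
                   "Income": [70000, 40000], "DateOfBirth": ["1990-01-01", "1985-06-15"]})
D2 = pd.DataFrame({"AnonymizedSSN": ["x9", "y7"], "BankAccount": ["IT01", "US02"],
                   "MortgageApproval": ["approved", "rejected"]})
D3 = pd.DataFrame({"SSN": ["A1", "B2"], "BankAccount": ["US02", "IT01"]})

# D2 cannot be joined with D1 directly, so first link D2 to D3 on BankAccount
# to recover an SSN-MortgageApproval association, then join with D1 on SSN.
linked = D2.merge(D3, on="BankAccount").merge(D1, on="SSN")
personalized = linked[["City", "Income", "DateOfBirth", "MortgageApproval"]]
```

<p>The resulting <code>personalized</code> table matches the buyer’s target schema even though no single seller’s dataset does, and S3’s data never appears in it.</p>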
        <sec id="sec-3-1-1">
          <title>3.1. Personalized Data Discovery</title>
          <p>
            In our DM model, each buyer’s query has a target schema, consisting of a set of attributes
that may not readily align with the datasets owned by the sellers. Attributes might appear
transformed or under a different name, and some of them might be owned by different sellers,
hence requiring a subsequent linkage of the instances [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. Thus, the first step in DataStreet is to
automatically discover the relevant sources and to find mappings that match the target schema.
Research challenge 1. Equally valid solutions to the mapping problem can have different
degrees of correspondence with the buyer’s subjective preferences.
          </p>
          <p>
            In our running example, preferences are related to the freshness of data and some attribute
values. Other preferences might include a constraint on the model accuracy or the presence of
specific rows in the data. Continuing the example, there might be in the DM multiple datasets
with the same features as D2, dubbed D2.1, D2.2, ..., D2.n, each with a different distribution of
cities, ages and labels. This module is responsible for identifying D1 and the datasets among
D2.1, D2.2, ..., D2.n that, when combined with D1, can result in a dataset D* that better matches
the preferences specified in the buyer’s query. In principle, this can be achieved even
using simple preference criteria, such as instance diversity [
            <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
            ]. However, enabling the
framework to incorporate subjective preferences (i.e., those expressed by buyers in their own
words, such as “recent data”) is known to be a challenging task. In fact, despite the existence of
several research directions in data discovery [
            <xref ref-type="bibr" rid="ref12 ref13 ref14 ref6">12, 6, 13, 14</xref>
            ], which are capable of dealing with
natural language ambiguity and the diverse ways in which humans represent information, none
of these techniques can handle subjective data. While these methods can assist in identifying
multiple relevant datasets, it is up to the end users to select a subset that meets their needs. In
our proposed framework, we aim to reduce human effort, which is currently beyond the scope
of the existing algorithms.
          </p>
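<p>At the schema level, the discovery module must map target attributes to (possibly renamed) seller columns. The sketch below illustrates the idea with a hand-written synonym table and string similarity; the <code>SYNONYMS</code> dictionary and the threshold are illustrative stand-ins for the embedding-based semantic matching a real system would use.</p>

```python
from difflib import SequenceMatcher

# Hand-written stand-in for semantic matching (a real system would use embeddings).
SYNONYMS = {"salary": {"income", "wage"}, "age": {"dateofbirth"}}

def attribute_score(target, candidate):
    """Score a target/candidate attribute-name pair in [0, 1]."""
    t, c = target.lower(), candidate.lower()
    if t == c or c in SYNONYMS.get(t, set()):
        return 1.0
    return SequenceMatcher(None, t, c).ratio()

def match_schema(target_schema, dataset_columns, threshold=0.6):
    """Map each target attribute to its best-scoring column, if above threshold."""
    mapping = {}
    for t in target_schema:
        best = max(dataset_columns, key=lambda c: attribute_score(t, c))
        if attribute_score(t, best) >= threshold:
            mapping[t] = best
    return mapping

# Discover how D1's columns cover part of the target schema of the running example.
mapping = match_schema(["City", "Salary", "Age"], ["City", "Income", "DateOfBirth", "SSN"])
```

<p>Here Salary is matched to Income and Age to DateOfBirth, mirroring the transformations described above; incorporating subjective preferences such as “recent data” remains the open part of the challenge.</p>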
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Personalized Data Integration</title>
          <p>This module is responsible for producing a ranking of datasets according to the specifications of
the buyers. It is important to recall that the datasets returned by the broker may not match any
data that currently exists in the DM catalog. Additionally, the datasets presented in the ranking
may contain only a sample of the instances, in accordance with the data disclosure preferences
of sellers. Thus, we must integrate discovered datasets from previous steps into previews.
Research challenge 2. Executing data integration on the entire raw data and subsequently
applying sampling methods to generate the preview may not be computationally feasible.</p>
          <p>
            This module must incorporate appropriate data integration methods that can minimize the
computation time by only executing the necessary operations for generating the preview of the
personalized datasets. These methods must deliver real-time results to a buyer’s query as if the
personalized datasets were already available and materialized in the DM. To achieve this, we
will build on existing data integration [
            <xref ref-type="bibr" rid="ref15 ref16 ref17 ref18">15, 16, 17, 18</xref>
            ] as well as other recent techniques, such as
join paths [
            <xref ref-type="bibr" rid="ref19">19</xref>
] to devise a custom indexing system to ensure rapid access to a collection of
potentially relevant views in the DM, while still protecting data privacy. This group of potential
data views will be represented as a knowledge graph (KG) in DataStreet and will be maintained
using KG completion techniques based on representation learning, such as those in [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ].
Research challenge 3. Another crucial aspect in this context is preserving the privacy of the
data during the linkage (i.e., integration) step.
          </p>
          <p>
            In fact, attributes such as SSN and BankAccount in Figure 2, which serve as identifiers or
quasi-identifiers in our running example, play a crucial role in locating records in different datasets.
However, due to their sensitive nature, some sellers may be reluctant to share these attribute
values with the DM broker. To address this issue, the Personalized Data Integration module
will incorporate appropriate privacy-preserving record linkage (PPRL) techniques to meet the
diverse privacy requirements of various sellers. PPRL techniques enable data linkage while
safeguarding sensitive information against being exposed during the linkage process. Notably,
there exists a vast array of PPRL methods in the literature, which leverage techniques such as
cryptography [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ], ML [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ], and probabilistic data structures [
            <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
            ]; however, these methods
can be challenging to use and require substantial effort on the part of the data owner [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ].
          </p>
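<p>One widely studied family of PPRL methods encodes quasi-identifiers as Bloom filters of character bigrams and compares the encodings with a set similarity such as the Dice coefficient, so that the broker never observes the raw values. A minimal sketch follows; the filter size and number of hash functions are illustrative, not tuned parameters.</p>

```python
import hashlib

def bigrams(s):
    """Character bigrams of a string, lowercased."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bloom_encode(value, size=256, k=2):
    """Encode a quasi-identifier as the set of Bloom-filter bit positions."""
    bits = set()
    for g in bigrams(value):
        for i in range(k):
            h = hashlib.sha256(f"{i}:{g}".encode()).hexdigest()
            bits.add(int(h, 16) % size)
    return bits

def dice(a, b):
    """Dice coefficient between two bit-position sets."""
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0

# The broker compares encodings, never the raw SSN values.
enc1 = bloom_encode("123-45-6789")
enc2 = bloom_encode("123-45-6789")
enc3 = bloom_encode("987-65-4321")
```

<p>Records whose encodings exceed a similarity threshold are linked; tolerance to errors in SSN or BankAccount values comes from the bigram encoding, at the cost of the known trade-offs between privacy and linkage quality.</p>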
          <p>Finally, there can be multiple ways to generate a personalized dataset D* that satisfies a
buyer’s query, and therefore they need to be ordered based on utility criteria such as value,
accuracy, and cost. Placing a dataset in a top position increases the chances of a data transaction.
Research challenge 4. Regardless of how utility is defined, ranking systems have a
responsibility not only to buyers but also to sellers being ranked.</p>
          <p>Even small differences in utility can result in significant differences in economic opportunities
across seller groups. These groups can be based on demographic categories or any other category
at risk of systematic discrimination. To achieve our vision, it is essential to ensure fair exposure
of sellers in the DM without compromising the utility of the ranking for the buyer.</p>
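<p>The tension between ranking utility and seller exposure can be illustrated with a toy greedy re-ranking that cycles exposure across seller groups; the grouping and the strategy below are purely illustrative, whereas a real system would optimize exposure under formal fairness constraints.</p>

```python
def fair_rerank(datasets):
    """Greedy re-ranking: among the groups least exposed so far, pick the
    highest-utility remaining dataset, so exposure is spread across groups."""
    remaining = sorted(datasets, key=lambda d: -d["utility"])
    exposure, ranking = {}, []
    while remaining:
        min_exp = min(exposure.get(d["group"], 0) for d in remaining)
        pick = next(d for d in remaining if exposure.get(d["group"], 0) == min_exp)
        ranking.append(pick)
        exposure[pick["group"]] = exposure.get(pick["group"], 0) + 1
        remaining.remove(pick)
    return ranking

# Two candidate personalized datasets from large sellers, one from a small seller.
candidates = [
    {"id": "D*a", "group": "large", "utility": 0.95},
    {"id": "D*b", "group": "large", "utility": 0.94},
    {"id": "D*c", "group": "small", "utility": 0.90},
]
ranked = fair_rerank(candidates)
```

<p>The small seller’s dataset is promoted above the second large-seller dataset despite a slightly lower utility, trading a small utility loss for a more balanced exposure across groups.</p>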
        </sec>
        <sec id="sec-3-1-3">
          <title>3.3. Value Assessment and Revenue Sharing</title>
          <p>
            Although personalized datasets can reduce the cost of finding useful data in a DM, negotiating
price and overcoming information asymmetry between buyers and sellers remains a major
challenge. The value of a dataset may differ between a seller and a buyer. For instance, sellers
may consider the effort they put into acquiring and preparing the data, while buyers might be
interested in how much the data will enhance a process and its overall value. Negotiations can
significantly affect real-world markets, and the notion of dataset value is fundamental for data
discovery, integration, and ranking. Pricing data is a complex issue that has been studied in
various fields such as economics, law, and business [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ]; however, to the best of our knowledge,
there has been limited emphasis on data-driven approaches to facilitate appropriate pricing.
Research challenge 5. Develop data-driven measures that enable transactions between buyers
and sellers without explicitly determining the individual components of the transaction price.
          </p>
          <p>In this context, we identify three main actions.</p>
          <p>
            Determining the value of the data. Measures of intrinsic value, based on traditional data
quality literature such as freshness and completeness [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ], as well as more recent social-minded
measures [
            <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
            ], become important for D* only if buyers specify their preferences in their
query. On the other hand, extrinsic measures can accommodate different sellers’ needs and
expectations. Examples include the buyers’ demand for a dataset and the sellers’ reputation.
Defining revenue sharing schemes. If all sellers contribute to the personalized dataset
D* with instances, we could use game-theory measures such as the Shapley value [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ] to
allocate revenue to each row. However, this is not always the case, as shown in our example
where datasets D1 and D2 have different identifier attributes, SSN and AnonymizedSSN, which
cannot be directly linked. Therefore, we need to use a dataset from another seller, S3, which
owns the attribute BankAccount and serves as an auxiliary identifier to disambiguate SSN
and AnonymizedSSN. Even though S3 does not contribute to D* with any row or column, it
plays a crucial role. To address this, our vision can be accomplished by developing sharing
algorithms based on the traditional notion of provenance [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ]. We will also draw upon the
concepts of necessity and sufficiency from the field of explainable ML [
            <xref ref-type="bibr" rid="ref32 ref33">32, 33</xref>
            ]. In the context of
DM, instances can be considered necessary if their absence results in the model’s inability to
produce the desired prediction, and attributes sufficient if they contribute to the
model’s ability to make accurate predictions.
          </p>
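<p>The role of an auxiliary seller such as S3 can be made precise with the Shapley value. Below is an exact permutation-based computation under a hypothetical characteristic function in which D* is only sellable when features (S1), labels (S2), and the linkage attribute (S3) are all present; the enumeration is exponential in the number of sellers and only suitable for tiny examples.</p>

```python
from itertools import permutations

def shapley(players, value):
    """Exact Shapley value: average marginal contribution over all orderings."""
    shares = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for order in perms:
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            shares[p] += value(frozenset(coalition)) - before
    return {p: s / len(perms) for p, s in shares.items()}

# Hypothetical characteristic function for the running example: D* has value
# only when S1's features, S2's labels, AND S3's linkage are all available.
def v(coalition):
    return 100.0 if coalition >= {"S1", "S2", "S3"} else 0.0

shares = shapley(["S1", "S2", "S3"], v)
```

<p>Under this function the three sellers split the revenue equally, even though S3 contributes no row or column to D*; provenance-based sharing schemes would generalize this by deriving the characteristic function from the rows, columns, and linkage steps each seller actually contributed.</p>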
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and Future Works</title>
      <p>In this paper, we proposed a conceptual framework called DataStreet to address the limitations
of conventional data marketplaces (DMs) by enabling personalized dataset creation through
a flexible use of dataset discovery, integration, and data value measurement. Our framework
allows DMs to combine sellers’ data into personalized datasets and present summarized statistics
to help buyers in making informed decisions, while also ensuring fair treatment of sellers and
privacy. Personalized datasets are created upon successful transactions and may correspond to
combinations of multiple datasets in the DM.</p>
      <p>The proposed framework is still in the conceptual stage, and there are several technical
challenges arising from such a novel user-centric perspective. As illustrated in Section 3,
state-of-the-art techniques are designed to be applied in scenarios where the value of data
artifacts is somewhat objective, as opposed to DataStreet, where the value of the data artifact
is also subjective and depends on the buyers’ and sellers’ perspectives. We believe that such
extension might enable critical human-in-the-loop interactions to ensure that the final product
is tailored to the needs and requirements of all the involved parties.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported in part by the Sapienza Research grant B83C22007180001 and the
SEED PNR 2021 grant FLOWER. Jerin George Mathew is financed by the Italian National PhD
Program in AI.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Azcoitia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Laoutaris</surname>
          </string-name>
          ,
          <article-title>A survey of data marketplaces and their business models</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Koutris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Upadhyaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          ,
          <article-title>Query-based data pricing</article-title>
          ,
          <source>JACM</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schomm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Stahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vossen</surname>
          </string-name>
          ,
          <article-title>Marketplaces for data: An initial survey</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. Castro</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Subramaniam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <article-title>Data market platforms: Trading data assets to solve data problems</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kennedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Subramaniam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Galhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <article-title>Revisiting online data markets in 2022</article-title>
          ,
          <source>ACM SIGMOD Record</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. Castro</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mansour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Qahtan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>Seeping semantics: Linking datasets using word embeddings for data discovery</article-title>
          ,
          <source>in: ICDE</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Suhara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Deep entity matching with pre-trained language models</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Rezig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhandari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fariha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vanterpool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gadepally</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <article-title>DICE: Data discovery by example</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Teniente</surname>
          </string-name>
          ,
          <article-title>Ontology-based mappings</article-title>
          ,
          <source>DKE</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nargesian</surname>
          </string-name>
          ,
          <article-title>Towards distribution-aware query answering in data markets</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mecella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Scannapieco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          ,
          <article-title>On the meaningfulness of “big data quality”</article-title>
          ,
          <source>DSE</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bogatu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Konstantinou</surname>
          </string-name>
          ,
          <article-title>Dataset discovery in data lakes</article-title>
          ,
          <source>in: ICDE</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. Castro</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Abedjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Koko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <article-title>Aurum: A data discovery system</article-title>
          ,
          <source>in: ICDE</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Giuzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Quintarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          ,
          <article-title>Indiana: An interactive system for assisting database exploration</article-title>
          ,
          <source>Information Systems</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Miranker</surname>
          </string-name>
          ,
          <article-title>A pay-as-you-go methodology for ontology-based data access</article-title>
          ,
          <source>IEEE Internet Computing</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Online entity resolution using an oracle</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Simonini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gagliardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          ,
          <article-title>Scaling entity resolution: A loosely schema-aware approach</article-title>
          ,
          <source>Information Systems</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Simonini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zecchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <article-title>Entity resolution on-demand</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2022</year>
          )
          <fpage>1506</fpage>
          -
          <lpage>1518</lpage>
          . doi: 10.14778/3523210.3523226.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Ouellette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sciortino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nargesian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. G.</given-names>
            <surname>Bashardoost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Ronin: Data lake exploration</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matinata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <article-title>Knowledge graph embedding for link prediction: A comparative analysis</article-title>
          ,
          <source>ACM TKDD</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kantarcioglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghinita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bertino</surname>
          </string-name>
          ,
          <article-title>Private record matching using differential privacy</article-title>
          ,
          <source>in: EDBT</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shokri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stronati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Shmatikov</surname>
          </string-name>
          ,
          <article-title>Membership inference attacks against machine learning models</article-title>
          ,
          <source>in: IEEE SP</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cormode</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muthukrishnan</surname>
          </string-name>
          ,
          <article-title>Approximating data with the count-min sketch</article-title>
          ,
          <source>IEEE Software</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cormode</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <article-title>A unifying framework for L0-sampling algorithms</article-title>
          ,
          <source>Distributed and Parallel Databases</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Franke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sehili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          ,
          <article-title>PRIMAT: A toolbox for fast privacy-preserving matching</article-title>
          ,
          <source>PVLDB</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Carriere-Swallow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Haksar</surname>
          </string-name>
          ,
          <article-title>The economics and implications of data: an integrated perspective</article-title>
          ,
          <source>International Monetary Fund</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>F.</given-names>
            <surname>Geerts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mecca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <article-title>Cleaning data with Llunatic</article-title>
          ,
          <source>The VLDB Journal</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitoura</surname>
          </string-name>
          ,
          <article-title>Social-minded measures of data quality: fairness, diversity, and lack of bias</article-title>
          ,
          <source>ACM JDIQ</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tanca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Torlone</surname>
          </string-name>
          ,
          <article-title>Ethical dimensions for data quality</article-title>
          ,
          <source>ACM JDIQ</source>
          (
          <year>2020</year>
          ). doi: 10.1145/3362121.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Shapley</surname>
          </string-name>
          ,
          <article-title>Quota solutions of n-person games</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>1952</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Davidson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <article-title>Provenance and scientific workflows: challenges and opportunities</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Watson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gultchin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Taly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Floridi</surname>
          </string-name>
          ,
          <article-title>Local explanations via necessity and sufficiency: Unifying theory and practice</article-title>
          ,
          <source>in: Uncertainty in Artificial Intelligence</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>T.</given-names>
            <surname>Teofili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Martello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Effective explanations for entity resolution models</article-title>
          ,
          <source>in: ICDE</source>
          ,
          <year>2022</year>
          . doi: 10.1109/ICDE53745.2022.00248.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>