<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Accountable Data Analytics Start with Accountable Data: The LiQuID Metadata Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sarah Oppold</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Melanie Herschel</string-name>
          <email>Melanie.Herschel@ipvs.uni-stuttgart.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National University of Singapore</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sarah.Oppold</institution>
          ,
          <addr-line>Melanie.Herschel</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Stuttgart</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>59</fpage>
      <lpage>72</lpage>
      <abstract>
        <p>Insights based on data are omnipresent. However, in particular in modern data analytics applications, information about the underlying data often remains obscure, hindering accountable data analytics. Recent efforts have been put into better describing such data based on metadata, similarly to what has been done in various scientific disciplines for transparent and reproducible research. Based on a detailed study of various metadata standards and proposals, we observe that existing metadata models do not yet sufficiently cover information that is relevant for data accountability. To fill this gap, this paper proposes LiQuID, a novel metadata model to make datasets accountable throughout their life cycle. It is more general than existing metadata models, which can be mapped to LiQuID. We validate LiQuID for the purpose of dataset accountability based on a real-world workload we created.</p>
      </abstract>
      <kwd-group>
        <kwd>Metadata Model</kwd>
        <kwd>Accountability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Data underlie various insights and decisions today, e.g., for dating
recommendations, marketing decisions, scientific findings, or responses to pandemics such as
COVID-19. The result of analyzing these data potentially influences various
aspects of people's lives. Unfortunately, the development of data analysis pipelines
that rely on data is prone to errors. Even though developers of such pipelines
may have the best intentions, mistakes are likely to occur [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To understand
and account for decisions or insights drawn from data, an important aspect is to
account for the underlying data itself, which includes being transparent about
the creation, handling, purpose, and meaning of the data.
      </p>
      <p>
        Not being aware of properties or intended purpose of data and (possibly
inadvertently) mishandling and misinterpreting the data as a consequence can have
significant repercussions. One example is the introduction of discrimination into
decision support systems, as could be observed with the recidivism prediction
system COMPAS [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Another example arises in the COVID-19 pandemic, where
lots of data have been shared in a world-wide effort to gain insights. However, as
sites like Our World in Data point out, caution has to be applied when
reading reported numbers, as it is often unclear what they mean. For instance, are
numbers of tests counted as swabs or as individuals tested? What is the source of
symptoms reported with cases? Was publishing the data rightful? The datasets
would clearly benefit from accompanying descriptive data, i.e., metadata, to
answer such questions.
      </p>
      <p>
        More generally, governments, ethical review boards, scientists, engineers,
policy makers, and many more stakeholders need to assess and scrutinize data, e.g.,
to determine the appropriateness of the data for their purpose or to ensure that
data are used correctly, ethically, and lawfully. Information pertinent to this
assessment is not included in the data itself; it needs to be provided alongside the
data as metadata. This information makes datasets more transparent, and can
serve as evidence to verify compliance or appropriateness of data with respect
to rules or requirements [
        <xref ref-type="bibr" rid="ref13 ref16">13, 16</xref>
        ]. Thereby we obtain accountable datasets, which
we understand as follows in this paper: Accountable datasets are datasets about
which there is sufficient information to justify and explain the actions on these
datasets to a forum of persons, in addition to descriptive information and
information on the people responsible for them. In this paper, we propose to convey the
necessary information in the form of metadata. This information enables dataset
accountability, where all persons responsible for a dataset, i.e., all persons who
have been involved in the life cycle of the dataset, must justify and explain their
actions on the dataset with respect to a set of rules, e.g., laws, contracts, or moral
rules, to a forum of persons in authority. Our notion of dataset accountability
goes beyond information accountability [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], which focuses on the appropriate
use of data, leaving out all other steps in the life cycle of a dataset such as its
creation or maintenance. It also complements algorithmic accountability [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],
which is about the justification of entire algorithmic systems.
      </p>
      <p>Clearly, the metadata for accountable datasets are very diverse and broad.
They cover all phases of the life cycle of a dataset (including data collection,
processing, maintenance, and usage) and address different aspects (e.g., meaning,
purpose, responsible parties, or ethical considerations). While it is possible to
obtain some of the necessary information when "releasing" the data for further
use, some pieces of information such as design decisions or responsibilities require
collection along the dataset generation process or even prior to its start. Planning
in advance what information should be gathered and incorporating this into the
design process is therefore beneficial for holistically accountable datasets.</p>
      <p>This paper presents a metadata model for accountable datasets that gives a
clear structure on what information is possibly relevant and provides guidance on
what questions to consider when handling datasets. It is systematically designed
along two dimensions: the first dimension models the different phases of the data
life cycle, while the second dimension models essential questions (how, what,
why, etc.) that can be asked about each phase. The information answering each
question in each life cycle step is structured following five key fields or attributes.
Overall, the metadata model, which we call LiQuID (the name refers to the
modeled Life cycle steps, Questions, and Information about Data), is defined such
that it can accompany any dataset, e.g., from initial data sources to datasets
resulting from complex processing.</p>
      <p>
        There are plenty of existing, highly domain-specific metadata models that
can be considered candidates for supporting dataset accountability, as they have
been established to make items of interest and corresponding metadata, e.g.,
findable, accessible, interoperable, reusable, and repeatable [
        <xref ref-type="bibr" rid="ref1 ref11 ref14 ref4">1, 4, 11, 14</xref>
        ], also known
as the FAIR principles [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Focusing on datasets as particular items of interest, [
        <xref ref-type="bibr" rid="ref3 ref8 ref9">3,
8, 9</xref>
        ] can be considered emerging approaches towards metadata models for
accountable datasets. However, these have been defined in a rather ad-hoc fashion.
We show in this paper that LiQuID generalizes the aforementioned existing
metadata models. A detailed study of how existing models fit into our
metadata model demonstrates both the appropriateness of LiQuID and the gaps in
existing models in terms of dataset accountability.
      </p>
      <p>To be of practical use, it is important that a metadata model for accountable
datasets covers the metadata necessary for typical questions that arise when
datasets are evaluated or verified. Therefore, we determine a real-world query
workload on accountable data, based on analyzing audit literature, the GDPR,
and conducting an expert survey. We observe that LiQuID is the only model we
are aware of that can answer all queries of the workload, validating LiQuID's
completeness. We further see that the workload requires a substantial fraction
(75%) of the fields modeled by LiQuID, indicating its conciseness. No other
metadata model can fully handle the workload, and 10% of the fields required by
the workload are not present in any considered existing metadata model.</p>
      <p>In summary, we make the following contributions: (Section 2) a novel
metadata model for accountable datasets, called LiQuID; (Section 3) a detailed
analysis of existing metadata models with respect to dataset accountability that
demonstrates both the appropriateness of our model and the gaps in existing
models for accountable data; and (Section 4) a real-world data accountability
query workload which we use to validate the completeness of LiQuID.</p>
    </sec>
    <sec id="sec-liquid">
      <title>LiQuID: a metadata model for accountable datasets</title>
      <p>
        This section presents LiQuID, for which we set the following requirements:
1. Holistic view: The metadata model covers the whole life cycle of a dataset.
2. Systematic structure: A systematic structure offers clear guidance on
what information is potentially relevant.
3. Accountability: Following our notion of dataset accountability, the
metadata model should (i) include information on responsible entities (e.g.,
creators, dataset managers) who can be held responsible for the handling of
the data, as well as (ii) leave room for explanations and justifications in
anticipation of an accountability discussion.
4. Extension: The metadata model builds on existing and time-tested
approaches, maintaining and supporting features that have proven to be
important (e.g., type descriptions, ontologies, FAIR principles [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]).
      </p>
      <p>After describing the general hierarchical metadata model in Section 2.1, we
provide selected details in Section 2.2. Section 2.3 discusses how LiQuID meets
the requirements mentioned above.</p>
      <p>Figure 1 summarizes our metadata model for accountable data. It is a
hierarchical model comprising three levels: the top level models the different data life
cycle steps; we therefore call it the lifecycle level. For each such step, the second
level, named question level, structures metadata according to questions relevant
for accountable data. The information level at the leaf level models the actual
information per life cycle step and question. Note that by extending a general
metadata interface, each element of the metadata model is uniquely identified
and may have multiple versions. A full XSD is also available on our project
website (https://www.ipvs.uni-stuttgart.de/departments/de/research/projects/fat dss/).</p>
      <p>Life cycle level. In the life cycle level, we consider four essential steps in the life
cycle of a dataset. The first step is data collection, which relates to the creation,
gathering, or capture of the data. Data collection typically involves manual entry
or gathering, automatic capture of data produced through various processes, or
the acquisition of third party data. The data processing step covers all data
manipulations that have altered or transformed the data. Frequently applied data
manipulations during preprocessing include data standardization, data
cleaning, or aggregation. Under data maintenance, we understand the handling of the
dataset once it has been released for further use. This includes a wide range of
data management operations, e.g., updates, additions, deletions of (some) data,
its archival, or destruction. Finally, data usage takes into account past, present,
and anticipated activities supported by or applied on the dataset, e.g., input
to machine learning algorithms or distribution to other parties. Although
sometimes seen as separate data life cycle steps, note that we consider information
on data storage and distribution as part of the information on data usage, as
they are essential information when the need to account for proper use of data
arises and are thus jointly queried with the data usage information.</p>
      <p>Question level. The second level structures the information that LiQuID covers
for every step of the life cycle by commonly used WH-questions: Why?, Who?,
When?, Where?, How?, What?. This categorization follows the general human
rationale of asking for information about an entity of interest. While this may
appear simplistic, we believe this simplicity makes LiQuID easy to understand
and use. We show later that this structure actually covers all the information
contained in other metadata models - and more, providing evidence that this
intuitive model is nevertheless effective in covering the required information.</p>
      <p>Information level. While the first two levels essentially serve to contextualize
the information to be provided for accountable data, the information level
organizes the information needed for each life cycle step and question in five fields.
The first field is a description that answers a WH-question for a data life cycle
step. In order to invite explanations and justifications, which are essential for
accountability, the information level additionally models fields for explanation,
legal considerations, and ethical considerations, as well as limitations of the answer.</p>
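      <p>To make the three levels concrete, the following fragment sketches what a
LiQuID metadata sheet could look like as an XML instance. This is a minimal
illustrative sketch under naming assumptions of our own: the element names follow
the paths used in the query of Algorithm 1 (Section 4), and the id and version
attributes stand in for the general metadata interface; the authoritative definition
is the XSD published on our project website.</p>
      <preformat>&lt;DataMDS id="mds-1" version="1"&gt;
  &lt;!-- life cycle level: one element per life cycle step --&gt;
  &lt;DataCollection id="c1" version="1"&gt;
    &lt;!-- question level: one element per WH-question --&gt;
    &lt;Who id="c1-who" version="1"&gt;
      &lt;!-- information level: the five fields per step and question --&gt;
      &lt;Information&gt;
        &lt;Description&gt;...&lt;/Description&gt;
        &lt;Explanation&gt;...&lt;/Explanation&gt;
        &lt;LegalConsiderations&gt;...&lt;/LegalConsiderations&gt;
        &lt;EthicalConsiderations&gt;...&lt;/EthicalConsiderations&gt;
        &lt;Limitations&gt;...&lt;/Limitations&gt;
      &lt;/Information&gt;
    &lt;/Who&gt;
    &lt;!-- Why, When, Where, How, and What are structured analogously --&gt;
  &lt;/DataCollection&gt;
  &lt;!-- DataProcessing, DataMaintenance, and DataUsage are structured analogously --&gt;
&lt;/DataMDS&gt;</preformat>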
    </sec>
    <sec id="sec-2">
      <title>Information details</title>
      <p>To get a better understanding of what information our metadata model covers,
Table 2 summarizes what is understood as relevant information for three
combinations of a life cycle step S and a question Q, denoted as S:Q. An exhaustive
description for all combinations is available on our project website. As the
examples in Figure 2 show, we associate a list of questions with each field of each
S:Q combination. These are intended to help populate the metadata sheets.</p>
      <p>In the subsequent discussion, we focus on the questions to consider when
filling out the information relating to collection:who. We make up a simplistic
example to illustrate the potential content of each field. The example considers
a hospital that collects case numbers for a particular disease.</p>
      <p>- Description: Considering the question Who? during data collection, the
description includes information on who (people, organizations) was involved in
the data collection process. It further encompasses any information relevant
to their identification, their role in the data collection process, information
necessary to assess their qualifications to fulfill this role, and any details
about these people or organizations that may impact the data. In our
example, we would report the hospital and the head of the service responsible for
collecting accurate numbers.
- Explanation: As part of the explanation, a justification on why these
particular people were involved in the data collection process can be provided.
Continuing our example, we explain that this hospital is collecting these
numbers as they are the only medical facility to treat the disease in a larger
area. The responsible person is justified by her job description.</p>
      <table-wrap id="fig2">
        <label>Fig. 2</label>
        <caption>
          <p>Example questions associated with the information fields of three S:Q combinations.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th/>
              <th>collection:who</th>
              <th>processing:how</th>
              <th>usage:where</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Description</td>
              <td>Who (people, organizations) was involved in data collection? Provide all information relevant to their identification, their role in data collection, all information necessary to assess their qualifications to fulfill this role, and all characteristics which could have an influence on the data set.</td>
              <td>What was the methodology/procedure for data processing? Which methods and tools were used in each step and what was the (technical) environment?</td>
              <td>Where is the data set published/available? Where (place, geographically) can the published data set be used?</td>
            </tr>
            <tr>
              <td>Explanation</td>
              <td>Why were these particular entities involved in data collection?</td>
              <td>Why was the data processed using this particular methodology/procedure, methods, tools, and (technical) environment?</td>
              <td>Why is the published data set made available at this place? Why can the published data set be used at this place?</td>
            </tr>
            <tr>
              <td>Legal consid.</td>
              <td>Why was it lawful that these people participated in data collection?</td>
              <td>Why was it lawful to process the data using this methodology/procedure, methods, tools, and (technical) environment?</td>
              <td>Why is it lawful to publish the data set at this place? Why is it lawful to use the published data set at this place?</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>- Legal Considerations: To verify that the entities involved in the data
collection had the right to do so, legal considerations recording why it was lawful
that these people were involved in the data collection process are included in
the metadata sheet. In our example, this includes an acknowledgement that
the hospital is legally allowed to collect these data, e.g., based on disease
control regulations.
- Ethical Considerations: We also consider ethical questions, asking why
it was ethically justifiable that these people were involved in the data
collection. For instance, if the hospital receives funding depending on the number
of cases, has a conflict of interest been ruled out?
- Limitations: Finally, the metadata model offers the possibility to clarify
(i) what limitations in the data set could result from the selection of persons
involved in the data collection (based on their characteristics or
qualifications available in the description); (ii) what limitations for the overall
objective (Why?) could result from the choice of people; (iii) what efforts
have been made to mitigate the identified limitations; or (iv) why there are
no limitations. In our example, a limitation is that the data may lag behind
the actual situation given internal processes at the hospital. But mechanisms
have been put in place to not lag behind by more than 24 hours.</p>
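      <p>Putting the five fields together, a populated metadata sheet for collection:who
in the hospital example could look as follows. The element names are again our own
illustrative assumptions (following the paths of Algorithm 1), and the field contents
abbreviate the discussion above.</p>
      <preformat>&lt;DataCollection&gt;
  &lt;Who&gt;
    &lt;Information&gt;
      &lt;Description&gt;Case numbers are collected by the hospital; the head of
        service is responsible for reporting accurate numbers.&lt;/Description&gt;
      &lt;Explanation&gt;The hospital is the only medical facility treating the
        disease in the larger area; the responsible person is justified by
        her job description.&lt;/Explanation&gt;
      &lt;LegalConsiderations&gt;The hospital is legally allowed to collect these
        data, e.g., based on disease control regulations.&lt;/LegalConsiderations&gt;
      &lt;EthicalConsiderations&gt;A conflict of interest through case-dependent
        funding has been ruled out.&lt;/EthicalConsiderations&gt;
      &lt;Limitations&gt;Reported numbers may lag behind the actual situation;
        internal mechanisms bound the lag to at most 24 hours.&lt;/Limitations&gt;
    &lt;/Information&gt;
  &lt;/Who&gt;
&lt;/DataCollection&gt;</preformat>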
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
      <p>Having introduced both requirements for our metadata model and the model
itself, we now review how LiQuID meets the requirements.</p>
      <p>First, the metadata model should offer a holistic view on a dataset,
covering its whole life cycle. This is achieved by the life cycle level of LiQuID, which
considers the essential steps of a dataset's life cycle.</p>
      <p>A systematic structure that provides guidance on what information to
consider is given by the overall hierarchical structure of LiQuID. For each data life
cycle step, it asks WH-questions, which are the self-evident human rationale for
assessment. Each question can be answered in a structured way, dictated by the
information level.</p>
      <p>Let us now review how LiQuID supports our accountability requirement.
On the one hand, accountability is supported by asking for responsible entities,
which should be described in the Who? question of each life cycle step. On
the other hand, the anticipated accountability discussion has to be modeled
without actually knowing the questions. But since the questions can
be expected to be critical inquiries of the decisions made in the different life cycle
steps, the metadata model encourages thinking about such critical questions and
leaves room for responses by providing the information fields for explanation,
legal considerations, ethical considerations, and limitations.</p>
      <p>Finally, the metadata model should be compatible with existing metadata
models by extending these. As we will discuss in detail in Section 3, LiQuID
generalizes existing models, which can be mapped into our metadata model.
We also assume that details modeled by well-established standards, definitions,
and ontologies are "docked" at the information level, i.e., each field modeled at
the information level contains further structured elements that are application
dependent.</p>
      <sec id="sec-3-1">
        <title>Comparative assessment</title>
        <p>This section compares our metadata model for accountable data with
established, time-tested, and revised metadata models used in various disciplines.
Even though their subject of interest and purpose differ from our metadata
model for accountable datasets, they implicitly represent accumulated knowledge
of what information is deemed important to describe some subject of interest.
More specifically, we map nine existing metadata models to LiQuID. To this end,
any field specified by an existing model is mapped to the corresponding field(s)
in LiQuID. Note that we obtain a complete mapping, in the sense that we could
map all information modeled by an existing metadata model to LiQuID.</p>
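        <p>To illustrate what such a mapping looks like, the fragment below sketches how
individual Dublin Core terms could be assigned to LiQuID fields, using S:Q:D
coordinates (life cycle step, question, detail field). The mapping notation and the
concrete assignments are illustrative assumptions of ours, not part of either model.</p>
        <preformat>&lt;mapping model="DC"&gt;
  &lt;!-- one entry per source field, pointing to a LiQuID S:Q:D coordinate --&gt;
  &lt;entry source="dcterms:creator" target="collection:who:description"/&gt;
  &lt;entry source="dcterms:created" target="collection:when:description"/&gt;
  &lt;entry source="dcterms:rights"  target="usage:what:legal-considerations"/&gt;
&lt;/mapping&gt;</preformat>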
        <p>
          For our comparative assessment, we choose metadata models with varying
specificity and from various domains. These include two general models [
          <xref ref-type="bibr" rid="ref15 ref5">5, 15</xref>
          ],
four standards to describe an item of interest arising in various domains [
          <xref ref-type="bibr" rid="ref1 ref11 ref14 ref4">1, 11, 4,
14</xref>
          ], and three emerging metadata models for fair, accountable, and transparent
datasets [
          <xref ref-type="bibr" rid="ref3 ref8 ref9">3, 8, 9</xref>
          ].
        </p>
        <p>
          - Dublin Core (DC) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], one very general metadata model we consider, is a
conceptual generic model often used as a base for other models;
- W3C PROV (PROV) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], another very general metadata model,
focuses on describing the lineage of some end product;
- Describing Archives: A Content Standard (DACS) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] specifies
archiving principles and a metadata model for (aggregations of) archival records
on, e.g., books, reports, or movies;
- Access to Biological Collection Data (ABCD) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], a metadata model
implementation for biological sample collections;
- Observations and Measurements (OM) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] emerged from the Open
Geospatial Consortium and defines a general conceptual schema for
observations and measurements as well as sampling details;
- Data Documentation Initiative Lifecycle (DDI-L) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], a metadata
implementation describing (groups of) social studies based on questionnaires;
- Datasheets for Datasets (DS) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] document (personal) datasets,
allowing them to be examined for new machine learning applications within the
context of fair machine learning;
- Data Nutrition Labels (DNL) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] provide automatically generated,
modular labels that describe datasets and are intended to enable accountable AI;
- Data Statements for NLP (DNLP) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] describe spoken or written texts
in order to enable fair natural language processing.
        </p>
        <p>To determine to what extent the existing metadata models cover our model,
we study the existing models in detail and map the entries they specify to
LiQuID. Figure 3 depicts a visualization of this mapping. The columns reflect the
leaves of our hierarchical model (i.e., each column corresponds to an information
field under a question and life cycle step, information fields being in the order
listed in Section 2.2). Rows filled with colors represent the metadata models
listed above. Rows with labels filled with dark color aggregate a group of
metadata models. A colored cell indicates that there is at least one specified entry in
the metadata model (row) which corresponds to the respective information field
in this metadata model (column). Different colors are used to distinguish the
different life cycle steps to enhance readability. We color the fields generously,
which means fields (i) with only little information on the respective combination
of life cycle step S, question Q, and detail D, denoted S:Q:D, or (ii) not
explicitly meant but amenable for the specific S:Q:D have been colored. If a field is
left blank, this indicates that there is no entry in the metadata model of the
row ("notes" or "additional comments" set aside) that corresponds to the S:Q:D
identified by the column. At the end of each row, we also provide a coverage
percentage, calculated as the number of details (cells) covered by a metadata
model, divided by the number of detailed fields in LiQuID.</p>
        <p>Interestingly, Figure 3 shows that both general metadata models cover about
30% of LiQuID. Even combined, they only cover 51.7%. Both standards contain
few fields, some of them too general to be mapped to specific LiQuID fields.</p>
        <p>Figure 3 shows that the lowest coverage of 9% is achieved by OM and DNLP.
The low coverage of OM can be explained, since the standard describes geological
specimens for which an accountability discussion is unlikely. Additionally, these
specimens typically do not undergo the processing and maintenance life cycle steps.</p>
        <p>
          More interestingly, we observe that while DNL and DS have higher coverage
than DNLP, the coverage of these metadata models, which have been proposed
with accountability use cases in mind, is generally low. Aggregating them still
only covers around 31% of the details considered in LiQuID. This clearly shows
that while the proposed metadata models may serve well the specific application
they were engineered for (e.g., information for developers of machine learning
pipelines [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]), they do not provide a general metadata model for accountable
datasets. This is further validated once we consider a query workload over
accountable datasets in Section 4.
        </p>
        <p>Focusing on the domain-specific metadata models, we see that their
coverage varies highly between 9% (OM) and 75% (DDI-L). While their individual
coverage may only be moderate, we observe that they cover different details to
a different degree. Indeed, the standards complement each other, as shown by
the aggregated coverage of 82.5% for this group.</p>
        <p>The high combined coverage of 90.8% across all considered models shows that
many fields included in LiQuID are already deemed important by existing metadata
models. However, the systematic structure of LiQuID also reveals "blind spots", as it
includes additional fields where information is still missing from any of the considered
metadata models. For instance, we note that while all data life cycle phases are
considered, data maintenance is covered less. However, it is reasonable to assume
that accountability questions on data maintenance arise, for example when
personal data has to be corrected or deleted due to an opt-out of a data subject.
Looking at the question level, the Why? question is the least covered element,
which is surprising since the management of data should ideally have a goal.
Finally, at the information level, both explanations and ethical considerations
are scarcely covered by the considered existing data models.</p>
        <p>In summary, we observe that existing standards and emerging data models
with accountability use cases in mind can all be fully mapped to our metadata
model for accountable datasets. The converse does not hold, as LiQuID is not
covered 100% by any considered data model. To understand how relevant the
information that LiQuID covers is for dataset accountability, we determine an
accountability workload and study which information it actually queries.</p>
      </sec>
      <sec id="sec-3-2">
        <title>A query workload over accountable datasets</title>
        <p>Ideally, a predefined benchmark would be used in order to assess the metadata
model objectively. However, we are not aware of any benchmark considering
accountability by including a set of questions or queries which are realistic in dataset
accountability scenarios. We therefore contribute a first such benchmark by
creating a workload of queries on accountable datasets and then assess LiQuID with
respect to this workload. We first introduce our methodology to create the
workload in Section 4.1. Section 4.2 then discusses how LiQuID fits this workload.
The full workload is also available on our project website.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Creating the workload</title>
      <sec id="sec-5">
        <title>Sources of real-world accountability questions</title>
        <p>To determine a realistic
workload of queries on metadata models for accountable datasets, sources are
needed that describe existing practices, regulations, and questions that arise in
settings requiring accountability with respect to data.</p>
      <p>
        We identify three such sources. Our first source comes from the Federal Trade
Commission (FTC) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and establishes a list of guidelines or statements relating
to accountability as part of an in-depth study on data brokers. Data brokers
collect personal data about individuals from different sources and sell these data
to companies. In an effort to create more transparency on data brokers, the
FTC made data brokers answer questions, which they provide in their report.
This shows how regulators conduct real audits, and the report includes 101
statements relating to accountability. One sample statement asks to "Provide a list
and description as to the nature and purpose of all the products and services
(both online and offline) that the Company offers or sells that use personal data.
Include a separate description of each product or service identified[.]".
      </p>
      <p>
        Second, we consider regulations from the General Data Protection
Regulation (GDPR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which aims at protecting personal data. It is one of the most
restrictive data protection regulations and focuses on data processing; we
therefore expect it to be a tough test for metadata models for accountable datasets.
Beyond clarifying what data protection regulators will test for, it also takes into
account data subjects, who have the right to contest who uses data about them.
From the GDPR regulations, we derive possible questions that aim at verifying
the regulations. As an example, consider the following regulation from GDPR
Article 3(2): "This Regulation applies to the processing of personal data of data
subjects who are in the Union by a controller or processor not established in the
Union, where the processing activities are related to: (a) the offering of goods
or services, irrespective of whether a payment of the data subject is required, to
such data subjects in the Union; or (b) the monitoring of their behaviour as far
as their behaviour takes place within the Union." We associate this with the
following accountability questions that may be asked when verifying if and how a
regulation applies: "Are the data subjects of the personal data you process in the
European Union? Are the personal data processing activities related to the offering
of goods or services to data subjects in the European Union? Are the personal
data processing activities related to the monitoring of data subjects' behaviour that
takes place in the European Union?"
      </p>
      <p>Lastly, we conducted an expert survey in order to determine questions asked
in dataset assessment deemed relevant by experts. This expert survey extends
beyond personal data, which is the focus of both the FTC and the GDPR. Ten experts
from various domains (including librarians, data management experts, doctors,
and social scientists) participated in the survey. They explained what criteria are
important to them when they assess a dataset and what questions they would
ask to assess these criteria in the different data life cycle phases.</p>
      <p>Overall, from these sources, we obtain 183 textual descriptions of what
information is relevant in real-world data accountability scenarios.</p>
      <p>From textual descriptions to structured queries. In a next step, we
determine a query language that allows us to query the data corresponding to the
183 textual descriptions, assuming the data are represented hierarchically (as in
our metadata model). Given the textual descriptions, we observe that the query
language needs to support different constructs, in particular, conditions,
comparisons, for-loops, and the use of equality constraints. Given these requirements,
we opt to use XQuery as the query language, as it meets all requirements.</p>
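      <p>For instance, the first of the accountability questions derived from GDPR
Article 3(2) above could be phrased as the following XQuery expression. This is a
sketch only: the paths and field names (e.g., Type, Location) are hypothetical
assumptions of ours, following the same naming conventions as Algorithm 1 below.</p>
      <preformat>(: Are the data subjects of the personal data you process in the EU? :)
some $s in MDSStore/DataMDS/DataCollection/Who/Information/Description
satisfies $s/Type = "Data subject"
      and $s/Location = "European Union"</preformat>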
      <p>Following the questions and statements from the three sources, we derive
XQuery queries, defined over an XML Schema that follows our metadata model. That
is, when writing the queries, we determine from which fields of our model the
relevant information can reasonably be retrieved. Note that we do not claim that
our queries cover all possibilities. Also, note that our queries are the result of a
best-effort approach to resolve ambiguities in questions or statements. To
simplify our queries, we assume further elements nested under the elements defined
by our metadata model that structure the data. In practice, such elements may
be the result of a domain-specific ontology for different accountability use cases,
as supported by our extension requirement.</p>
      <p>For example, Algorithm 1 shows the XQuery that translates the FTC
statement provided above (repeated in the algorithm's header for convenience). The
color coding indicates semantic correspondences between the text and the query.
First, the query identifies (who?) the Company named "myCompany" who acts
as "Service provider" by offering or selling products and services. It further
checks that the company uses (what?) personal data (why?) to include in their
products or services. When all these conditions are met, we return the
description of the identified product or service, assuming it includes a name and a
description. We additionally return the purpose of the product or service that
processes the personal data.</p>
      <p>Algorithm 1: XQuery example derived from the FTC statement
"Provide a list and description as to the nature and purpose of all the products
and services (both online and offline) that the Company offers or sells
that use personal data. Include a separate description of each product or
service identified[.]"</p>
      <preformat>for $x in MDSStore/DataMDS/DataUsage/Who/Information
where $x/Description/Name = "myCompany"
  and $x/Description/Type = "Service provider"
  and ($x/../../Why/Information/Description/Type = "Product" or
       $x/../../Why/Information/Description/Type = "Service")
  and $x/../../What/Information/Description/Type = "Personal data"
return
  &lt;Result&gt;
    &lt;Name&gt;{$x/../../Why/Information/Description/Name}&lt;/Name&gt;
    &lt;Description&gt;{$x/../../Why/Information/Description/Desc}&lt;/Description&gt;
    &lt;Purpose&gt;{$x/../../Why/Information/Explanation/Purpose}&lt;/Purpose&gt;
  &lt;/Result&gt;</preformat>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Evaluation of the metadata model w.r.t. the workload</title>
      <p>This section studies how our metadata model supports the real-world
accountability workload obtained as described in the previous section.</p>
      <p>From the 183 textual statements and questions, we were able to express
97% with our metadata model. The remaining 3% are statements that refer
to (i) information on why some measure was not taken, which is not modeled
because it did not happen, or (ii) unintentional data manipulations, which would
be intentional as soon as they are modeled in the metadata.</p>
      <p>Next, we study which fields of our metadata model are covered by the queries
of our workload. Figure 4 shows the coverage of queries based on FTC, GDPR,
and the expert survey, as well as the coverage when unifying all queries (row
marked in black; ignore the flags for now). The visualization is
analogous to the visualization in Figure 3. A field is colored when at least one
query of the workload refers to it, and coverage is the number of fields referred
to by a workload, divided by the 120 fields available in our metadata model.</p>
      <p>First, we observe that the coverage of workloads from different sources varies
between 43.3% for GDPR and 52.2% for the expert survey. However, the
workloads complement each other and, when combined, access 75% of the fields in
our metadata model. While the necessity of 30 fields of our metadata model is
not demonstrated by the workload, we clearly see that a substantial number of
fields not covered by any metadata model devised with accountability scenarios
in mind are relevant in our workload (cf. Figure 3).</p>
      <p>Among the fields accessed when combining all three workloads, we see that
9 of these fields (flagged fields among the black fields in Figure 4) are among
the 11 fields not covered by any other considered metadata model (left white in
Figure 3). This validates that our systematic structure and approach in defining
the metadata model has contributed to identifying relevant fields not considered
by other metadata models.</p>
      <p>Finally, assuming that any field that is either accessed by our real-world
accountability workload or has been defined by a previous metadata model is
relevant, we see that 94.2% of the fields modeled by LiQuID are relevant.</p>
      <p>In conclusion, our study of how LiQuID relates to related work and real-world
workloads demonstrates that our metadata model successfully covers a wide
range of accountability queries and generalizes existing metadata models well,
indicating the completeness of the proposed metadata model. At the same time,
it is sufficiently concise, as it does not model a significant amount of information
whose relevance still needs to be demonstrated.</p>
      <sec id="sec-6-1">
        <title>Conclusion and Outlook</title>
        <p>To summarize, we presented a novel metadata model for accountable datasets.
It hierarchically models information relevant in scenarios requiring dataset
accountability, covering different steps of the data life cycle and various questions
arising at each step, and structuring the answers based on five attributes. We
presented a detailed review of metadata models that can be considered
candidates to enable accountable datasets. We observed that our metadata model can
fully cover these, while being more general by modeling additional information.
That this additional information is indeed relevant for accountable datasets is
validated based on a real-world workload of queries arising in dataset
accountability scenarios. Overall, our metadata model is the first model we are aware of
that is rich enough to answer all queries of the defined workload.</p>
        <p>
          While the insights gained through the research conducted in this paper are
encouraging for using the proposed metadata model in practice, there are still
quite a few challenges to tackle as part of future research. First, as we experienced
ourselves, filling in all fields of the metadata model is a tedious and
time-consuming task. Therefore, we plan to investigate how to automatically or
semi-automatically fill fields. Another avenue of future research is the integration of
accountable datasets in a larger environment, such as a system for accountable
decision support [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. For this, the metadata about datasets needs to be linked
to metadata collected about other parts of a system to give a holistic view.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Access to Biological Collection Data task group: Access to Biological Collection Data (ABCD) (</article-title>
          <year>2007</year>
          ), http://www.tdwg.org/standards/115
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Angwin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Larson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mattu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirchner</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks</article-title>
          (
          <year>2016</year>
          ), https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bender</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Data statements for natural language processing: Toward mitigating system bias and enabling better science</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>6</volume>
          ,
          <issue>587</issue>
          -
          <fpage>604</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Data</given-names>
            <surname>Documentation</surname>
          </string-name>
          <article-title>Initiative: DDI lifecycle 3</article-title>
          .2 (
          <issue>2014</issue>
          ), https://ddialliance.org/Specification/DDI-Lifecycle/3.2/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>DCMI</given-names>
            <surname>Usage</surname>
          </string-name>
          <article-title>Board: DCMI metadata terms (</article-title>
          <year>2020</year>
          ), https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>European</given-names>
            <surname>Parliament</surname>
          </string-name>
          ,
          <article-title>Council of the European Union: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data</article-title>
          ,
          <source>and repealing Directive</source>
          <volume>95</volume>
          /46/EC (
          <article-title>General Data Protection Regulation) (</article-title>
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Federal</given-names>
            <surname>Trade</surname>
          </string-name>
          <article-title>Commission: Data brokers: A call for transparency and accountability (</article-title>
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gebru</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morgenstern</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vecchione</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaughan</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wallach</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daumee</surname>
            <given-names>III</given-names>
          </string-name>
          , H.,
          <string-name>
            <surname>Crawford</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Datasheets for datasets</article-title>
          .
          <source>In: Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning</source>
          . p.
          <volume>17</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Holland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hosny</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joseph</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chmielinski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The dataset nutrition label: A framework to drive higher data quality standards</article-title>
          .
          <source>CoRR</source>
          p.
          <volume>21</volume>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Olteanu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castillo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Diaz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiciman</surname>
          </string-name>
          , E.:
          <article-title>Social data: Biases, methodological pitfalls, and ethical boundaries</article-title>
          .
          <source>SSRN Electronic Journal</source>
          p.
          <volume>47</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Open Geospatial Consortium: Observations and
          <string-name>
            <surname>Measurements</surname>
          </string-name>
          (
          <year>2013</year>
          ), https://www.ogc.org/standards/om
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Oppold</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herschel</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A system framework for personalized and transparent data-driven decisions</article-title>
          .
          <source>In: Advanced Information Systems Engineering</source>
          . p.
          <fpage>16</fpage>
          . Springer International Publishing (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cobbe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norval</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Decision provenance: Harnessing data flow for accountable systems</article-title>
          .
          <source>IEEE Access 7</source>
          ,
          <issue>6562</issue>
          -
          <fpage>6574</fpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. The Society of American Archivists:
          <article-title>Describing archives: A content standard (</article-title>
          <year>2019</year>
          ), https://saa-ts-dacs.github.io/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. W3C Working Group:
          <article-title>An overview of the PROV family of documents (</article-title>
          <year>2013</year>
          ), https://www.w3.org/TR/prov-overview/
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Weitzner</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abelson</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feigenbaum</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sussman</surname>
            ,
            <given-names>G.J.:</given-names>
          </string-name>
          <article-title>Information accountability</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>51</volume>
          (
          <issue>6</issue>
          ),
          <volume>82</volume>
          -
          <fpage>87</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Wieringa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>What to account for when accounting for algorithms: A systematic literature review on algorithmic accountability</article-title>
          . In: Fairness, Accountability, and Transparency. p.
          <volume>1</volume>
          -
          <issue>18</issue>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.D.</surname>
          </string-name>
          , et al.:
          <article-title>The FAIR guiding principles for scientific data management and stewardship</article-title>
          .
          <source>Scientific Data</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ) (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>