Judith Michael, Victoria Torres (eds.): ER Forum, Demo and Posters 2020

Accountable Data Analytics Start with Accountable Data: The LiQuID Metadata Model

Sarah Oppold1 and Melanie Herschel2,1
1 University of Stuttgart, Germany
2 National University of Singapore, Singapore
{Sarah.Oppold, Melanie.Herschel}@ipvs.uni-stuttgart.de

Abstract. Insights based on data are omnipresent. However, in particular in modern data analytics applications, information about the underlying data often remains obscure, hindering accountable data analytics. Recent efforts have been put into better describing such data based on metadata, similarly to what has been done in various scientific disciplines for transparent and reproducible research. Based on a detailed study of various metadata standards and proposals, we observe that existing metadata models do not yet sufficiently cover information that is relevant for data accountability. To fill this gap, this paper proposes LiQuID, a novel metadata model to make datasets accountable throughout their life cycle. It is more general than existing metadata models, which can be mapped to LiQuID. We validate LiQuID for the purpose of dataset accountability based on a real-world workload we created.

Keywords: Metadata Model · Accountability

1 Introduction

Data underlie various insights and decisions today, e.g., for dating recommendations, marketing decisions, scientific findings, or responses to pandemics such as COVID-19. The result of analyzing these data potentially influences various aspects of people's lives. Unfortunately, the development of data analysis pipelines that rely on data is prone to errors. Even though developers of such pipelines may have the best intentions, mistakes are likely to occur [10].
To understand and account for decisions or insights drawn from data, an important aspect is to account for the underlying data itself, which includes being transparent about the creation, handling, purpose, and meaning of the data. Not being aware of the properties or intended purpose of data and, as a consequence, (possibly inadvertently) mishandling or misinterpreting the data can have significant repercussions. One example is the introduction of discrimination into decision support systems, as could be observed with the recidivism prediction system COMPAS [2]. Another example arises in the COVID-19 pandemic, where lots of data have been shared in a world-wide effort to gain insights. However, as sites like Our World in Data (https://ourworldindata.org/coronavirus) point out, caution has to be applied when reading reported numbers, as it is often unclear what they mean. For instance, are numbers of tests counted as swabs or as individuals tested? What is the source of symptoms reported with cases? Was publishing the data rightful? The datasets would clearly benefit from accompanying descriptive data, i.e., metadata, to answer such questions.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

More generally, governments, ethical review boards, scientists, engineers, policy makers, and many more stakeholders need to assess and scrutinize data, e.g., to determine the appropriateness of the data for their purpose or to ensure that data are used correctly, ethically, and lawfully. Information pertinent to this assessment is not included in the data itself; it needs to be provided alongside the data as metadata. This information makes datasets more transparent and can serve as evidence to verify compliance or appropriateness of data with respect to rules or requirements [13, 16].
Thereby we obtain accountable datasets, which we understand as follows in this paper: Accountable datasets are datasets about which there is sufficient information to justify and explain the actions on these datasets to a forum of persons, in addition to descriptive information and information on the people responsible for them. In this paper, we propose to convey the necessary information in the form of metadata. This information enables dataset accountability, where all persons responsible for a dataset, i.e., all persons who have been involved in the life cycle of the dataset, must justify and explain their actions on the dataset with respect to a set of rules, e.g., laws, contracts, or moral rules, to a forum of persons in authority. Our notion of dataset accountability goes beyond information accountability [16], which focuses on the appropriate use of data, leaving out all other steps in the life cycle of a dataset such as its creation or maintenance. It also complements algorithmic accountability [17], which is about the justification of entire algorithmic systems. Clearly, the metadata for accountable datasets are very diverse and broad. They cover all phases of the life cycle of a dataset (including data collection, processing, maintenance, and usage) and address different aspects (e.g., meaning, purpose, responsible parties, or ethical considerations). While it is possible to obtain some of the necessary information when "releasing" the data for further use, some pieces of information such as design decisions or responsibilities require collection along the dataset generation process or even prior to its start. Planning in advance what information should be gathered and incorporating this into the design process is therefore beneficial for holistically accountable datasets.
This paper presents a metadata model for accountable datasets that gives a clear structure on what information is possibly relevant and provides guidance on what questions to consider when handling datasets. It is systematically designed along two dimensions: the first dimension models the different phases of the data life cycle, while the second dimension models essential questions (how, what, why, etc.) that can be asked about each phase. The information answering each question in each life cycle step is structured following five key fields or attributes. Overall, the metadata model, which we call LiQuID (the name refers to the modeled Life cycle steps, Questions, and Information about Data), is defined such that it can accompany any dataset, e.g., from initial data sources to datasets resulting from complex processing.

There are plenty of existing, highly domain-specific metadata models that can be considered candidates for supporting dataset accountability, as they have been established to make items of interest and corresponding metadata, e.g., findable, accessible, interoperable, reusable, and repeatable [1, 4, 11, 14], also known as the FAIR principles [18]. Focusing on datasets as particular items of interest, [3, 8, 9] can be considered emerging approaches towards metadata models for accountable datasets. However, these have been defined in a rather ad-hoc fashion. We show in this paper that LiQuID generalizes the aforementioned existing metadata models. A detailed study of how existing models fit into our metadata model demonstrates both the appropriateness of LiQuID and the gaps in existing models in terms of dataset accountability. To be of practical use, it is important that a metadata model for accountable datasets covers the metadata necessary for typical questions that arise when datasets are evaluated or verified.
Therefore, we determine a real-world query workload on accountable data based on an analysis of audit literature, the GDPR, and an expert survey. We observe that LiQuID is the only model we are aware of that can answer all queries of the workload, validating LiQuID's completeness. We further see that the workload requires a substantial fraction (75%) of the fields modeled by LiQuID, indicating its conciseness. No other metadata model can fully handle the workload, and 10% of the fields required by the workload are not present in any considered existing metadata model.

In summary, we make the following contributions: (Section 2) a novel metadata model for accountable datasets, called LiQuID; (Section 3) a detailed analysis of existing metadata models with respect to dataset accountability that demonstrates both the appropriateness of our model and the gaps in existing models for accountable data; and (Section 4) a real-world data accountability query workload which we use to validate the completeness of LiQuID.

2 LiQuID: a metadata model for accountable datasets

This section presents LiQuID, for which we set the following requirements:

1. Holistic view: The metadata model covers the whole life cycle of a dataset.
2. Systematic structure: A systematic structure offers clear guidance on what information is potentially relevant.
3. Accountability: Following our notion of dataset accountability, the metadata model should (i) include information on responsible entities (e.g., creators, dataset managers) who can be held responsible for the handling of the data, as well as (ii) leave room for explanations and justifications in anticipation of an accountability discussion.
4. Extension: The metadata model builds on existing and time-tested approaches, maintaining and supporting features that have proven to be important (e.g., type descriptions, ontologies, FAIR principles [18]).
After describing the general hierarchical metadata model in Section 2.1, we provide selected details in Section 2.2. Section 2.3 discusses how LiQuID meets the requirements mentioned above.

Fig. 1. Overview of LiQuID, showing the different levels of the hierarchical model, namely the life cycle level, question level, and information level.

2.1 Metadata model overview

Figure 1 summarizes our metadata model for accountable data. It is a hierarchical model comprising three levels: the top level models the different data life cycle steps, so we call it the life cycle level. For each such step, the second level, named the question level, structures metadata according to questions relevant for accountable data. The information level at the leaf level models the actual information per life cycle step and question. Note that by extending a general metadata interface, each element of the metadata model is uniquely identified and may have multiple versions. A full XSD is also available on our project website.5

Life cycle level. In the life cycle level, we consider four essential steps in the life cycle of a dataset. The first step is data collection, which relates to the creation, gathering, or capture of the data. Data collection typically involves manual entry or gathering, automatic capture of data produced through various processes, or the acquisition of third-party data. The data processing step covers all data manipulations that have altered or transformed the data. Frequently applied data manipulations during preprocessing include data standardization, data cleaning, or aggregation. By data maintenance, we mean the handling of the dataset once it has been released for further use. This includes a wide range of data management operations, e.g., updates, additions, deletions of (some) data, its archival, or destruction.
Finally, data usage takes into account past, present, and anticipated activities supported by or applied on the dataset, e.g., input to machine learning algorithms or distribution to other parties. Although sometimes seen as separate data life cycle steps, note that we consider information on data storage and distribution as part of the information on data usage, as they are essential information when the need to account for proper use of data arises and are thus jointly queried with the data usage information.

5 https://www.ipvs.uni-stuttgart.de/departments/de/research/projects/fat dss/

Question level. The second level structures the information that LiQuID covers for every step of the life cycle by commonly used WH-questions: Why?, Who?, When?, Where?, How?, What?. This categorization follows the general human rationale of asking for information about an entity of interest. While this may appear simplistic, we believe this simplicity makes LiQuID easy to understand and use. We show later that this structure actually covers all the information contained in other metadata models, and more, providing evidence that this intuitive model is nevertheless effective in covering the required information.

Information level. While the first two levels essentially serve to contextualize the information to be provided for accountable data, the information level organizes the information needed for each life cycle step and question in five fields. The first field is a description that answers a WH-question for a data life cycle step. In order to invite explanations and justifications, which are essential for accountability, the information level additionally models fields for explanation, legal considerations, ethical considerations, as well as limitations of the answer.
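The three levels described above can be enumerated programmatically. The following is an illustrative Python sketch, not the official XSD; the identifiers are our shorthand for the life cycle steps, WH-questions, and information fields named in the text.

```python
# Sketch of LiQuID's three-level hierarchy:
# 4 life cycle steps x 6 WH-questions x 5 information fields.
from itertools import product

LIFE_CYCLE_STEPS = ["collection", "processing", "maintenance", "usage"]
QUESTIONS = ["why", "who", "when", "where", "how", "what"]
INFO_FIELDS = ["description", "explanation",
               "legal_considerations", "ethical_considerations", "limitations"]

# Every leaf field is addressed as step.question.field,
# e.g., collection.who.description.
LEAF_FIELDS = [f"{s}.{q}.{f}"
               for s, q, f in product(LIFE_CYCLE_STEPS, QUESTIONS, INFO_FIELDS)]

print(len(LEAF_FIELDS))  # 120 leaf fields in total
print("collection.who.description" in LEAF_FIELDS)  # True
```

The systematic cross product is what yields the 120 fields referred to in the comparative assessment and workload evaluation below.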
2.2 Information details

To get a better understanding of what information our metadata model covers, Figure 2 summarizes what is understood as relevant information for three combinations of a life cycle step S and a question Q, denoted as S.Q. An exhaustive description for all combinations is available on our project website. As the examples in Figure 2 show, we associate a list of questions with each field of each S.Q combination. These are intended to help populate the metadata sheets. In the subsequent discussion, we focus on the questions to consider when filling out the information relating to collection.who. We use a simple, made-up example to illustrate the potential content of each field. The example considers a hospital that collects case numbers for a particular disease.

– Description: Considering the question Who? during data collection, the description includes information on who (people, organizations) was involved in the data collection process. It further encompasses any information relevant to their identification, their role in the data collection process, information necessary to assess their qualifications to fulfill this role, and any details about these people or organizations that may impact the data. In our example, we would report the hospital and the head of the service responsible for collecting accurate numbers.
– Explanation: As part of the explanation, a justification on why these particular people were involved in the data collection process can be provided. Continuing our example, we explain that this hospital is collecting these numbers as it is the only medical facility to treat the disease in a larger area. The responsible person is justified by her job description.
– Legal Considerations: To ensure that the persons involved in data collection had the right to do so, legal considerations recording why it was lawful that these people were involved in the data collection process are included in the metadata sheet. In our example, this includes an acknowledgement that the hospital is legally allowed to collect these data, e.g., based on disease control regulations.
– Ethical Considerations: We also consider ethical questions, asking why it was ethically justifiable that these people were involved in the data collection. For instance, if the hospital receives funding depending on the number of cases, has a conflict of interest been ruled out?
– Limitations: Finally, the metadata model offers the possibility to clarify (i) what limitations in the data set could result from the selection of persons involved in the data collection (based on their characteristics or qualifications available in the description); (ii) what limitations for the overall objective (Why?) could result from the choice of people; (iii) what efforts have been made to mitigate the identified limitations; or (iv) why there are no limitations. In our example, a limitation is that the data may lag behind the actual situation given internal processes at the hospital, but mechanisms have been put in place so that the lag does not exceed 24 hours.

collection.who
– Description: Who (people, organizations) was involved in data collection? Provide all information relevant to their identification, their role in data collection, all information necessary to assess their qualifications to fulfill this role, and all characteristics which could have an influence on the data set.
– Explanation: Why were these particular entities involved in data collection?
– Legal consid.: Why was it lawful that these people participated in data collection?
– Ethical consid.: Why was it ethically justifiable that these people were involved in the data collection process?
– Limitations: What general limitations in the data set could result from the selection of entities involved in the data set collection and their characteristics or qualifications? What limitations for the overall objective (Why?) could result from the choice of people? What efforts have been made to mitigate the identified problems and limitations? Why are there no limitations?

processing.how
– Description: What was the methodology/procedure for data processing? Which methods and tools were used in each step and what was the (technical) environment?
– Explanation: Why was the data processed using this particular methodology/procedure, methods, tools, and (technical) environment?
– Legal consid.: Why was it lawful to process the data using this methodology/procedure, methods, tools, and (technical) environment?
– Ethical consid.: Why was it ethically justifiable to process the data using this procedure, methods, tools, and (technical) environment?
– Limitations: What general limitations in the data set could result from this choice of methodology/procedure, methods, tools, and (technical) environment of the data processing? What limitations for the overall objective (Why?) could result from this choice? What efforts have been made to mitigate the identified problems and limitations? Why are there no limitations?

usage.where
– Description: Where is the data set published/available? Where (place, geographically) can the published data set be used?
– Explanation: Why is the published data set made available at this place? Why can the published data set be used at this place?
– Legal consid.: Why is it lawful to publish the data set at this place? Why is it lawful to use the published data set at this place?
– Ethical consid.: Why is it ethically justifiable to publish the data set at this place? Why is it ethically justifiable to use the data set at this place?
– Limitations: What general limitations for data set usage could result from the selection of place where the data set is made available? Where should the published data set not be used?

Fig. 2. Examples of information covered for different S.Q combinations.

2.3 Discussion

Having introduced both requirements for our metadata model and the model itself, we now review how LiQuID meets the requirements.
First, the metadata model should offer a holistic view on a dataset, covering its whole life cycle. This is achieved by the life cycle level of LiQuID, which considers the essential steps of a dataset's life cycle.

A systematic structure that provides guidance on what information to consider is given by the overall hierarchical structure of LiQuID. For each data life cycle step, it asks WH-questions, which follow the self-evident human rationale for assessment. Each question can be answered in a structured way, dictated by the information level.

Let us now review how LiQuID supports our accountability requirement. On the one hand, accountability is supported by asking for responsible entities, which should be described in the Who? question of each life cycle step. On the other hand, the anticipated accountability discussion has to be modeled without knowing the actual questions in advance. Since these questions can be expected to be critical inquiries into the decisions made in the different life cycle steps, the metadata model encourages thinking about such critical questions and leaves room for responses by providing the information fields for explanation, legal considerations, ethical considerations, and limitations.

Finally, the metadata model should be compatible with existing metadata models by extending these. As we will discuss in detail in Section 3, LiQuID generalizes existing models, which can be mapped into our metadata model. We also assume that details modeled by well-established standards, definitions, and ontologies are "docked" at the information level, i.e., each field modeled at the information level contains further structured elements that are application dependent.

3 Comparative assessment

This section compares our metadata model for accountable data with established, time-tested, and revised metadata models used in various disciplines.
Even though their subject of interest and purpose differ from our metadata model for accountable datasets, they implicitly represent accumulated knowledge of what information is deemed important to describe some subject of interest. More specifically, we map nine existing metadata models to LiQuID. To this end, any field specified by an existing model is mapped to the corresponding field(s) in LiQuID. Note that we obtain a complete mapping, in the sense that we could map all information modeled by an existing metadata model to LiQuID.

For our comparative assessment, we choose metadata models with varying specificity and from various domains. These include two general models [5, 15], four standards to describe an item of interest arising in various domains [1, 11, 4, 14], and three emerging metadata models for fair, accountable, and transparent datasets [3, 8, 9].

– Dublin Core (DC) [5], one very general metadata model we consider, is a conceptual generic model often used as a base for other models;
– W3C PROV (PROV) [15], another very general metadata model, focuses on describing the lineage of some end product;
– Describing Archives: A Content Standard (DACS) [14] specifies archiving principles and a metadata model for (aggregations of) archival records on, e.g., books, reports, or movies;
– Access to Biological Collection Data (ABCD) [1] is a metadata model implementation for biological sample collections;
– Observations and Measurements (OM) [11] emerged from the Open Geospatial Consortium and defines a general conceptual schema for observations and measurements as well as sampling details;
– Data Documentation Initiative Lifecycle (DDI-L) [4] is a metadata implementation describing (groups of) social studies based on questionnaires;
– Datasheets for Datasets (DS) [8] document (personal) datasets, allowing them to be examined for new machine learning applications within the context of fair machine learning;
– Data Nutrition Labels (DNL) [9] provide automatically generated modular labels that describe datasets and are intended to enable accountable AI;
– Data Statements for NLP (DNLP) [3] describe spoken or written texts in order to enable fair natural language processing.

Fig. 3. Mapping between LiQuID (columns) and other existing metadata models.

To determine to what extent the existing metadata models cover our model, we study the existing models in detail and map the entries they specify to LiQuID. Figure 3 depicts a visualization of this mapping. The columns reflect the leaves of our hierarchical model (i.e., each column corresponds to an information field under a question and life cycle step, information fields being in the order listed in Section 2.2). Rows filled with colors represent the metadata models listed above. Rows with label ∗ filled with dark color aggregate a group of metadata models. A colored cell indicates that there is at least one specified entry in the metadata model (row) which corresponds to the respective information field of LiQuID (column). Different colors are used to distinguish the different life cycle steps to enhance readability.
We color fields generously: a field is colored if the corresponding model (i) provides only little information on the respective combination of life cycle step S, question Q, and detail D, denoted S.Q.D, or (ii) does not explicitly target the specific S.Q.D but is amenable to it. If a field is left blank, this indicates that there is no entry in the metadata model of the row (setting aside generic "notes" or "additional comments" entries) that corresponds to the S.Q.D identified by the column. At the end of each row, we also provide a coverage percentage, calculated as the number of details (cells) covered by a metadata model, divided by the number of detail fields in LiQuID.

Interestingly, Figure 3 shows that both general metadata models cover about 30% of LiQuID. Even combined, they only cover 51.7%. Both models contain few fields, some of them too general to be mapped to specific LiQuID fields. Figure 3 further shows that the lowest coverage of 9% is achieved by OM and DNLP. The low coverage of OM can be explained by the fact that the standard describes geological specimens for which an accountability discussion is unlikely. Additionally, these specimens typically do not undergo the processing and maintenance life cycle steps. More interestingly, we observe that while DNL and DS have higher coverage than DNLP, the coverage of these metadata models, which have been proposed with accountability use cases in mind, is generally low. Aggregating them still only covers around 31% of the details considered in LiQuID. This clearly shows that while the proposed metadata models may serve well the specific applications they were engineered for (e.g., information for developers of machine learning pipelines [9]), they do not provide a general metadata model for accountable datasets. This is further validated once we consider a query workload over accountable datasets in Section 4.

Focusing on the domain-specific metadata models, we see that their coverage varies widely between 9% (OM) and 75% (DDI-L).
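The coverage metric used in Figure 3 can be sketched as follows. The model names are those discussed above, but the field sets here are illustrative toy data, not the actual mapping.

```python
# Sketch of the coverage metric from Figure 3: the number of LiQuID detail
# fields (cells) covered by a model, divided by the total number of fields.
TOTAL_FIELDS = 120  # 4 steps x 6 questions x 5 information fields

def coverage(covered_fields):
    """Coverage percentage of a (set of) mapped field identifiers."""
    return round(100 * len(covered_fields) / TOTAL_FIELDS, 1)

# Toy field sets standing in for DC and PROV; each covers 36 of 120 cells.
dc   = {f"field{i}" for i in range(0, 36)}
prov = {f"field{i}" for i in range(20, 56)}

print(coverage(dc), coverage(prov))  # 30.0 30.0
# Aggregated rows (marked * in Figure 3) use the union of a group's fields,
# so overlapping cells are only counted once:
print(coverage(dc | prov))           # 46.7
```

The union explains why aggregated coverage is below the sum of individual coverages whenever models overlap in the fields they describe.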
While their individual coverage may only be moderate, we observe that they cover different details to a different degree. Indeed, the standards complement each other, as shown by the aggregated coverage of 82.5% for this group. Combining all considered models yields a coverage of 90.8%, which shows that many fields included in LiQuID are already deemed important by existing metadata models. However, the systematic structure of LiQuID also reveals "blind spots", as it includes additional fields whose information is still missing from all of the considered metadata models. For instance, we note that while all data life cycle phases are considered, data maintenance is covered less. However, it is reasonable to assume that accountability questions on data maintenance arise, for example when personal data has to be corrected or deleted due to an opt-out of a data subject. Looking at the question level, the Why? question is the least covered element, which is surprising since the management of data should ideally have a goal. Finally, at the information level, both explanations and ethical considerations are scarcely covered by the considered existing data models.

In summary, we observe that existing standards and emerging data models with accountability use cases in mind can all be fully mapped to our metadata model for accountable datasets. The converse does not hold, as LiQuID is not covered 100% by any considered data model. To understand how relevant the information that LiQuID covers is for dataset accountability, we determine an accountability workload and study which information it actually queries.

4 A query workload over accountable datasets

Ideally, a predefined benchmark would be used in order to assess the metadata model objectively. However, we are not aware of any benchmark considering accountability by including a set of questions or queries which are realistic in dataset accountability scenarios.
We therefore contribute a first such benchmark by creating a workload of queries on accountable datasets and then assess LiQuID with respect to this workload. We first introduce our methodology to create the workload in Section 4.1. Section 4.2 then discusses how LiQuID fits this workload. The full workload is also available on our project website.

4.1 Creating the workload

Sources of real-world accountability questions. To determine a realistic workload of queries on metadata models for accountable datasets, sources are needed that describe existing practices, regulations, and questions that arise in settings requiring accountability with respect to data. We identify three such sources.

Our first source comes from the Federal Trade Commission (FTC) [7] and establishes a list of guidelines or statements relating to accountability as part of an in-depth study on data brokers. Data brokers collect personal data about individuals from different sources and sell these data to companies. In an effort to create more transparency on data brokers, the FTC made data brokers answer questions, which they provide in their report. This shows how regulators conduct real audits, and the report includes 101 statements relating to accountability. One sample statement asks to "Provide a list and description as to the nature and purpose of all the products and services (both online and offline) that the Company offers or sells that use personal data. Include a separate description of each product or service identified[.]".

Second, we consider regulations from the General Data Protection Regulation (GDPR) [6], which aims at protecting personal data. It is one of the most restrictive data protection regulations and focuses on data processing; we therefore expect it to be a tough test for metadata models for accountable datasets.
Beyond clarifying what data protection regulators will test for, it also takes into account data subjects, who have the right to contest who uses data about them. From the GDPR regulations, we derive possible questions that aim at verifying the regulations. As an example, consider the following regulation from GDPR Article 3(2): "This Regulation applies to the processing of personal data of data subjects who are in the Union by a controller or processor not established in the Union, where the processing activities are related to: (a) the offering of goods or services, irrespective of whether a payment of the data subject is required, to such data subjects in the Union; or (b) the monitoring of their behaviour as far as their behaviour takes place within the Union.". We associate this with the following accountability questions that may be asked when verifying if and how a regulation applies: "Are the data subjects of the personal data you process in the European Union? Are the personal data processing activities related to the offering of goods or services to data subjects in the European Union? Are the personal data processing activities related to the monitoring of data subjects' behavior that takes place in the European Union?".

Lastly, we conducted an expert survey in order to determine questions asked in dataset assessment deemed relevant by experts. This expert survey extends beyond personal data, which is the focus of both the FTC and the GDPR. Ten experts from various domains (including librarians, data management experts, doctors, and social scientists) participated in the survey. They explained what criteria are important to them when they assess a dataset and what questions they would ask to assess these criteria in the different data life cycle phases.

Overall, from these sources, we obtain 183 textual descriptions of what information is relevant in real-world data accountability scenarios.
From textual descriptions to structured queries. In a next step, we determine a query language that allows us to query the data corresponding to the 183 textual descriptions, assuming the data are represented hierarchically (as in our metadata model). Given the textual descriptions, we observe that the query language needs to support different constructs, in particular conditions, comparisons, for-loops, and the use of equality constraints. Given these requirements, we opt to use XQuery as the query language, as it meets all of them.

Following the questions and statements from the three sources, we derive XQueries defined over an XML Schema that follows our metadata model. That is, when writing the queries, we determine from which fields of our model the relevant information can reasonably be retrieved. Note that we do not claim that our queries cover all possibilities. Also, note that our queries are the result of a best-effort approach to resolve ambiguities in questions or statements. To simplify our queries, we assume further elements nested under the elements defined by our metadata model that structure the data. In practice, such elements may be the result of a domain-specific ontology for different accountability use cases, as supported by our extension requirement.

For example, Algorithm 1 shows the XQuery that translates the FTC statement provided above (repeated in the algorithm's header for convenience). The color coding indicates semantic correspondences between the text and the query. First, the query identifies (who?) the company named "myCompany" that acts as a "Service provider" by offering or selling products and services. It further checks that the company uses (what?) personal data (why?) to include in their products or services. When all these conditions are met, we return the description of the identified product or service, assuming it includes a name and a description.
We additionally return an explanation of the purpose of the product or service that processes personal data.

Algorithm 1: XQuery example derived from the FTC statement “Provide a list and description as to the nature and purpose of all the products and services (both online and offline) that the Company offers or sells that use personal data. Include a separate description of each product or service identified[.]”

    for $x in MDSStore/DataMDS/DataUsage/Who/Information
    where $x/Description/Name = "myCompany"
      and $x/Description/Type = "Service provider"
      and ($x/../../Why/Information/Description/Type = "Product"
           or $x/../../Why/Information/Description/Type = "Service")
      and $x/../../What/Information/Description/Type = "Personal data"
    return
      {$x/../../Why/Information/Description/Name}
      {$x/../../Why/Information/Description/Desc}
      {$x/../../Why/Information/Explanation/Purpose}

4.2 Evaluation of the metadata model w.r.t. the workload

This section studies how our metadata model supports the real-world accountability workload obtained as described in the previous section. From the 183 textual statements and questions, we were able to express 97% with our metadata model. The remaining 3% are statements that refer to (i) information on why some measure was not taken, which is not modeled because it did not happen, or (ii) unintentional data manipulations, which would become intentional as soon as they are modeled in the metadata.

Next, we study which fields of our metadata model are covered by the queries of our workload. Figure 4 shows the coverage of queries based on FTC, GDPR, and the expert survey, as well as the coverage when unifying all queries (row marked with ∗ and with black color; ignore the flags for now). The visualization is analogous to the one in Figure 3. A field is colored when at least one query of the workload refers to it, and coverage is the number of fields referred to by a workload, divided by the 120 fields available in our metadata model.
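To make the logic of the Algorithm 1 query concrete, the following sketch emulates it in Python over a hand-made toy metadata instance. This is an illustration only: the element names follow the paths used in Algorithm 1, but the concrete values ("myCompany", "Dating app", etc.) are invented for the example.

```python
# Emulating Algorithm 1's XQuery logic over a toy XML metadata instance.
import xml.etree.ElementTree as ET

TOY_METADATA = """
<MDSStore>
  <DataMDS>
    <DataUsage>
      <Who>
        <Information>
          <Description><Name>myCompany</Name><Type>Service provider</Type></Description>
        </Information>
      </Who>
      <What>
        <Information>
          <Description><Type>Personal data</Type></Description>
        </Information>
      </What>
      <Why>
        <Information>
          <Description><Type>Service</Type><Name>Dating app</Name><Desc>Recommends matches based on user profiles.</Desc></Description>
          <Explanation><Purpose>Personalized recommendations</Purpose></Explanation>
        </Information>
      </Why>
    </DataUsage>
  </DataMDS>
</MDSStore>
"""

def products_using_personal_data(root):
    """Return (name, desc, purpose) of products/services that 'myCompany', acting
    as a service provider, offers using personal data -- the conditions of Algorithm 1."""
    results = []
    for usage in root.iter("DataUsage"):
        for who in usage.findall("Who/Information"):
            if (who.findtext("Description/Name") == "myCompany"
                    and who.findtext("Description/Type") == "Service provider"):
                # (what?) the DataUsage must involve personal data
                uses_personal = any(
                    what.findtext("Description/Type") == "Personal data"
                    for what in usage.findall("What/Information"))
                # (why?) collect each product or service with its description and purpose
                for why in usage.findall("Why/Information"):
                    if uses_personal and why.findtext("Description/Type") in ("Product", "Service"):
                        results.append((why.findtext("Description/Name"),
                                        why.findtext("Description/Desc"),
                                        why.findtext("Explanation/Purpose")))
    return results

root = ET.fromstring(TOY_METADATA)
print(products_using_personal_data(root))
```

Where XQuery navigates upward via `$x/../..`, the sketch instead iterates over each `DataUsage` element and checks its `Who`, `What`, and `Why` children directly, which expresses the same join over the hierarchy.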
First, we observe that the coverage of workloads from different sources varies between 43.3% for GDPR and 52.2% for the expert survey. However, the workloads complement each other and, when combined, access 75% of the fields in our metadata model. While the necessity of 30 fields of our metadata model is not demonstrated by the workload, we clearly see that a substantial number of fields not covered by any metadata model devised with accountability scenarios in mind is relevant in our workload (cf. Figure 3).

Among the fields accessed when combining all three workloads, we see that 9 fields (flagged fields among the black fields in Figure 4) are among the 11 fields not covered by any other considered metadata model (left white in Figure 3). This validates that our systematic structure and approach in defining the metadata model have contributed to identifying relevant fields not considered by other metadata models.

Fig. 4. Workload coverage.

Finally, assuming that any field that is either accessed by our real-world accountability workload or has been defined by a previous metadata model is relevant, we see that 94.2% of the fields modeled by LiQuID are relevant.

In conclusion, our study of how LiQuID relates to related work and to real-world workloads demonstrates that our metadata model covers a wide range of accountability queries and generalizes existing metadata models well, indicating the completeness of the proposed metadata model. At the same time, it is sufficiently concise, as it does not model a significant amount of information whose relevance still needs to be demonstrated.

5 Conclusion and Outlook

To summarize, we presented a novel metadata model for accountable datasets. It hierarchically models information relevant in scenarios requiring dataset accountability, covering the different steps of the data life cycle and the various questions arising at each step, and structuring the answers based on five attributes.
We presented a detailed review of metadata models that can be considered candidates to enable accountable datasets. We observed that our metadata model can fully cover these, while being more general by modeling additional information. That this additional information is indeed relevant for accountable datasets is validated based on a real-world workload of queries arising in dataset accountability scenarios. Overall, our metadata model is the first model we are aware of that is rich enough to answer all queries of the defined workload.

While the insights gained through the research conducted in this paper are encouraging for using the proposed metadata model in practice, there are still quite a few challenges to tackle as part of future research. First, as we experienced ourselves, informing all fields of the metadata model is a tedious and time-consuming task. Therefore, we plan to investigate how to automatically or semi-automatically fill fields. Another avenue of future research is the integration of accountable datasets in a larger environment, such as a system for accountable decision support [12]. For this, the metadata about datasets needs to be linked to metadata collected about other parts of a system to give a holistic view.

References

1. Access to Biological Collection Data task group: Access to Biological Collection Data (ABCD) (2007), http://www.tdwg.org/standards/115
2. Angwin, J., Larson, J., Mattu, S., Kirchner, L.: Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks (2016), https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing
3. Bender, E.M., Friedman, B.: Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, 587–604 (2018)
4. Data Documentation Initiative: DDI Lifecycle 3.2 (2014), https://ddialliance.org/Specification/DDI-Lifecycle/3.2/
5. DCMI Usage Board: DCMI metadata terms (2020), https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
6. European Parliament, Council of the European Union: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (2016)
7. Federal Trade Commission: Data brokers: A call for transparency and accountability (2014)
8. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Daumé III, H., Crawford, K.: Datasheets for datasets. In: Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning. p. 17 (2018)
9. Holland, S., Hosny, A., Newman, S., Joseph, J., Chmielinski, K.: The dataset nutrition label: A framework to drive higher data quality standards. CoRR p. 21 (2018)
10. Olteanu, A., Castillo, C., Diaz, F., Kiciman, E.: Social data: Biases, methodological pitfalls, and ethical boundaries. SSRN Electronic Journal p. 47 (2016)
11. Open Geospatial Consortium: Observations and Measurements (2013), https://www.ogc.org/standards/om
12. Oppold, S., Herschel, M.: A system framework for personalized and transparent data-driven decisions. In: Advanced Information Systems Engineering. p. 16. Springer International Publishing (2020)
13. Singh, J., Cobbe, J., Norval, C.: Decision provenance: Harnessing data flow for accountable systems. IEEE Access 7, 6562–6574 (2019)
14. The Society of American Archivists: Describing archives: A content standard (2019), https://saa-ts-dacs.github.io/
15. W3C Working Group: An overview of the PROV family of documents (2013), https://www.w3.org/TR/prov-overview/
16. Weitzner, D.J., Abelson, H., Berners-Lee, T., Feigenbaum, J., Hendler, J., Sussman, G.J.: Information accountability. Communications of the ACM 51(6), 82–87 (2008)
17. Wieringa, M.: What to account for when accounting for algorithms: A systematic literature review on algorithmic accountability. In: Fairness, Accountability, and Transparency. pp. 1–18 (2020)
18. Wilkinson, M.D., et al.: The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3(1) (2016)