<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Accountable Data Analytics Start with Accountable Data: The LiQuID Metadata Model</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Sarah</forename><surname>Oppold</surname></persName>
							<email>sarah.oppold@ipvs.uni-stuttgart.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Stuttgart</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Melanie</forename><surname>Herschel</surname></persName>
							<email>melanie.herschel@ipvs.uni-stuttgart.de</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Stuttgart</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">National University of Singapore</orgName>
								<address>
									<country key="SG">Singapore</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Accountable Data Analytics Start with Accountable Data: The LiQuID Metadata Model</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">E162E4040C2482BB1243CD12A9EA3067</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T05:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>Metadata Model • Accountability</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Insights based on data are omnipresent. However, in particular in modern data analytics applications, information about the underlying data often remain obscure, hindering accountable data analytics. Recent efforts have been put into better describing such data based on metadata, similarly to what has been done in various scientific disciplines for transparent and reproducible research. Based on a detailed study of various metadata standards and proposals, we observe that existing metadata models do not yet sufficiently cover information that is relevant for data accountability. To fill this gap, this paper proposes LiQuID, a novel metadata model to make datasets accountable throughout their life cycle. It is more general than existing metadata models, which can be mapped to LiQuID. We validate LiQuID for the purpose of dataset accountability based on a real-world workload we created.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Data underlie various insights and decisions today, e.g., for dating recommendations, marketing decisions, scientific findings, or responses to pandemics such as COVID-19. The result of analyzing these data potentially influences various aspects of people's lives. Unfortunately, the development of data analysis pipelines that rely on data is prone to errors. Even though developers of such pipelines may have the best intentions, mistakes are likely to occur <ref type="bibr" target="#b9">[10]</ref>. To understand and account for decisions or insights drawn from data, an important aspect is to account for the underlying data itself, which includes being transparent about the creation, handling, purpose, and meaning of the data.</p><p>Not being aware of the properties or intended purpose of data, and (possibly inadvertently) mishandling and misinterpreting the data as a consequence, can have significant repercussions. One example is the introduction of discrimination into decision support systems, as could be observed with the recidivism prediction system COMPAS <ref type="bibr" target="#b1">[2]</ref>. Another example arises in the COVID-19 pandemic, where lots of data have been shared in a world-wide effort to gain insights. However, as sites like Our World in Data 3 point out, caution has to be applied when reading reported numbers, as it is often unclear what they mean. For instance, are numbers of tests counted as swabs or as individuals tested? What is the source of symptoms reported with cases? Was publishing the data rightful? 
The datasets would clearly benefit from accompanying descriptive data, i.e., metadata, to answer such questions.</p><p>More generally, governments, ethical review boards, scientists, engineers, policy makers, and many more stakeholders need to assess and scrutinize data, e.g., to determine the appropriateness of the data for their purpose or to ensure that data are used correctly, ethically, and lawfully. Information pertinent to this assessment is not included in the data itself; it needs to be provided alongside the data as metadata. This information makes datasets more transparent and can serve as evidence to verify compliance or appropriateness of data with respect to rules or requirements <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b15">16]</ref>. Thereby, we obtain accountable datasets, which we understand as follows in this paper: Accountable datasets are datasets about which there is sufficient information to justify and explain the actions on these datasets to a forum of persons, in addition to descriptive information and information on the people responsible for them. In this paper, we propose to convey the necessary information in the form of metadata. This information enables dataset accountability, where all persons responsible for a dataset, i.e., all persons who have been involved in the life cycle of the dataset, must justify and explain their actions on the dataset with respect to a set of rules, e.g., laws, contracts, or moral rules, to a forum of persons in authority. Our notion of dataset accountability goes beyond information accountability <ref type="bibr" target="#b15">[16]</ref>, which focuses on the appropriate use of data, leaving out all other steps in the life cycle of a dataset such as its creation or maintenance. 
It also complements algorithmic accountability <ref type="bibr" target="#b16">[17]</ref>, which is about the justification of entire algorithmic systems.</p><p>Clearly, the metadata for accountable datasets are very diverse and broad. They cover all phases of the life cycle of a dataset (including data collection, processing, maintenance, and usage) and address different aspects (e.g., meaning, purpose, responsible parties, or ethical considerations). While it is possible to obtain some of the necessary information when "releasing" the data for further use, some pieces of information such as design decisions or responsibilities require collection along the dataset generation process or even prior to its start. Planning in advance what information should be gathered and incorporating this into the design process is therefore beneficial for holistically accountable datasets.</p><p>This paper presents a metadata model for accountable datasets that gives a clear structure of what information is potentially relevant and provides guidance on what questions to consider when handling datasets. It is systematically designed along two dimensions: the first dimension models the different phases of the data life cycle, while the second dimension models essential questions (how, what, why, etc.) that can be asked about each phase. The information answering each question in each life cycle step is structured following five key fields or attributes. 
Overall, the metadata model, which we call LiQuID<ref type="foot" target="#foot_0">4</ref>, is defined such that it can accompany any dataset, e.g., from initial data sources to datasets resulting from complex processing.</p><p>There are plenty of existing, highly domain-specific metadata models that can be considered candidates for supporting dataset accountability, as they have been established to make items of interest and corresponding metadata, e.g., findable, accessible, interoperable, reusable, and repeatable <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b13">14]</ref>, also known as FAIR principles <ref type="bibr" target="#b17">[18]</ref>. Focusing on datasets as particular items of interest, <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref> can be considered emerging approaches towards metadata models for accountable datasets. However, these have been defined in a rather ad-hoc fashion. We show in this paper that LiQuID generalizes the aforementioned existing metadata models. A detailed study of how existing models fit into our metadata model demonstrates both the appropriateness of LiQuID and the gaps in existing models in terms of dataset accountability.</p><p>To be of practical use, it is important that a metadata model for accountable datasets covers the metadata necessary for typical questions that arise when datasets are evaluated or verified. Therefore, we determine a real-world query workload on accountable data, based on analyzing audit literature and the GDPR, and on conducting an expert survey. We observe that LiQuID is the only model we are aware of that can answer all queries of the workload, validating LiQuID's completeness. We further see that the workload requires a substantial fraction (75%) of the fields modeled by LiQuID, indicating its conciseness. 
No other metadata model can fully handle the workload, and 10% of fields required by the workload are not present in any considered existing metadata model.</p><p>In summary, we make the following contributions: (Section 2) a novel metadata model for accountable datasets, called LiQuID; (Section 3) a detailed analysis of existing metadata models with respect to dataset accountability that demonstrates both the appropriateness of our model and the gaps in existing models for accountable data; and (Section 4) a real-world data accountability query workload which we use to validate the completeness of LiQuID.</p><p>2 LiQuID: a metadata model for accountable datasets</p><p>This section presents LiQuID, for which we set the following requirements:</p><p>1. Holistic view: The metadata model covers the whole life cycle of a dataset. 2. Systematic structure: A systematic structure offers clear guidance on what information is potentially relevant. 3. Accountability: Following our notion of dataset accountability, the metadata model should (i) include information on responsible entities (e.g., creators, dataset managers) who can be held responsible for the handling of the data, as well as (ii) leave room for explanations and justifications in anticipation of an accountability discussion. 4. Extension: The metadata model builds on existing and time-tested approaches, maintaining and supporting features that have proven to be important (e.g., type descriptions, ontologies, FAIR principles <ref type="bibr" target="#b17">[18]</ref>).</p><p>After describing the general hierarchical metadata model in Section 2.1, we provide selected details in Section 2.2. Section 2.3 discusses how LiQuID meets the requirements mentioned above. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Metadata model overview</head><p>Figure <ref type="figure" target="#fig_0">1</ref> summarizes our metadata model for accountable data. It is a hierarchical model comprising three levels: the top level models the different data life cycle steps; we therefore call it the life cycle level. For each such step, the second level, named question level, structures metadata according to questions relevant for accountable data. The information level at the leaf level models the actual information per life cycle step and question. Note that by extending a general metadata interface, each element of the metadata model is uniquely identified and may have multiple versions. Note that a full XSD is also available on our project website.<ref type="foot" target="#foot_1">5</ref> Life cycle level. In the life cycle level, we consider four essential steps in the life cycle of a dataset. The first step is data collection, which relates to the creation, gathering, or capture of the data. Data collection typically involves manual entry or gathering, automatic capture of data produced through various processes, or the acquisition of third-party data. The data processing step covers all data manipulations that have altered or transformed the data. Frequently applied data manipulations during preprocessing include data standardization, data cleaning, or aggregation. Under data maintenance, we understand the handling of the dataset once it has been released for further use. This includes a wide range of data management operations, e.g., updates, additions, deletions of (some) data, its archival, or destruction. Finally, data usage takes into account past, present, and anticipated activities supported by or applied on the dataset, e.g., input to machine learning algorithms or distribution to other parties. 
Although sometimes seen as separate data life cycle steps, note that we consider information on data storage and distribution as part of the information on data usage, as they are essential when the need to account for proper use of data arises and are thus jointly queried with the data usage information. Question level. The second level structures the information that LiQuID covers for every step of the life cycle by commonly used WH-questions: Why?, Who?, When?, Where?, How?, What?. This categorization follows the general human rationale of asking for information about an entity of interest. While this may appear simplistic, we believe this simplicity makes LiQuID easy to understand and use. We show later that this structure actually covers all the information contained in other metadata models, and more, providing evidence that this intuitive model is nevertheless effective in covering the required information. Information level. While the first two levels essentially serve to contextualize the information to be provided for accountable data, the information level organizes the information needed for each life cycle step and question in five fields. The first field is a description that answers a WH-question for a data life cycle step. In order to invite explanations and justifications, which are essential for accountability, the information level additionally models fields for explanation, legal considerations, ethical considerations, as well as limitations of the answer.</p></div>
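To make the hierarchy concrete, the three levels can be sketched as a small data structure. The following Python sketch is purely illustrative: it is not the XSD published on the project website, and all class, field, and key names are our own choices.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the LiQuID hierarchy (not the published XSD):
# life cycle step -> WH-question -> five information fields.
LIFE_CYCLE_STEPS = ["collection", "processing", "maintenance", "usage"]
WH_QUESTIONS = ["why", "who", "when", "where", "how", "what"]

@dataclass
class InformationSheet:
    """Information level: the five fields answering one question for one step."""
    description: str = ""
    explanation: str = ""
    legal_considerations: str = ""
    ethical_considerations: str = ""
    limitations: str = ""

@dataclass
class LiQuIDMetadata:
    """One record per dataset: an information sheet for every (step, question) pair."""
    sheets: dict = field(default_factory=lambda: {
        (step, question): InformationSheet()
        for step in LIFE_CYCLE_STEPS
        for question in WH_QUESTIONS
    })

    def sheet(self, step: str, question: str) -> InformationSheet:
        return self.sheets[(step, question)]

md = LiQuIDMetadata()
md.sheet("collection", "who").description = "Hospital H, head of service"
print(len(md.sheets))  # 4 steps x 6 questions = 24 sheets
```

In this reading, the two upper levels only contextualize the leaf sheets, which is exactly how LiQuID organizes its information level.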
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Information details</head><p>To get a better understanding of what information our metadata model covers, Table <ref type="table">2</ref> summarizes what is understood as relevant information for three combinations of a life cycle step S and a question Q, denoted as S.Q. An exhaustive description for all combinations is available on our project website. As the examples in Figure <ref type="figure" target="#fig_1">2</ref> show, we associate a list of questions with each field of each S.Q combination. These are intended to help populate the metadata sheets.</p><p>In the subsequent discussion, we focus on the questions to consider when filling out the information relating to collection.who. We make up a simplistic example to illustrate the potential content of each field. The example considers a hospital that collects case numbers for a particular disease.</p><p>-Description: Considering the question Who? during data collection, the description includes information on who (people, organizations) was involved in the data collection process. It further encompasses any information relevant to their identification, their role in the data collection process, information necessary to assess their qualifications to fulfil this role, and any details about these people or organizations that may impact the data. In our example, we would report the hospital and the head of the service responsible for collecting accurate numbers. -Explanation: As part of the explanation, a justification on why these particular people were involved in the data collection process can be provided.</p><p>Continuing our example, we explain that this hospital is collecting these numbers as they are the only medical facility to treat the disease in a larger area. The responsible person is justified by her job description.  
-Legal Considerations: To ensure that the persons involved in data collection had the right to do so, legal considerations recording why it was lawful that these people were involved in the data collection process are included in the metadata sheet. In our example, this includes an acknowledgement that the hospital is legally allowed to collect these data, e.g., based on disease control regulations. -Ethical Considerations: We also consider ethical questions, asking why it was ethically justifiable that these people were involved in the data collection. For instance, if the hospital receives funding depending on the number of cases, has a conflict of interest been ruled out? -Limitations: Finally, the metadata model offers the possibility to clarify (i) what limitations in the dataset could result from the selection of persons involved in the data collection (based on their characteristics or qualifications available in the description); (ii) what limitations for the overall objective (Why?) could result from the choice of people; (iii) what efforts have been made to mitigate the identified limitations; or (iv) why there are no limitations. In our example, a limitation is that the data may lag behind the actual situation given internal processes at the hospital. But mechanisms have been put in place to ensure the lag does not exceed 24 hours.</p></div>
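Put together, the collection.who sheet of the running hospital example could be populated as follows. This is a sketch only: the dictionary layout and the wording of the entries are our own paraphrase of the example above, not an excerpt of an actual LiQuID instance.

```python
# Hypothetical collection.who sheet for the hospital case-number example;
# field names follow LiQuID's information level, content paraphrases the text.
collection_who = {
    "description": (
        "Case numbers are collected by Hospital H; the head of service "
        "is responsible for reporting accurate numbers."
    ),
    "explanation": (
        "Hospital H is the only medical facility treating the disease in "
        "the area; the responsible person's role follows her job description."
    ),
    "legal_considerations": (
        "Collection is lawful, e.g., under disease control regulations."
    ),
    "ethical_considerations": (
        "A conflict of interest (e.g., case-dependent funding) was ruled out."
    ),
    "limitations": (
        "Reported numbers may lag behind the actual situation; internal "
        "mechanisms bound the lag to at most 24 hours."
    ),
}

# Every LiQuID information field is present exactly once.
expected = {"description", "explanation", "legal_considerations",
            "ethical_considerations", "limitations"}
assert set(collection_who) == expected
```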
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Discussion</head><p>Having introduced both requirements for our metadata model and the model itself, we now review how LiQuID meets the requirements.</p><p>First, the metadata model should offer a holistic view on a dataset, covering its whole life cycle. This is achieved by the life cycle level of LiQuID that considers the essential steps of a dataset's life cycle.</p><p>A systematic structure that provides guidance on what information to consider is given by the overall hierarchical structure of LiQuID. For each data life cycle step, it asks WH-questions, which are the self-evident human rationale for assessment. Each question can be answered in a structured way, dictated by the information level.</p><p>Let us now review how LiQuID supports our accountability requirement. On the one hand, accountability is supported by asking for responsible entities, which should be described in the Who? question of each life cycle step. On the other hand, the anticipated accountability discussion has to be modeled without actually being able to know the questions. But since the questions can be expected to be critical inquiries of the decisions made in the different life cycle steps, the metadata model encourages users to think about such critical questions and leaves room for responses by providing the information fields for explanation, legal considerations, ethical considerations, and limitations.</p><p>Finally, the metadata model should be compatible with existing metadata models by extending these. As we will discuss in detail in Section 3, LiQuID generalizes existing models, which can be mapped into our metadata model. We also assume that details modeled by well-established standards, definitions, and ontologies can be "docked" at the information level, i.e., each field modeled at the information level contains further structured elements that are application-dependent.</p></div>
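The "docking" of application-dependent structure mentioned above can be pictured as nesting domain-specific elements below an information field. In the following sketch, the `foaf`-style organization entry is an invented example of such a docked element; none of the keys shown are prescribed by LiQuID.

```python
# Illustrative only: an information field holding further structured,
# application-dependent elements (e.g., from a domain ontology) in
# addition to its free text. The ontology terms below are made up.
sheet = {
    "collection": {
        "who": {
            "description": {
                "text": "Hospital H collects the case numbers.",
                # docked, domain-specific sub-elements (hypothetical):
                "foaf:Organization": {"name": "Hospital H", "country": "DE"},
                "role": "data collector",
            }
        }
    }
}

def description_text(md: dict, step: str, question: str) -> str:
    """Read the free-text part of a description, ignoring docked extensions."""
    return md[step][question]["description"]["text"]

print(description_text(sheet, "collection", "who"))
```

The point of the sketch is that tools reading only LiQuID's generic fields keep working, while domain-specific consumers can interpret the docked elements.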
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Comparative assessment</head><p>This section compares our metadata model for accountable data with established, time-tested, and revised metadata models used in various disciplines. Even though their subject of interest and purpose differ from those of our metadata model for accountable datasets, they implicitly represent accumulated knowledge of what information is deemed important to describe some subject of interest. More specifically, we map nine existing metadata models to LiQuID. To this end, any field specified by an existing model is mapped to the corresponding field(s) in LiQuID. Note that we obtain a complete mapping, in the sense that we could map all information modeled by an existing metadata model to LiQuID.</p><p>For our comparative assessment, we choose metadata models with varying specificity and from various domains. These include two general models <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b14">15]</ref>, four standards to describe an item of interest arising in various domains <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b13">14]</ref>, and three emerging metadata models for fair, accountable, and transparent datasets <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>.</p><p>-Dublin Core (DC) <ref type="bibr" target="#b4">[5]</ref>, a very general, conceptual metadata model often used as a base for other models; -W3C PROV (PROV) <ref type="bibr" target="#b14">[15]</ref>, another very general metadata model which focuses on describing the lineage of some end product; -Describing Archives: A Content Standard (DACS) <ref type="bibr" target="#b13">[14]</ref> specifies archiving principles and a metadata model for (aggregations of) archival records on, e.g., books, reports, or movies; -Access to 
Biological Collection Data (ABCD) <ref type="bibr" target="#b0">[1]</ref>, a metadata model implementation for biological sample collections; -Observations and Measurements (OM) <ref type="bibr" target="#b10">[11]</ref> emerged from the Open Geospatial Consortium and defines a general conceptual schema for observations and measurements as well as sampling details; -Data Documentation Initiative Lifecycle (DDI-L) <ref type="bibr" target="#b3">[4]</ref>, a metadata implementation describing (groups of) social studies based on questionnaires; -Datasheets for Datasets (DS) <ref type="bibr" target="#b7">[8]</ref> document (personal) datasets, allowing them to be examined for new machine learning applications within the context of fair machine learning; -Data Nutrition Labels (DNL) <ref type="bibr" target="#b8">[9]</ref> provide automatically generated modular labels that describe datasets and are intended to enable accountable AI; -Data Statements for NLP (DNLP) <ref type="bibr" target="#b2">[3]</ref> describe spoken or written texts in order to enable fair natural language processing.</p><p>To determine to what extent the existing metadata models cover our model, we study the existing models in detail and map the entries they specify to LiQuID. Figure <ref type="figure" target="#fig_2">3</ref> depicts a visualization of this mapping. The columns reflect the leaves of our hierarchical model (i.e., each column corresponds to an information field under a question and life cycle step, information fields being in the order listed in Section 2.2). Rows filled with colors represent the metadata models listed above. Rows labeled * and filled with a dark color aggregate a group of metadata models. A colored cell indicates that there is at least one specified entry in the metadata model (row) which corresponds to the respective information field of LiQuID (column). Different colors are used to distinguish the different life cycle steps to enhance readability. 
We color fields generously: a field is colored if (i) the model provides only little information on the respective combination of life cycle step S, question Q, and detail D, denoted S.Q.D, or (ii) one of its entries is not explicitly meant for, but amenable to, the specific S.Q.D. If a field is left blank, this indicates that there is no entry in the metadata model of the row ("notes" or "additional comments" set aside) that corresponds to the S.Q.D identified by the column. At the end of each row, we also provide a coverage percentage, calculated as the number of details (cells) covered by a metadata model, divided by the number of detailed fields in LiQuID.</p><p>Interestingly, Figure <ref type="figure" target="#fig_2">3</ref> shows that both general metadata models cover about 30% of LiQuID. Even combined they only cover 51.7%. Both models contain few fields, some of them too general to be mapped to specific LiQuID fields.</p><p>Figure <ref type="figure" target="#fig_2">3</ref> shows that the lowest coverage of 9% is achieved by OM and DNLP. The low coverage of OM can be explained by the standard describing geological specimens, for which an accountability discussion is unlikely. Additionally, these specimens typically do not undergo the processing and maintenance life cycle steps.</p><p>More interestingly, we observe that while DNL and DS have higher coverage than DNLP, the coverage of these metadata models, which have been proposed with accountability use cases in mind, is generally low. Aggregating them still only covers around 31% of the details considered in LiQuID. This clearly shows that while the proposed metadata models may serve the specific application they were engineered for well (e.g., information for developers of machine learning pipelines <ref type="bibr" target="#b8">[9]</ref>), they do not provide a general metadata model for accountable datasets. 
This is further validated once we consider a query workload over accountable datasets in Section 4.</p><p>Focusing on the domain-specific metadata models, we see that their coverage varies widely, between 9% (OM) and 75% (DDI-L). While their individual coverage may only be moderate, we observe that they cover different details to a different degree. Indeed, the standards complement each other, as shown by this group's aggregated coverage of 82.5%.</p><p>The combined coverage of 90.8% across all considered models shows that many fields included in LiQuID are already deemed important by existing metadata models. However, the systematic structure of LiQuID also reveals "blind spots", as it includes additional fields where information is still missing from any of the considered metadata models. For instance, we note that while all data life cycle phases are considered, data maintenance is covered less. However, it is reasonable to assume that accountability questions on data maintenance arise, for example when personal data has to be corrected or deleted due to an opt-out of a data subject. Looking at the question level, the Why? question is the least covered element, which is surprising since the management of data should ideally have a goal. Finally, at the information level, both explanations and ethical considerations are scarcely covered by the considered existing data models.</p><p>In summary, we observe that existing standards and emerging data models designed with accountability use cases in mind can all be fully mapped to our metadata model for accountable datasets. The converse does not hold, as LiQuID is not fully covered by any considered data model. To understand how relevant the information that LiQuID covers is for dataset accountability, we determine an accountability workload and study which information it actually queries.</p></div>
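The coverage percentages used throughout this section can be reproduced mechanically once each model's mapping is given as the set of LiQuID detail fields (S.Q.D combinations) it covers. The sketch below uses made-up toy fields and models, not the actual mapping of Figure 3.

```python
# Sketch of the coverage computation from Section 3: a model covers a
# detail field if at least one of its entries maps to it. Toy data only.
def coverage(covered_fields: set, all_fields: list) -> float:
    """Fraction of LiQuID detail fields covered by one metadata model."""
    return len(covered_fields & set(all_fields)) / len(all_fields)

def combined_coverage(models: dict, all_fields: list) -> float:
    """Coverage of the union of several models (the '*' rows in Figure 3)."""
    union = set().union(*models.values())
    return coverage(union, all_fields)

# Toy universe of 10 detail fields (LiQuID itself has one per S.Q.D).
fields = [f"f{i}" for i in range(10)]
models = {
    "A": {"f0", "f1", "f2"},
    "B": {"f2", "f3", "f4", "f5"},
}
print(coverage(models["A"], fields))      # 0.3
print(combined_coverage(models, fields))  # 0.6
```

Note that the combined coverage is computed over the union of covered fields, so overlapping fields (here f2) are counted only once.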
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">A query workload over accountable datasets</head><p>Ideally, a predefined benchmark would be used in order to assess the metadata model objectively. However, we are not aware of any benchmark considering accountability by including a set of questions or queries which are realistic in dataset accountability scenarios. We therefore contribute a first such benchmark by creating a workload of queries on accountable datasets and then assess LiQuID with respect to this workload. We first introduce our methodology to create the workload in Section 4.1. Section 4.2 then discusses how LiQuID fits this workload. The full workload is also available on our project website.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Creating the workload</head><p>Sources of real-world accountability questions. To determine a realistic workload of queries on metadata models for accountable datasets, sources are needed that describe existing practices, regulations, and questions that arise in settings requiring accountability with respect to data.</p><p>We identify three such sources. Our first source comes from the Federal Trade Commission (FTC) <ref type="bibr" target="#b6">[7]</ref> and establishes a list of guidelines or statements relating to accountability as part of an in-depth study on data brokers. Data brokers collect personal data about individuals from different sources and sell these data to companies. In an effort to create more transparency on data brokers, the FTC made data brokers answer questions, which they provide in their report. This shows how regulators conduct real audits, and the report includes 101 statements relating to accountability. One sample statement asks to "Provide a list and description as to the nature and purpose of all the products and services (both online and offline) that the Company offers or sells that use personal data. Include a separate description of each product or service identified[.]".</p><p>Second, we consider regulations from the General Data Protection Regulation (GDPR) <ref type="bibr" target="#b5">[6]</ref>, which aims at protecting personal data. It is one of the most restrictive data protection regulations and focuses on data processing; we therefore expect it to be a tough test for metadata models for accountable datasets.</p><p>Beyond clarifying what data protection regulators will test for, it also takes into account data subjects who have the right to contest who uses data about them. From the GDPR regulations, we derive possible questions that aim at verifying the regulations. 
As an example, consider the following regulation from GDPR Article 3(2): "This Regulation applies to the processing of personal data of data subjects who are in the Union by a controller or processor not established in the Union, where the processing activities are related to: (a) the offering of goods or services, irrespective of whether a payment of the data subject is required, to such data subjects in the Union; or (b) the monitoring of their behaviour as far as their behaviour takes place within the Union.". We associate this with the following accountability questions that may be asked when verifying if and how a regulation applies: "Are the data subjects of the personal data you process in the European Union? Are the personal data processing activities related to the offering of goods or services to data subjects in the European Union? Are the personal data processing activities related to the monitoring of data subjects' behavior that takes place in the European Union?".</p><p>Lastly, we conducted an expert survey in order to determine questions asked in dataset assessment deemed relevant by experts. This expert survey extends beyond personal data, which is the focus of both the FTC and GDPR sources. Ten experts from various domains (including librarians, data management experts, doctors, social scientists) participated in the survey. They explained what criteria are important to them when they assess a dataset and what questions they would ask to assess these criteria in the different data life cycle phases.</p><p>Overall, from these sources, we obtain 183 textual descriptions of what information is relevant in real-world data accountability scenarios.</p><p>From textual descriptions to structured queries. As a next step, we determine a query language that allows us to query the data corresponding to the 183 textual descriptions, assuming the data are represented hierarchically (as in our metadata model). 
Given the textual descriptions, we observe that the query language needs to support different constructs, in particular conditions, comparisons, for-loops, and equality constraints. Given these requirements, we opt for XQuery as our query language, as it meets all of them.</p><p>Following the questions and statements from the three sources, we derive XQueries, defined over an XML Schema that follows our metadata model. That is, when writing the queries, we determine from which fields of our model the relevant information can reasonably be retrieved. Note that we do not claim that our queries cover all possibilities. Also note that our queries are the result of a best-effort approach to resolve ambiguities in questions or statements. To simplify our queries, we assume additional elements, nested under the elements defined by our metadata model, that structure the data. In practice, such elements may result from a domain-specific ontology for different accountability use cases, as supported by our extension requirement.</p><p>For example, Algorithm 1 shows the XQuery that translates the FTC statement provided above (repeated in the algorithm's header for convenience). The color coding indicates semantic correspondences between the text and the query. First, the query identifies (who?) the company named "myCompany", which acts as a "Service provider" by offering or selling products and services. It further checks that the company uses (what?) personal data (why?) to include in its products or services. When all these conditions are met, the query returns the description of the identified product or service, assuming it includes a name and a description, thereby explaining the purpose of the product or service that processes personal data. Algorithm 1: XQuery example derived from the FTC statement "Provide a list and description as to the nature and purpose of all the products and services (both online and offline) that the Company offers or sells that use personal data. 
Include a separate description of each product or service identified[.]" </p></div>
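The logic of Algorithm 1 can also be sketched outside XQuery. The following Python snippet emulates the same query over a small, entirely hypothetical XML serialization of LiQuID metadata (element names, attributes, and the sample document are illustrative assumptions, not the paper's actual XML Schema): select the company "myCompany" acting as a "Service provider" and return name and description of its products that use personal data.

```python
# Sketch of the FTC query from Algorithm 1 over hypothetical LiQuID metadata.
# All element and attribute names below are illustrative assumptions.
import xml.etree.ElementTree as ET

metadata = ET.fromstring("""
<dataset>
  <creation>
    <who><company role="Service provider">myCompany</company></who>
    <what>
      <product usesPersonalData="true">
        <name>AdScore</name>
        <description>Scores consumers for targeted advertising.</description>
      </product>
      <product usesPersonalData="false">
        <name>WeatherFeed</name>
        <description>Aggregated weather observations.</description>
      </product>
    </what>
  </creation>
</dataset>
""")

def products_using_personal_data(root, company="myCompany"):
    """Return (name, description) of every product the given company
    offers or sells (as a "Service provider") that uses personal data."""
    results = []
    for phase in root:  # iterate over life cycle phases
        who = phase.find("who/company")
        if who is None or who.text != company or who.get("role") != "Service provider":
            continue
        for product in phase.findall("what/product[@usesPersonalData='true']"):
            results.append((product.findtext("name"), product.findtext("description")))
    return results

print(products_using_personal_data(metadata))
# → [('AdScore', 'Scores consumers for targeted advertising.')]
```

As in the XQuery, the who/what/why fields of the metadata model drive the selection: the "who" condition filters the company and its role, and the "what" condition filters products by their use of personal data.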
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Evaluation of the metadata model w.r.t. the workload</head><p>This section studies how our metadata model supports the real-world accountability workload obtained as described in the previous section.</p><p>Of the 183 textual statements and questions, we were able to express 97% with our metadata model. The remaining 3% are statements that refer to (i) information on why some measure was not taken, which is not modeled because the measure did not happen, or (ii) unintentional data manipulations, which would become intentional as soon as they are modeled in the metadata.</p><p>Next, we study which fields of our metadata model are covered by the queries of our workload. Figure <ref type="figure" target="#fig_3">4</ref> shows the coverage of queries based on the FTC report, the GDPR, and the expert survey, as well as the coverage when unifying all queries (row marked with * and with black color; ignore the flags for now). The visualization is analogous to the visualization in Figure <ref type="figure" target="#fig_2">3</ref>. A field is colored when at least one query of the workload refers to it, and coverage is the number of fields referred to by a workload, divided by the 120 fields available in our metadata model.</p><p>First, we observe that the coverage of workloads from different sources varies between 43.3% for the GDPR and 52.2% for the expert survey. However, the workloads complement each other and, when combined, access 75% of the fields in our metadata model. While the necessity of 30 fields of our metadata model is not demonstrated by the workload, we clearly see that a substantial number of fields not covered by any other considered metadata model is relevant in our workload (cf. 
Figure <ref type="figure" target="#fig_2">3</ref>).</p><p>Among the fields accessed when combining all three workloads, 9 fields (flagged fields among the black fields in Figure <ref type="figure" target="#fig_3">4</ref>) are among the 11 fields not covered by any other considered metadata model (left white in Figure <ref type="figure" target="#fig_2">3</ref>). This validates that our systematic structure and approach in defining the metadata model have contributed to identifying relevant fields not considered by other metadata models.</p><p>Finally, assuming that any field that is either accessed by our real-world accountability workload or has been defined by a previous metadata model is relevant, we see that 94.2% of the fields modeled by LiQuID are relevant.</p><p>In conclusion, our study of how LiQuID relates to related work and real-world workloads demonstrates that our metadata model covers a wide range of accountability queries and generalizes existing metadata models well, indicating the completeness of the proposed model. At the same time, it is sufficiently concise, as it does not model significant amounts of information whose relevance remains to be demonstrated.</p></div>
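The coverage metric used above is simple enough to state as code: count the distinct fields touched by at least one query of a workload and divide by the 120 fields of the model. The sketch below uses hypothetical field identifiers; only the counts (52 and 90 covered fields) are chosen so that the result reproduces the reported 43.3% (GDPR) and 75% (combined) figures.

```python
# Coverage as defined in Section 4.2: distinct fields referred to by at
# least one query of a workload, divided by the 120 fields of the model.
TOTAL_FIELDS = 120

def coverage(covered_fields):
    """Percentage of the 120 model fields touched by a workload."""
    return round(100 * len(set(covered_fields)) / TOTAL_FIELDS, 1)

# Hypothetical field identifiers; the counts match the reported figures.
gdpr_fields = {f"f{i}" for i in range(52)}       # 52 fields
combined_fields = {f"f{i}" for i in range(90)}   # union over all sources

print(coverage(gdpr_fields))      # → 43.3
print(coverage(combined_fields))  # → 75.0
```

Because coverage of the union is computed over the set union of covered fields, complementary workloads (such as the FTC, GDPR, and survey queries) can jointly cover far more fields than any single source.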
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion and Outlook</head><p>To summarize, we presented a novel metadata model for accountable datasets. It hierarchically structures information relevant in scenarios requiring dataset accountability, covering different steps of the data life cycle and various questions arising at each step, and structuring the answers based on five attributes. We presented a detailed review of metadata models that can be considered candidates to enable accountable datasets. We observed that our metadata model can fully cover these, while being more general by modeling additional information. That this additional information is indeed relevant for accountable datasets is validated based on a real-world workload of queries arising in dataset accountability scenarios. Overall, our metadata model is the first model we are aware of that is rich enough to answer all queries of the defined workload.</p><p>While the insights gained through the research conducted in this paper are encouraging for using the proposed metadata model in practice, there are still quite a few challenges to tackle as part of future research. First, as we experienced ourselves, filling in all fields of the metadata model is a tedious and time-consuming task. Therefore, we plan to investigate how to automatically or semi-automatically fill fields. Another avenue of future research is the integration of accountable datasets in a larger environment, such as a system for accountable decision support <ref type="bibr" target="#b11">[12]</ref>. For this, the metadata about datasets needs to be linked to metadata collected about other parts of a system to give a holistic view.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. 
Overview of LiQuID, showing the different levels of the hierarchical model, namely the life cycle level, question level, and information level.</figDesc><graphic coords="4,134.77,126.15,345.83,184.88" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2.</head><label>2</label><figDesc>Fig. 2. Examples of information covered for different S.Q combinations.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 3.</head><label>3</label><figDesc>Fig. 3. Mapping between LiQuID (columns) and other existing metadata models.</figDesc><graphic coords="8,134.77,126.15,345.82,186.93" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 4.</head><label>4</label><figDesc>Fig. 4. Workload coverage.</figDesc><graphic coords="13,134.77,126.15,345.83,77.56" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_0">Name refers to the modeled Life cycle steps, Questions, and Information about Data</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_1">https://www.ipvs.uni-stuttgart.de/departments/de/research/projects/fat dss/</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="http://www.tdwg.org/standards/115" />
		<title level="m">Access to Biological Collection Data task group: Access to Biological Collection Data (ABCD)</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Angwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Larson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mattu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kirchner</surname></persName>
		</author>
		<ptr target="https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing" />
		<title level="m">Machine bias: There&apos;s software used across the country to predict future criminals. and it&apos;s biased against blacks</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Data statements for natural language processing: Toward mitigating system bias and enabling better science</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Bender</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Friedman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="587" to="604" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<ptr target="https://ddialliance.org/Specification/DDI-Lifecycle/3.2/" />
		<title level="m">Data Documentation Initiative: DDI lifecycle 3</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<ptr target="https://www.dublincore.org/specifications/dublin-core/dcmi-terms/" />
		<title level="m">DCMI Usage Board: DCMI metadata terms</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data</title>
	</analytic>
	<monogr>
		<title level="m">and on the free movement of such data, and repealing Directive 95/46/EC</title>
				<imprint>
			<publisher>General Data Protection Regulation</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note>Council of the European Union: Regulation (</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m">Data brokers: A call for transparency and accountability</title>
				<imprint>
			<publisher>Federal Trade Commission</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Datasheets for datasets</title>
		<author>
			<persName><forename type="first">T</forename><surname>Gebru</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Morgenstern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Vecchione</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Vaughan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wallach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Daumé III</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Crawford</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning</title>
				<meeting>the 5th Workshop on Fairness, Accountability, and Transparency in Machine Learning</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page">17</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The dataset nutrition label: A framework to drive higher data quality standards</title>
		<author>
			<persName><forename type="first">S</forename><surname>Holland</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hosny</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Newman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Joseph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chmielinski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CoRR</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Social data: Biases, methodological pitfalls, and ethical boundaries</title>
		<author>
			<persName><forename type="first">A</forename><surname>Olteanu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Castillo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Diaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kiciman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SSRN Electronic Journal</title>
		<imprint>
			<biblScope unit="volume">47</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<ptr target="https://www.ogc.org/standards/om" />
		<title level="m">Open Geospatial Consortium: Observations and Measurements</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A system framework for personalized and transparent data-driven decisions</title>
		<author>
			<persName><forename type="first">S</forename><surname>Oppold</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Herschel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advanced Information Systems Engineering</title>
				<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page">16</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Decision provenance: Harnessing data flow for accountable systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Cobbe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Norval</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="6562" to="6574" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<ptr target="https://saa-ts-dacs.github.io/" />
		<title level="m">The Society of American Archivists: Describing archives: A content standard</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<ptr target="https://www.w3.org/TR/prov-overview/" />
		<title level="m">W3C Working Group: An overview of the PROV family of documents</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Information accountability</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Weitzner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Abelson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Berners-Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Feigenbaum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hendler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">J</forename><surname>Sussman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">51</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="82" to="87" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">What to account for when accounting for algorithms: A systematic literature review on algorithmic accountability</title>
		<author>
			<persName><forename type="first">M</forename><surname>Wieringa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Fairness, Accountability, and Transparency</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The FAIR guiding principles for scientific data management and stewardship</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Wilkinson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Scientific Data</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
