=Paper=
{{Paper
|id=Vol-3916/paper3
|storemode=property
|title=Data and Dataset Quality for Artificial Intelligence Systems
|pdfUrl=https://ceur-ws.org/Vol-3916/paper_03.pdf
|volume=Vol-3916
|authors=Domenico Natale
|dblpUrl=https://dblp.org/rec/conf/iwesq/Natale24
}}
==Data and Dataset Quality for Artificial Intelligence Systems==
Domenico Natale¹
¹ Chair of Software engineering, Italian Technical Commission UNI TC 504, mirror of SC7/WG6; UNI TC 533 AI; member of CEN/CLC JTC 21 AI/WG3 Engineering aspects
Abstract
This paper deals with the standard on data quality for artificial intelligence (AI), which extends the concept of data to datasets. It presents the evolution from SQuaRE, born in a traditional environment, towards AI, considering the new characteristics of datasets that are essential for machine learning (ML) and AI applications.
Keywords
Artificial intelligence, data, dataset, quality models, characteristics, measures, evaluation
IWESQ’24: International Workshop on Experience with SQuaRE Family and its Future Direction, December 03, 2024, Chongqing, CN
dnatale51@gmail.com (D. Natale)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

1. Introduction

This document aims to promote a common language and a culture of data quality. AI systems, in particular machine learning systems, require data, knowledge and practical experience to be effective. They can be aided by the provision of high-quality data and datasets, including the algorithms that process them.

The emergence of AI systems has necessitated a broad reassessment of data quality. This paper discusses the growing importance of data quality for AI systems, highlighting the definitions provided in the ISO/IEC 5259 series and in several data-related standards. The management involved can adopt quality models for data and provide new and adequate processes in order to guarantee the expected results.

This paper discusses data quality standards in artificial intelligence (AI) systems. Building on traditional data quality models such as ISO/IEC 25012 and ISO/IEC 25024, it describes how these standards are evolving to also meet the unique demands of AI and machine learning (ML) applications.

Since 2008, the most complete standard on data product quality has been considered to be ISO/IEC 25012 “Data quality model” [1], complemented by ISO/IEC 25024 “Measurement of data quality” [2], which is also important for evaluation. With the advent of artificial intelligence, the data quality model has been extended to consider not only individual data but also groups of data (datasets), which are necessary for some algorithms, for machine learning and for procedures.

For this reason, ISO/IEC 25012 and ISO/IEC 25024, developed by SC7/WG6, have been imported into ISO/IEC 5259-2 “Data quality measures” [3], managed by SC42/WG2, with the addition of quality characteristics related to datasets. The relationship between the dataset quality model and measures can improve the performance of the AI system and the design of quality characteristics corresponding to the specific AI requirements. The names of some data measures are cited in paragraph 3.2, as are some results which, combined with software aspects, are useful for the evaluation of AI systems [4].

2. Evolution

Why are we talking about data quality standards? Because standards play an important role in offering better products, harmonizing terms and processes, facilitating communication, enabling synergies, and increasing the interoperability of systems and data exchange through common semantics and shared metadata, thereby contributing to the reduction of costs and incidents.

It must be considered that data is not a digital artefact, as is often the case with software, but above all represents the fact itself and the criterion of truth for each of its aspects. Data is therefore reality, the interpretable representation of what exists. It is the element on which decisions, improvements and developments can be based. Data that follows standard rules is not a convention, but provides information and knowledge about reality. To describe the concept of data quality, ISO/IEC 25012 defines 15 characteristics, inherent and system dependent. ISO/IEC 25024 defines 63 measures to support the quantification of the level of quality achieved.

Thus the AI perspective strengthens data and data quality, putting their features at the center of interest, but it also requires the extension of quality models and measures to datasets, adding 9 characteristics and some measures.

For AI systems, these quality aspects are joined by the new quality characteristics listed below: representativeness (with respect to the population for which the AI is intended, without discrimination), identifiability or de-identifiability, balance, similarity,
diversity, effectiveness, relevance, timeliness, auditability, as defined in 5259-2 for machine learning.

The various aspects mentioned highlight the need for specific legislative governance, processes, coordination of products and technical approaches that include training, management preparation, new processes and the dissemination of product quality principles to the end user. Data governance will need to ensure, among other things, transparency, stakeholder collaboration and accountability in the different areas of data management.

Progress must be made quickly because of the pervasive impact of AI, considering data and information as an aspect of knowledge to be explored and recognized from different points of view:
a) Constraints: laws, legal and technical quality requirements, human rights, ethics, quality models, methodologies, governance, social impacts, monitoring of quality in use;
b) Development: management of systems that process input and output data, assessment of data and software quality, taking into account cybersecurity, risk classes and automation;
c) Technologies: cloud, big data, quantum systems, neural networks, language management, generative models, machine learning, robots, devices, IoT, sensors, hardware, etc.

For each of these points of view, it is reiterated that data, information and knowledge are essential topics for achieving trustworthiness in the new technology.

3. Extending SQuaRE to AI

To better understand the scope of the models developed by SC7/WG6 in the ISO/IEC 25000 family [5], widely used in various countries and in Italy, we have to move from SC7 to the context of SC42, which has identified additional quality aspects for datasets. The documents on data developed by SC42 are not limited to the quality of the data described in 5259-2, but also address the quality of other phases related to machine learning. Data quality is included in a structured and complete view of the standards:
• 5259-1: overview, terminology, examples
• 5259-2: data quality measures
• 5259-3: data quality management requirements and guidelines
• 5259-4: data quality process framework
• 5259-5: data quality governance framework
• CD TR 5259-6: visualization framework for data quality

This paper summarizes in paragraph 3.2 the essential aspects of the standard 5259-2 and its characteristics.

3.1. Data quality

ISO/IEC 25012 defines a quality model for data stored in a structured format within a computer system. It defines fifteen quality characteristics for target data used by humans and systems. It considers all data types (e.g. character strings, texts, dates, numbers, images, sounds, etc.); the scope does not include data produced by embedded devices or real-time sensors that are not retained for further processing or historical purposes.

The data quality characteristics defined in 25012 are divided into inherent and system dependent. From the inherent point of view, data have the intrinsic potential to satisfy stated and implied needs; from the system-dependent point of view, the quality of data depends on the technical domain in which data are used.

The fifteen data quality characteristics are classified as follows:
- Inherent: accuracy, completeness, consistency, credibility, currentness;
- Inherent and system dependent: accessibility, compliance, confidentiality, efficiency, precision, traceability, understandability;
- System dependent: availability, portability, recoverability.

ISO/IEC 25024 specifies 63 measures to quantify the data quality level in relation to the characteristics. 5259-2 describes 42 measures imported from 25024, some adopted as is, some adapted for artificial intelligence.

3.2. Datasets quality

The new quality characteristics of datasets defined in 5259-2 can be divided according to the following points of view:
- Technical: Auditability
- Legal: Identifiability
- Reality oriented: Balance, Diversity, Effectiveness, Relevance, Representativeness, Similarity, Timeliness

In addition, the aspects of provenance and randomness are considered in 5259-2 as examples of essential factors in the preparation of datasets that reinforce the credibility characteristic.

For the quality characteristics of datasets, 5259-2 also reports measures to quantify quality. The names of the measures described in the standard for each characteristic are related to: audited records, resolution balance, balance of images, label proportion balance, category size diversity, category size effectiveness, label effectiveness, identifiability ratio, relevance data in a
record, representativeness ratio, sample similarity, timeliness of data items.

Regarding the aspects of balance, representativeness and risks of bias (unwanted discrimination in datasets), other examples are also available in the literature [6].

For data quality control, measurement is very important to support quality assessment. Measurement is a set of operations whose object is to determine the value of a measure. The most relevant measures in the context of data quality relate to the level of quality achieved (threshold), both for individual data and for the group of data in a dataset.

The adoption of the data quality model is essential for verification and evaluation. When following specific steps to design, control and manage data quality, and to develop and use AI systems, it is necessary to take into account a detailed data life cycle for different purposes (e.g. data governance, data acquisition, format normalization, data bias, source integration, storage, maintenance, decommissioning).

Therefore, entire data quality models can be considered an analytical guide for data preparation and not only a useful instrument for the evaluation and assessment of a single product.

While data quality concepts and models currently follow a predominantly deterministic orientation in 5259-2, functional measures to quantify dataset quality levels can be extended in the future to non-deterministic approaches that introduce statistical probability methods.

4. Propagation of data quality models

Data quality models can be considered a fundamental topic for AI that links new technology to various existing international standards. For a general operational view, we should also consider other standards in addition to those already mentioned, such as at least the following, which mention SQuaRE standards and data models in their text and are listed in the bibliography: ISO/IEC 42001:2023 on AI management systems, ISO/IEC 25059:2023 on the quality model for AI systems, ISO/IEC TS 12791:2024 on the treatment of unwanted bias, and ISO/IEC TR 24027:2021 on bias in AI systems.

ISO/IEC 42001 specifies requirements and provides guidance for establishing, implementing, maintaining and continually improving an AI management system in the context of an organization. This document is intended for use by an organization that provides or uses products or services that use AI systems. The document is intended to help the organization develop, provide or use AI systems responsibly in pursuing its objectives and meeting applicable requirements, obligations related to stakeholders and expectations arising from them, paying attention to data quality in Annex B [7].

ISO/IEC 25059 outlines a quality model for AI systems and is a specific extension of SQuaRE [8]. The characteristics and sub-characteristics detailed in the model provide consistent terminology for specifying, measuring and evaluating AI system quality. They also provide a set of quality characteristics against which the stated quality requirements can be compared for completeness. For example, functional adaptability concerns the degree to which an AI system can accurately learn from data or from the outcome of previous actions and use that information in future predictions. Furthermore, robustness, the new sub-characteristic of reliability, is used to describe the ability of a system to maintain its level of functional correctness under any circumstances, including the presence of unseen, distorted, adversarial or invalid data inputs [9].

ISO/IEC TS 12791 describes how to address unwanted bias in AI systems that use machine learning to perform classification and regression tasks. This document provides mitigation techniques that can be applied throughout the life cycle of the AI system in order to treat unwanted bias. This document is applicable to all types and sizes of organizations [10].

ISO/IEC TR 24027 addresses bias in relation to AI systems, particularly with respect to AI-aided decision-making. Measurement techniques and methods for assessing bias are described, with the aim of addressing and treating bias-related vulnerabilities. All phases of the AI system lifecycle are in scope, including, but not limited to, data collection, training, continual learning, design, testing, evaluation and implementation [11].

5. AI in Europe

A global view of data and dataset quality is described in the technical report CEN/CLC TR 18115 “Data Governance and quality for AI within the European context” [12]. The report, developed for information purposes, describes the essential standards for organizations navigating the digital transformation towards AI, considering quality models, best practices, social aspects, ethics, inclusiveness and healthcare. It is addressed to organizations, agencies, enterprises, developers, universities, data scientists and researchers. It cites standards, experiences and best practices, providing an overview of the standards that aim to promote a common language.
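
As an illustration of what such a common language means in practice, the threshold-based measurement described in paragraph 3.2 can be sketched in a few lines of Python. The sketch is not taken from ISO/IEC 25024 or ISO/IEC 5259-2: the function names, the two ratio formulas, the sample data and the thresholds 0.9 and 0.5 are all assumptions made for illustration. It shows one measure applied to individual data (a completeness-style ratio) and one applied to a dataset as a whole (a label-balance-style ratio), each compared with a required quality level.

```python
from collections import Counter
from typing import Optional, Sequence


def completeness_ratio(values: Sequence[Optional[object]]) -> float:
    """Share of items that are not missing (None); 1.0 for an empty input.

    Hypothetical formula, not the definition given in the standards.
    """
    if not values:
        return 1.0
    present = sum(1 for v in values if v is not None)
    return present / len(values)


def label_balance_ratio(labels: Sequence[str]) -> float:
    """Size of the rarest label class divided by the size of the largest one.

    Hypothetical formula: 1.0 means perfectly balanced classes.
    """
    counts = Counter(labels)
    if not counts:
        return 1.0
    return min(counts.values()) / max(counts.values())


def meets_threshold(value: float, threshold: float) -> bool:
    """A measure passes when its value reaches the required quality level."""
    return value >= threshold


# Assumed sample data: one attribute of individual records, and dataset labels.
ages = [34, None, 51, 27, 45, None, 62, 38, 29, 40]
labels = ["cat"] * 6 + ["dog"] * 3 + ["bird"] * 1

completeness = completeness_ratio(ages)   # 8 of the 10 values are present
balance = label_balance_ratio(labels)     # 1 "bird" against 6 "cat"

# Thresholds are illustrative; in practice they come from the quality requirements.
print(f"completeness  = {completeness:.2f}, passes: {meets_threshold(completeness, 0.9)}")
print(f"label balance = {balance:.2f}, passes: {meets_threshold(balance, 0.5)}")
```

In this assumed example both measures fall below their thresholds, which would flag the attribute for data completion and the dataset for rebalancing before it is used for training.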
The report is consistent with the EU AI Act Regulation [13], which distinguishes harmonized standards, currently under development and therefore compliant with the regulation and prescriptive. All other existing standards, such as those mentioned in this paper, can for now be considered complementary and voluntary. By mapping the connections between the articles of the AI Act and all standards, we obtain a useful structured catalog of “best practices” for the community of standardizers, legal experts, stakeholders and users. The map, developed in the environment of CEN/CLC JTC21 with the contribution of UNINFO and hosted by the Italian association AI Open Mind, offers a structured view of legislation and technology. It connects concepts with coherent terminology, from a syntactic and semantic point of view, and also allows the SQuaRE quality models [5] to be updated. The AI Regulation contains many topics and articles, such as: purpose (art. 1), scope (2), definitions (3), education (4), prohibited AI practices (5), risk management (9), data and governance (10), documentation (11), deployers (13), human oversight (14), quality management (17), obligations of importers and distributors (23-24), authorities (28), conformity in third countries (39), harmonized standards (40), conformity assessment (43), CE marking (48), AI office (64), testing (76), fundamental rights (77), etc.

6. Conclusion

The title of this paper is just the tip of the iceberg. The topic of data and of AI as a whole is huge and is evolving rapidly. The adoption of quality models for data and datasets and the related measurement activities are important for complete quality assurance, supporting useful AI for an efficient decision support system while avoiding bias and discrimination. In addition, data quality must be considered as a complement to the quality of the software that manages the data, including features that are also important such as safety and transparency. The propagation of SQuaRE quality models in many standards confirms the value of the 25000 family. Likewise, the adoption of a managed quality system in organizations developing AI systems is becoming mandatory, paving the way for conformity assessment and quality certification. It is important to underline the consistency of standards and compliance with the law, improving the efficiency of AI system production, respecting ethical aspects, a human-centric vision and inclusiveness, and confirming trustworthiness.

References

[1] ISO/IEC 25012:2008 Software engineering - Software product quality Requirements and Evaluation (SQuaRE) - Data quality model
[2] ISO/IEC 25024:2015 Systems and software engineering - Systems and software quality Requirements and Evaluation (SQuaRE) - Measurement of data quality
[3] ISO/IEC 5259-2:2024 Artificial intelligence - Data quality for analytics and machine learning (ML) - Part 2: Data quality measures
[4] ISO/IEC TS 25058:2024 Systems and software engineering - Systems and software quality Requirements and Evaluation (SQuaRE) - Guidance for quality evaluation of artificial intelligence (AI) systems
[5] ISO/IEC 25000:2014 Systems and software quality Requirements and Evaluation (SQuaRE) - Guide to SQuaRE (summary www.iso25000.it)
[6] A. Simonetta, A. Trenta, M.C. Paoletti, A. Vetrò, “Metrics for identifying bias in datasets”, CEUR Workshop Proceedings, 2021
[7] ISO/IEC 42001:2023 Information technology - Artificial intelligence - Management system
[8] D. Natale, “Extension of ISO/IEC 25000 Quality Models to the Context of Artificial Intelligence”, CEUR-WS, 2022
[9] ISO/IEC 25059:2023 Software engineering - System and software Quality Requirements and Evaluation (SQuaRE) - Quality model for AI systems
[10] ISO/IEC TS 12791:2024 Information technology - Artificial intelligence - Treatment of unwanted bias in classification and regression machine learning tasks
[11] ISO/IEC TR 24027:2021 Information technology - Artificial intelligence (AI) - Bias in AI systems and AI aided decision making
[12] CEN/CLC TR 18115:2024 “Data Governance and quality for AI within the European Context”
[13] Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024