=Paper= {{Paper |id=Vol-3916/paper3 |storemode=property |title=Data and Dataset Quality for Artificial Intelligence Systems |pdfUrl=https://ceur-ws.org/Vol-3916/paper_03.pdf |volume=Vol-3916 |authors=Domenico Natale |dblpUrl=https://dblp.org/rec/conf/iwesq/Natale24 }} ==Data and Dataset Quality for Artificial Intelligence Systems== https://ceur-ws.org/Vol-3916/paper_03.pdf
                                Data and Dataset Quality for Artificial Intelligence
                                Systems
                                Domenico Natale1
                                1 Chair of Software engineering Italian Technical Commission UNI TC 504, mirror of SC7/WG6; UNI TC 533 AI; member of
                                CEN/CLC JTC 21 AI/WG3 Engineering aspects

                                                 Abstract
                                                 This paper deals with the standard on data quality for artificial intelligence (AI), which extends the
                                                 concept of data to datasets. It presents the evolution of SQuaRE, born in a traditional environment, towards
                                                 AI, considering the new characteristics of datasets essential for machine learning (ML) and AI applications.

                                                 Keywords
                                                 Artificial intelligence, data, dataset, quality models, characteristics, measures, evaluation



IWESQ’24: International Workshop on Experience with SQuaRE Family and its Future Direction, December 03, 2024, Chongqing, CN
dnatale51@gmail.com (D. Natale)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

This document aims to promote a common language and a culture of data quality. AI systems, in particular machine learning, require data, knowledge and practical experience to be effective. Systems can be aided by the provision of high-quality data and datasets, including algorithms to process them.

The emergence of AI systems has necessitated a broad reassessment of data quality. This paper discusses the growing importance of data quality when considering AI systems, highlighting the definitions provided in the ISO/IEC 5259 series and in several data-related standards. The management involved can adopt quality models for data and provide new and adequate processes, in order to guarantee the expected results.

This paper discusses data quality standards in artificial intelligence (AI) systems. Building on traditional data quality models such as ISO/IEC 25012 and ISO/IEC 25024, it shows how these standards are evolving to also meet the unique demands of AI and machine learning (ML) applications.

Since 2008, ISO/IEC 25012 “Data quality model” [1] has been considered the most complete standard on data product quality, complemented by ISO/IEC 25024 “Measurement of data quality” [2], which is also important for evaluation. With the advent of Artificial Intelligence, the data quality model has been extended to consider not only individual data, but also groups of data (datasets) necessary for some algorithms and for machine learning and related procedures.

For this reason, ISO/IEC 25012 and ISO/IEC 25024, developed by SC7/WG6, have been imported into ISO/IEC 5259-2 “Data quality measures” [3], managed by SC42/WG2, with the addition of quality characteristics related to datasets. The relationship between the dataset quality model and measures can improve the performance of the AI system and the design of quality characteristics corresponding to the specific AI requirements. The names of some data measures are cited in paragraph 3.2, as well as some results combined with software aspects useful for evaluation of AI systems [4].

2. Evolution

Why do we talk about data quality standards? Because standards play an important role in offering better products, harmonizing terms and processes, facilitating communication, enabling synergies, and increasing interoperability of systems and data exchange, with common semantics and shared metadata, contributing to the reduction of costs and incidents.

It must be considered that data are not a digital artefact, as often happens with software, but above all represent the fact itself and the criterion of truth for each of its aspects. Data are therefore reality, the interpretable representation of what exists. They are the elements on which decisions, improvements and developments can be based. Data, which follow standard rules, are not a convention, but provide information and knowledge of reality. To describe the concept of data quality, standard 25012 defined 15 characteristics, inherent and system dependent. Standard 25024 has defined 63 measures to support the quantification of the level of quality achieved.

Thus, the AI perspective strengthens data and data quality, putting their features at the center of interest, but it also requires the extension of quality characteristics and measures to datasets, adding 9 characteristics and some measures.

For AI systems, the existing qualitative aspects are joined by new qualitative characteristics: representativeness (with respect to the population to which the AI is intended to refer, without discrimination), identifiability or de-identifiability, balance, similarity,
diversity, effectiveness, relevance, timeliness, auditability, as defined in 5259-2 for machine learning. The various aspects mentioned highlight the need for specific legislative governance, processes, coordination of products and technical approaches that include training, management preparation, new processes and dissemination of product quality principles for the end user. Data governance will need to ensure, among other things, transparency, stakeholder collaboration and accountability in the different areas of data management.

Progress must be made quickly due to the pervasive impact of AI, considering data and information as an aspect of knowledge, to be explored and recognized from different points of view:
a) Constraints: laws, legal and technical quality requirements, human rights, ethics, quality models, methodologies, governance, social impacts, monitoring quality in use;
b) Development: management of systems that process input and output data, assessment of data and software quality, taking into account cybersecurity, risk classes and automation;
c) Technologies: cloud, big data, quantum systems, neural networks, language management, generative models, machine learning, robots, devices, IoT, sensors, hardware, etc.
For each of these points of view, it is reiterated that data, information and knowledge are essential topics to achieve trustworthiness in the new technology.

3. Extending SQuaRE to AI

To better understand the scope of the models developed by SC7/WG6 in the ISO/IEC 25000 family [5], widely used in various countries and in Italy, we have to move from SC7 to the context of SC42, which has identified additional quality aspects for datasets. The documents developed by SC42 are not limited to the quality of the data described in 5259-2, but also address the quality of other phases related to machine learning. Data quality is included in a structured and complete view of the standards:
     • 5259-1: overview, terminology, examples
     • 5259-2: data quality measures
     • 5259-3: data quality management requirements and guidelines
     • 5259-4: data quality process framework
     • 5259-5: data quality governance framework
     • CD TR 5259-6: visualization framework for data quality
This paper summarizes in paragraph 3.2 the essential aspects of the standard 5259-2 and its characteristics.

3.1. Data quality

ISO/IEC 25012 defines a quality model for data stored in a structured format within a computer system. It defines fifteen quality characteristics for target data used by humans and systems. It considers all data types (e.g. character strings, texts, dates, numbers, images, sounds, etc.); the scope does not include data produced by embedded devices or real time sensors that are not retained for further processing or historical purposes. The data quality characteristics defined in 25012 are divided into inherent and system dependent. From the inherent point of view, data have the intrinsic potential to satisfy stated and implied needs; from the system dependent point of view, the quality of data depends on the technical domain in which data are used.

The fifteen data quality characteristics defined are classified as follows:
- Inherent: accuracy, completeness, consistency, credibility, currentness;
- Inherent and system dependent: accessibility, compliance, confidentiality, efficiency, precision, traceability, understandability;
- System dependent: availability, portability, recoverability.

ISO/IEC 25024 specifies 63 measures to quantify the data quality level in relation to characteristics. 5259-2 describes 42 measures imported from 25024, some adopted as is, some adapted for artificial intelligence.

3.2. Dataset quality

The new quality characteristics of datasets defined in 5259-2 can be divided according to the following points of view:
Technical: Auditability
Legal: Identifiability
Reality oriented: Balance, Diversity, Effectiveness, Relevance, Representativeness, Similarity, Timeliness

In addition, the aspects of provenance and randomness are considered in 5259-2 as examples of essential factors for preparation of datasets to reinforce the credibility characteristic.

Regarding the quality characteristics of datasets in 5259-2, measures to quantify quality are also reported. Names of the measures described in the standard for each characteristic are related to: audited records, resolution balance, balance of images, label proportion balance, category size diversity, category size effectiveness, label effectiveness, identifiability ratio, relevance data in a
record, representativeness ratio, sample similarity, timeliness of data items.

Regarding the aspects of balance, representativeness and risks of bias (unwanted discrimination of datasets), other examples are also available from literature [6].

For data quality control, measurement is very important to support quality assessment. Measurement is a set of operations having the object of determining a value of a measure. The most relevant measures in the context of data quality are related to the level of quality achieved (threshold), both for individual data and for the group of data in a dataset. The adoption of the data quality model is essential for verification and evaluation. When following some specific steps to design, control and manage data quality, and to develop and use AI systems, it is necessary to take into account a detailed data life cycle for different purposes (e.g. data governance, data acquisition, format normalization, data bias, sources integration, storage, maintenance, decommissioning).

Therefore, entire data quality models can be considered an analytical guide for data preparation and not only a useful instrument for evaluation and assessment of a single product.

While data quality concepts and models currently follow a predominantly deterministic orientation in 5259-2, functional measures to quantify dataset quality levels can be extended in the future to non-deterministic approaches as well, introducing statistical probability methods.

4. Propagation of data quality models

Data quality models can be considered a fundamental topic for AI that links new technology to various existing international standards. For a general operational view we should also consider other standards in addition to those already mentioned, at least the following, which mention SQuaRE standards and data models in their text and in their bibliography: ISO/IEC 42001:2023 on AI management system, ISO/IEC 25059:2023 on quality model for AI systems, ISO/IEC TS 12791:2024 on treatment of unwanted bias, ISO/IEC TR 24027:2021 on bias in AI systems.

ISO/IEC 42001 specifies requirements and provides guidance for establishing, implementing, maintaining and continually improving an AI management system in the context of an organization. This document is intended for use by an organization that provides or uses products or services that use AI systems. The document is intended to help the organization develop, provide or use AI systems responsibly in pursuing its objectives and meeting applicable requirements, obligations related to stakeholders and expectations arising from them, paying attention to data quality in Annex B [7].

ISO/IEC 25059 outlines a quality model for AI systems and is a specific extension of SQuaRE [8]. The characteristics and sub-characteristics detailed in the model provide consistent terminology for specifying, measuring and evaluating AI system quality. They also provide a set of quality characteristics against which the stated quality requirements can be compared for completeness. For example, functional adaptability of systems concerns the degree to which an AI system can accurately learn from data or the outcome of previous actions and use that information in future predictions. Furthermore, the new sub-characteristic robustness, under reliability, is used to describe the ability of a system to maintain its level of functional correctness under any circumstances, including the presence of unseen, distorted, adversarial or invalid data inputs [9].

ISO/IEC TS 12791 describes how to address unwanted bias in AI systems that use machine learning to perform classification and regression tasks. This document provides mitigation techniques that can be applied throughout the life cycle of the AI system in order to treat unwanted bias. This document is applicable to all types and sizes of organizations [10].

ISO/IEC TR 24027 addresses bias in relation to AI systems, particularly with respect to AI-aided decision-making. Measurement techniques and methods for assessing bias are described, with the aim of addressing and treating bias-related vulnerabilities. All phases of the AI system lifecycle are in scope, including, but not limited to, data collection, training, continual learning, design, testing, evaluation and implementation [11].

5. AI in Europe

A global view of data and dataset quality is described in the technical report CEN/CLC TR 18115 “Data Governance and quality for AI within the European context” [12]. The report, developed for information purposes, describes the essential standards for organizations navigating the digital transformation towards AI, considering quality models, best practices, social aspects, ethics, inclusiveness and healthcare. It is addressed to organizations, agencies, enterprises, developers, universities, data scientists and researchers. It cites standards, experiences and best practices, providing an overview of the standards that aim to promote a common language.
The report is consistent with the EU AI Act Regulation [13], which distinguishes harmonized standards, currently under development and therefore compliant with the regulation and prescriptive. All other existing standards, such as those mentioned in this paper, can currently be considered complementary and voluntary. By mapping the connections between articles of the AI Act and all standards, we obtain a useful structured catalog of “best practices” for the community of standardizers, legal experts, stakeholders, and users. The map, developed in the environment of CEN/CLC JTC 21 with the contribution of UNINFO, hosted by the Italian association AI Open Mind, offers a structured view of legislation and technology. It connects concepts with coherent terminology, from a syntactic and semantic point of view, also allowing the possibility to update the SQuaRE quality models [5]. The AI Regulation contains many topics and articles, such as: purpose (art. 1), scope (2), definitions (3), education (4), prohibited AI practices (5), risk management (9), data and governance (10), documentation (11), deployers (13), human oversight (14), quality management (17), obligations of importers and distributors (23-24), authorities (28), conformity in third countries (39), harmonized standards (40), conformity assessment (43), CE marking (48), AI office (64), testing (76), fundamental rights (77), etc.

6. Conclusion

The title of this paper is just the tip of the iceberg. The topic of data and of AI as a whole is huge, and it is evolving rapidly. The adoption of quality models for data and datasets and the measurement activities are important for complete quality assurance, avoiding bias and discrimination and supporting useful AI for an efficient decision support system. In addition, data quality must be considered as a complement to the quality of the software that manages the data, including features that are also important, such as safety and transparency. The propagation of SQuaRE quality models in many standards confirms the value of the 25000 family. Likewise, the adoption of a managed quality system in the organization developing AI systems is becoming mandatory, paving the way for conformity assessment and quality certification. It is important to underline the consistency of standards and compliance with the law, improving the efficiency of AI systems production, respecting ethical aspects, human centric vision and inclusiveness, and confirming trustworthiness.

References
[1]    ISO/IEC 25012:2008 Software engineering - Software product quality Requirements and Evaluation (SQuaRE) - Data quality model
[2]    ISO/IEC 25024:2015 Systems and software engineering - Systems and software quality Requirements and Evaluation (SQuaRE) - Measurement of data quality
[3]    ISO/IEC 5259-2:2024 Artificial Intelligence - Data quality for analytics and machine learning (ML) - Part 2: Data quality measures
[4]    ISO/IEC TS 25058:2024 Systems and software engineering - Systems and software quality Requirements and Evaluation (SQuaRE) - Guidance for quality evaluation of artificial intelligence (AI) systems
[5]    ISO/IEC 25000:2014 Systems and software quality Requirements and Evaluation (SQuaRE) - Guide to SQuaRE (summary www.iso25000.it)
[6]    A. Simonetta, A. Trenta, M.C. Paoletti, A. Vetrò, “Metrics for identifying bias in datasets”, CEUR Workshop Proceedings, 2021
[7]    ISO/IEC 42001:2023 Information technology - Artificial intelligence - Management system
[8]    D. Natale, Extension of ISO/IEC 25000 Quality Models to the Context of Artificial Intelligence, CEUR-WS, 2022
[9]    ISO/IEC 25059:2023 Software engineering - System and software Quality Requirements and Evaluation (SQuaRE) - Quality model for AI systems
[10]   ISO/IEC TS 12791:2024 Information technology - Artificial intelligence - Treatment of unwanted bias in classification and regression machine learning tasks
[11]   ISO/IEC TR 24027:2021 Information technology - Artificial intelligence (AI) - Bias in AI systems and AI aided decision making
[12]   CEN/CLC TR 18115:2024 “Data Governance and quality for AI within the European Context”
[13]   Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024