<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Data quality certification using iso/iec Journal on Advanced Science</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/ICCSCE.2014.7072735</article-id>
      <title-group>
        <article-title>Completeness for the Prediction of Discrimination</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Simonetta</string-name>
          <email>alessandro.simonetta@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tsuyoshi Nakajima</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Cristina Paoletti</string-name>
          <email>mariacristina.paoletti@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Venticinque</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering Shibaura Institute of Technology</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Enterprise Engineering University of Rome Tor Vergata</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Naples</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>3114</volume>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Data has assumed increasing importance within the global economy, and its use is becoming more pervasive in multiple contexts. However, learning systems are exposed to various critical issues that can be addressed through ISO standards. Indeed, machine learning (ML) models may be exposed to the risk of perpetuating societal prejudice simply because the same bias exists in the data. Based on these notions, we have built a model to identify similar treatment groups based on the type of classification errors made by ML algorithms. A way to calculate fairness indices on the protected attributes of the dataset is illustrated in the article. Finally, we consider the degree of relationship between maximum completeness and fairness of forecasting algorithms through an inverse procedure of constructing a complete dataset. The use of mutual information provided an alternative method for calculating synthetic fairness indices and a useful basis for future research.</p>
      </abstract>
      <kwd-group>
        <kwd>fairness</kwd>
        <kwd>machine learning</kwd>
        <kwd>maximum completeness</kwd>
        <kwd>treatment similarity</kwd>
        <kwd>mutual information</kwd>
        <kwd>entropy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Data has become increasingly important within the
global economy, and its use, which often occurs through [...],
is increasingly pervasive in many areas.</p>
      <p>The Economist [1] was one of the first to describe data as the
oil of the modern age. With the rise of Artificial
Intelligence (AI) algorithms in decision support, data quality
has become ever more important; Forbes [2] accordingly
points to data as the fuel of ML algorithms. Consequently,
a new business has emerged based on the collection and
sale of data. Web giants such as Google offer free services and
products with the aim of collecting information, and
companies earn considerable sums from selling that
information rather than from paid services. This has
pushed these companies to use increasingly sophisticated
technologies [3] and algorithms to collect information
and integrate it with information from other data sources to
maximize their insights. In addition, as presented in the
documentary "The Social Dilemma" [4], information about
users, including contacts and interactions on platforms
[...] [8]. It is worth mentioning that the General Data
Protection Regulation (GDPR) 2016/679 [9], defined to
harmonize data privacy laws among European
countries, also contains data quality notions such as accuracy,
timeliness and security. The same can be found in the
European regulation Solvency II [10], which states the
need for insurance companies to have internal procedures
and processes in place to ensure the appropriateness of the data.</p>
      <p>As we mentioned in [11], we believe that a good
solution to ensure the correct use of data and their
quality according to regulations and ethical values is
compliance with ISO standards: ISO/IEC 27000 [12], ISO
31000 [13] and ISO/IEC 25000 [14]. The introduction of
maximum completeness, as a dataset balance index, and
its relation to fairness metrics stem from the
SQuaRE approach to measuring data quality and
assessing its implications.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Present Situation</title>
      <p>Although it is difficult to estimate the cost of poor data quality, a primary goal for organizations (public and private) that base their business on the digitization of processes and of the operation of the organization itself is to have trusted data [15]. Some experiences show that applying the SQuaRE series is a solution for measuring and monitoring data quality over time. In Italy, the first indication to public administrations managing databases of national interest came in 2013, when the Agency for Digital Italy (AgID) identified the ISO/IEC 25012 standard as the data quality model to be adopted [16]. Already in 2013, AgID had identified, among the 15 quality characteristics, those that must be used (accuracy, consistency, completeness, and currentness) for databases of national interest. In the three-year plan for public administration information technology 2021-2023 [17], AgID confirms increasing data and metadata quality as a strategic goal (OB2.2).</p>
      <p>Three case studies of data quality evaluation and certification for repositories are reported in [18]. The different perspectives are analyzed to evaluate the impact of adopting ISO/IEC 25012, ISO/IEC 25024 and ISO/IEC 25040 and the benefits recognized by the three organizations before and after the process. The results show that applying the methodology helps organizations achieve better long-term sustainability, improve knowledge of the business and steer better data quality initiatives in the future.</p>
      <p>Among the environments in which the above ISO standards can be most useful are undoubtedly those where the information contains sensitive or safety data [19], such as the healthcare and legal domains. An example is the OpenEHR standard proposed in [20]. The issues that affect clinical records from the perspective of data quality are presented in [21]. In [22] the authors propose a generalized model for big data: a solution based on the application of ISO/IEC 25012 and ISO/IEC 25024. The study introduces three data quality dimensions: Contextual Consistency, Operational Consistency and Temporal Consistency. In [11] the authors show how using the SQuaRE series can ensure GDPR compliance. The study in [23] examines discrimination against nonwhite teachers on online English language teaching platforms.</p>
      <p>One possible way to prevent bias in the data from propagating into the inferences of ML algorithms is the dataset labeling mechanism presented in [24]. In [25] the authors present a range of fair access and information presentation (quantity of data presented to the user and order of priority) issues that may affect the fairness of computing systems. Although these issues are related to the biases within the data, characteristics of recommender systems can introduce a greater degree of uncertainty. These are related to the permissions of the users who access the information or to the size of the data that can be processed by the algorithms. This makes it even more difficult to find countermeasures to avoid discrimination.</p>
      <p>Finally, in [26] the authors show a methodology for identifying critical attributes that can lead to discrimination by classification-based learning systems.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Solution Proposed</title>
      <sec id="sec-3-1">
        <p>When using an ML-based recommendation system on a dataset where bias is present, the bias propagates within the model itself, replicating the guesswork and prejudices in the data. We therefore run the risk of believing that we have applied an objective and neutral evaluation system, while we are in fact using a biased system within an AI algorithm.</p>
        <p>One of the purposes of this research is to verify that the system behaves in a non-discriminatory way toward certain groups. By considering the different fairness measures in [27], it is possible to calculate their value with respect to two groups, identified by a protected attribute, to see if there are any disparities in treatment. For example, the formal criterion of Independence requires that the sensitive attribute A be statistically independent of the predicted value R, which can be expressed as:
          <disp-formula>
            <label>(1)</label>
            <tex-math>\forall i, j : i \neq j \quad P(R = 1 \mid A = a_i) = P(R = 1 \mid A = a_j)</tex-math>
          </disp-formula>
        </p>
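        <p>As a minimal sketch of how the Independence criterion can be checked in practice, the Python fragment below estimates P(R = 1 | A = a) for each value of the protected attribute and reports the largest gap between groups. The function names, the toy data and the use of the maximum gap as a summary are illustrative assumptions, not part of the method described in the article.</p>
        <preformat>
```python
from collections import defaultdict


def positive_rates(groups, predictions):
    """Estimate P(R = 1 | A = a) for every observed value a of the protected attribute."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for a, r in zip(groups, predictions):
        counts[a] += 1
        positives[a] += int(r == 1)
    return {a: positives[a] / counts[a] for a in counts}


def independence_gap(groups, predictions):
    """Largest difference between group-wise positive rates (0 means the rates coincide)."""
    rates = positive_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: protected attribute A and predicted label R.
A = ["x", "x", "x", "x", "y", "y", "y", "y"]
R = [1, 1, 1, 0, 1, 0, 0, 0]

rates = positive_rates(A, R)
gap = independence_gap(A, R)
print(rates, gap)
```
        </preformat>
        <p>On this toy data the two rates are 0.75 and 0.25, so the gap is 0.5; a gap close to zero would be consistent with Independence.</p>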
        <p>To understand whether an attribute is a cause of
discrimination in prediction outcomes, that is, whether there
are homologous treatment groups, it is necessary to know
the attribute’s level of fairness. Ideally, therefore, it would
be better to have a single measure that gives an idea of
how likely that attribute is to lead to discrimination.</p>
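        <p>One candidate for such a single attribute-level measure is the mutual information between the protected attribute A and the predicted value R, which the article uses later when it introduces entropy-based indices. The sketch below is an illustrative assumption rather than the authors' implementation: it estimates I(A; R) = H(A) + H(R) - H(A, R) from empirical frequencies, with base-2 logarithms chosen here for convenience and hypothetical function names.</p>
        <preformat>
```python
import math
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a sequence of hashable values."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(a, r):
    """I(A; R) = H(A) + H(R) - H(A, R), estimated from paired observations."""
    joint = list(zip(a, r))
    return entropy(a) + entropy(r) - entropy(joint)


# Hypothetical toy data: the prediction ignores the group in the first case
# and is fully determined by it in the second.
A = ["x", "x", "y", "y"]
mi_independent = mutual_information(A, [0, 1, 0, 1])
mi_dependent = mutual_information(A, [1, 1, 0, 0])
print(mi_independent, mi_dependent)
```
        </preformat>
        <p>A value of zero indicates that, on the sample, the prediction carries no information about the protected attribute; larger values indicate a stronger dependence.</p>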
        <p>In [26], the authors propose a method to compute several synthetic indices related to the fairness of the classification system. Two different methods are described in the article: the first performs clustering with the DBSCAN and K-means methods, while the second, MaxMin, searches for the worst case by dividing the protected attribute instances into privileged and unprivileged. Both methods allow grouping the elements of a protected attribute according to the type of treatment. In this way the calculation of the synthetic index is based on a few influence classes, returning to the definition from which we started [27]. These two approaches were used to test for a link between the notion of maximum completeness and fairness indices. This would allow a priori identification of whether learning on a given dataset can lead to minority discrimination. At this point, alternative methods are proposed to identify homogeneous treatment groups with respect to the result obtained from a classification system. Algorithms may err toward some groups equally; for example, for African-Americans and Native Americans they may predict a degree of recidivism in excess of what happens in reality.</p>
        <p><bold>3.1. Identification of homogeneous treatment groups</bold></p>
        <p>To start, we need to calculate the fairness indices reported in [27] for the protected attributes of the dataset, considering the predictions of the classification algorithm and the actual outcome.</p>
        <p>We referred to a classic case study for this kind of problem, the Compas dataset [7], where we observed a similar trend between groups. Table 1 shows the values of the 6 fairness indices for the protected attribute Race: Independence (Ind), Separation True Positive Rate (SepTPR), Separation False Positive Rate (SepFPR), Sufficiency Positive Predictive Value (SufPPV), Sufficiency Negative Predictive Value (SufNPV) and Overall Accuracy Equality (OAE).</p>
        <p>Table 2 shows the correlation matrix according to Pearson’s coefficient and the existence of correlation between the indices measured for different ethnicities. Considering a correlation threshold of 0.9, it is easy to detect the existence of two treatment groups (Table 3): G0 and G1.</p>
        <p>Although this method works well for the case study, the issue becomes more complicated when there are categorical attributes with higher cardinality, as the number of relations increases. With reference to the Juvenile dataset [28], considering the V3_nacionalitat attribute representing the nationality of the students, it is possible to draw the phenomenon through a subway diagram (Fig. 1). In this graph, it is easier to check intersections between sets. For example, Group 0 and Group 1 have the element Colombia in common. The top histogram shows the number of elements participating in the intersection, while the left histogram shows the number of elements in the group.</p>
        <p>The result obtained with a Pearson coefficient threshold of 0.9 identifies twelve homogeneous treatment groups. In order to reduce their number, we kept as the representative of a set of groups the one that contained them in the inclusion relation. This reduced the twelve to four completely disjoint groups.</p>
        <p>Figure 2: Scatterplot of races in the Compas dataset, mean of fairness metrics vs. maximum completeness.</p>
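        <p>The grouping step can be sketched as follows. The fragment is only an illustration under stated assumptions: the fairness-index vectors are invented, each candidate group is taken to be the set of categories whose vectors correlate with a given category at or above the threshold, and the reduction keeps only the groups that are not contained in another one. The names pearson, treatment_groups and keep_maximal are hypothetical helpers, not functions from the article.</p>
        <preformat>
```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def treatment_groups(indices, threshold=0.9):
    """For each category, collect the categories correlated with it at or above the threshold."""
    names = list(indices)
    groups = set()
    for a in names:
        members = frozenset(
            b for b in names if pearson(indices[a], indices[b]) >= threshold
        )
        groups.add(members)
    return groups


def keep_maximal(groups):
    """Drop every group that is strictly contained in another group."""
    groups = set(groups)
    return {g for g in groups if not any(g != h and g.issubset(h) for h in groups)}


# Hypothetical vectors of the six indices (Ind, SepTPR, SepFPR, SufPPV, SufNPV, OAE).
indices = {
    "g1": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
    "g2": [0.12, 0.21, 0.33, 0.41, 0.52, 0.63],
    "g3": [0.60, 0.50, 0.40, 0.30, 0.20, 0.10],
    "g4": [0.62, 0.49, 0.41, 0.32, 0.19, 0.10],
}
groups = keep_maximal(treatment_groups(indices))
print(sorted(sorted(g) for g in groups))
```
        </preformat>
        <p>With these invented vectors the sketch yields two groups, one containing g1 and g2 and one containing g3 and g4. Note that keeping only maximal sets does not by itself guarantee disjoint groups on arbitrary data.</p>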
        <sec id="sec-3-1-1">
          <title>3.2. Relationship between mean of fairness indexes and maximum completeness</title>
          <p>At this point, we studied whether there was a relationship between the composition of the groups obtained using Pearson’s coefficient and the maximum completeness, as shown in the studies [29], [30]. For this purpose, we used a scatterplot in which each ethnicity was drawn in relation to a pair of values: the mean of the fairness indices, on the abscissa, and the maximum completeness, on the ordinate (Fig. 2). Considering the positioning of the different ethnic groups and a scale that reports the highest value as the limit of the diagram, we observe that they tend to cluster on average around the grand mean of the fairness attributes, most noticeably when we look at the privileged group. Items belonging to the same group tend to remain closer relative to the fairness index. These considerations are less true for the maximum-completeness index.</p>
          <p>After extending the analysis to the different attributes of the datasets already present in [26] [31], maximum completeness seems to be a strongly characterizing parameter, more so than the other indices proposed in [32]. In fact, repeating the analysis on other protected attributes, such as V3_nacionalitat of the Juvenile dataset, the clustering of similarly treated elements within the scatterplot was found to be strongly related not only to the average of the fairness indices, but also to the maximum completeness, as shown in Fig. 3, considering the groups with intersections shown in Fig. 1.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Alternative synthetic indices</title>
        <p>The presence of outliers in the values of fairness indices related to a protected attribute could impact the evaluation of these parameters. For this reason, in this paper we propose a different way of calculating fairness indices. In this research, we calculate independence, separation, sufficiency and OAE using the notions of entropy and mutual information. The idea is to find a new representation of the synthetic index that would allow more confident identification of whether a given protected attribute could lead to possible discrimination. Considering the condition of Independence between a group A = a_i and the remaining instances A ≠ a_i:
          <disp-formula>
            <label>(2)</label>
            <tex-math>\left| P(R = 1 \mid A = a_i) - P(R = 1 \mid A \neq a_i) \right| &lt; \varepsilon</tex-math>
          </disp-formula>
        </p>
        <p>This condition can be extended to all categories of the protected attribute and also expressed by orthogonality between the predicted value R and the group A through mutual information. Given two variables, they are independent if their mutual information is zero:
          <disp-formula>
            <label>(3)</label>
            <tex-math>I(A, R) = H(A) + H(R) - H(A, R) = 0</tex-math>
          </disp-formula>
          H(A) is the entropy associated with A (H(R) is defined analogously) and it is calculated as:
          <disp-formula>
            <label>(4)</label>
            <tex-math>H(A) = -\sum_{i} P(a_i) \log(P(a_i))</tex-math>
          </disp-formula>
          Finally, the third term H(A, R) is computed by:
          <disp-formula>
            <label>(5)</label>
            <tex-math>H(A, R) = -\sum_{i} \sum_{j} P(a_i \cap r_j) \log(P(a_i \cap r_j))</tex-math>
          </disp-formula>
        </p>
        <p>The other indices can also be expressed through mutual information; in particular, referring to [33] and [26], and denoting the actual outcome by Y, they are as follows. Separation is calculated by:
          <disp-formula>
            <label>(6)</label>
            <tex-math>I(R, A \mid Y) = H(R, Y) + H(A, Y) - H(R, A, Y) - H(Y)</tex-math>
          </disp-formula>
          Sufficiency is expressed by the following equation:
          <disp-formula>
            <label>(7)</label>
            <tex-math>I(Y, A \mid R) = H(Y, R) + H(A, R) - H(Y, A, R) - H(R)</tex-math>
          </disp-formula>
          Finally, the OAE is computed by:
          <disp-formula>
            <label>(8)</label>
            <tex-math>I(R, A \mid Y = R) = H(R, Y = R) + \dots</tex-math>
          </disp-formula>
        </p>
        <p>In this study, we compared the three methodologies with respect to the fairness measures; without loss of generality, we report only the relationship between the Independence measure and maximum completeness. In Fig. 4 the dependence curve for the MaxMin methodology is shown in red, that for DBSCAN in blue and that for mutual information in black. The graph in Fig. 4 shows the trend of independence as maximum completeness varies. The dataset construction process initially selects a few tuples of the original dataset (maximum completeness of 0.324) and then inserts new tuples until the dataset reaches overall completeness (maximum completeness of 1), which corresponds to maximum independence.</p>
        <p>The curve for the MaxMin method initially takes on greater values than the other two methods, while the difference decreases as the number of records entered increases. Thus, we can conclude from the present research that the independence measure is more sensitive to varying maximum completeness if the MaxMin method is used.</p>
        <sec id="sec-3-2-1">
          <title>3.4. Limits and Future Work</title>
          <p>This work identified homologous treatment groups using Pearson’s coefficient, which detects the correlation between fairness characteristics associated with different groups.</p>
          <p>In the future, further research should investigate new similarity mechanisms based on ML and Deep Learning algorithms, considering other clustering methodologies that can avoid overlap between groups.</p>
          <p>A second line of research will aim to identify discrimination caused by belonging to more than one protected attribute, such as gender and race simultaneously.</p>
          <p>Since we did not consider explainable AI algorithms, future work could be extended by considering frameworks that analyze how AI models make decisions (e.g. Watson OpenScale [34]).</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The use of AI and ML in the decision-making process of many recommendation systems makes it possible to mitigate the risk of subjective classifications.</p>
      <p>While these systems are reliable forecasting tools, they do not always allow for an explanation of why such conclusions were reached. Thus, the presence of incomplete or unbalanced data, which can be measured through the SQuaRE series (completeness measures), can lead to biased results.</p>
      <p>This work made it possible to identify groups that are similar in terms of equivalence of treatment through the application of Pearson’s coefficient to synthetic indices related to protected attributes. In this way, we identified similarities that previously remained hidden in the search for possible discrimination.</p>
      <p>The other achievement was that we were able to associate fairness measures with protected attributes, independently of those of individual values, using the concepts of mutual information and entropy. This approach laid the foundation for new experimentation relating the response of these measures to changes in maximum completeness.</p>
      <p>Finally, we compared the classical approaches [31] with the method using mutual information and entropy. In this way, we tested the response of fairness measures against maximum completeness and found confirmation of the premises of the work, namely, that poor quality in the data leads to unfair treatment if AI and ML are used in the decision-making process of recommender systems.</p>
      <p>
software bias, CEUR Workshop Proceedings (2021) pp. 17–22.</p>
      <p>[31] A. Vetrò, M. Torchiano, M. Mecati, A data quality approach to the identification of discrimination risk in automated decision making systems, Government Information Quarterly 38 (2021) 101619. doi:10.1016/j.giq.2021.101619.</p>
      <p>[32] A. Simonetta, M. C. Paoletti, M. Muratore, A new approach for designing of computer architectures using multi-value logic, International Journal on Advanced Science, Engineering and Information Technology 11 (2021) 1440–1446. doi:10.18517/ijaseit.11.4.15778.</p>
      <p>[33] D. Steinberg, A. Reid, S. O’Callaghan, F. Lattimore, L. McCalman, T. S. Caetano, Fast fair regression via efficient approximations of mutual information, CoRR abs/2002.06200 (2020). URL: https://arxiv.org/abs/2002.06200.</p>
      <p>[34] IBM, Watson openscale, 2022. URL: https://www.ibm.com/it-it/cloud/watson-openscale/drift (Accessed 10-22).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>The</given-names>
            <surname>Economist</surname>
          </string-name>
          ,
          <article-title>The world's most valuable resource is no longer oil, but data</article-title>
          ,
          <source>The Economist</source>
          , USA (6th May
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Marr</surname>
          </string-name>
          ,
          <article-title>The 5 biggest data science trends in 2022</article-title>
          , Oct 2021. URL: https://www.forbes.com/sites/bernardmarr/2021/10/04/the-5-biggest-data-science-trends-in-2022/?sh=22f5fc1d40d3.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Giuliano</surname>
          </string-name>
          ,
          <article-title>The next generation network in 2030: Applications, services, and enabling technologies</article-title>
          ,
          <source>in: 2021 8th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>298</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.23919/EECSI53397.2021.9624241</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Orlowski</surname>
          </string-name>
          ,
          <article-title>The social dilemma</article-title>
          ,
          <source>Sep</source>
          .
          <year>2020</year>
          . URL: https://www.netflix.com/it/title/81254224.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Cardarilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Di</given-names>
            <surname>Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fazzolari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Giardino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Re</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ricci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Spanò</surname>
          </string-name>
          ,
          <article-title>An fpga-based multi-agent reinforcement learning timing synchronizer</article-title>
          ,
          <source>Computers and Electrical Engineering</source>
          <volume>99</volume>
          (
          <year>2022</year>
          )
          <elocation-id>107749</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.compeleceng.2022.107749</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G. C.</given-names>
            <surname>Cardarilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Re</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Di</given-names>
            <surname>Nunzio</surname>
          </string-name>
          ,
          <article-title>A pseudo-softmax function for hardware-based high speed image classification</article-title>
          ,
          <source>Scientific Reports</source>
          (
          <year>2021</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1038/s41598-021-94691-7</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mattu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kirchner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Angwin</surname>
          </string-name>
          ,
          <article-title>Compas recidivism dataset</article-title>
          ,
          <year>2016</year>
          . URL: https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>[8] Council of Europe, Recommendation CM/Rec(</article-title>
          <year>2020</year>
          )
          <article-title>1 of the Committee of Minis-</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>