<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Geometrical approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Trenta</string-name>
          <email>andrea.trenta@dataqualitylab.it</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In a previous paper [13] we discussed the application of ISO/IEC 25000 when new quality measures are defined. In the present paper, some quality issues in A.I. are identified, then known solutions are recalled, and new quality measures for A.I. are proposed.</p>
      </abstract>
      <kwd-group>
        <kwd>eigenvalue</kwd>
        <kwd>A.I.</kwd>
        <kwd>image recognition</kwd>
        <kwd>training</kwd>
        <kwd>ISO/IEC 25024</kwd>
        <kwd>quality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        In this paper, new ISO/IEC 25000 quality measures for
datasets used in some A.I. applications are proposed, based
on [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Furthermore, some considerations are
developed about the possible specification and extension of
the method to any kind of dataset.
      </p>
      <p>
        In this paper, the term A.I. is used for simplicity even
when referring to Machine Learning.
      </p>
      <p>
        Figure 1 Definitions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
      </p>
      <sec id="sec-1-1">
        <title>II. DATA QUALITY ISSUES IN A.I.</title>
        <p>
          Firstly, we consider the A.I. application face
recognition, well known both from the point of view of the
solutions and of the open issues. Among the open issues there is
how to understand whether the training dataset is
“optimal”. To this end, we will explore the measure of the
completeness characteristic [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] of a set of images supposed
to be a training dataset. Note that the proposed measures
correspond neither to a measure of the whole A.I. system's
output results nor to an observation of its behavior, as they are
purely static measures of the input, although they could be used
together with other measures to evaluate the overall system
quality.
        </p>
        <p>
          The basis of our analysis is the calculation of
“eigenfaces” [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], according to the Karhunen-Loève
transformation (PCA), with the following steps:
1. Collect M images of faces with n×n grayscale pixels,
with similar dimension, light conditions, shot, etc.;
2. Transform each image i (i=1,…,M) into an (n²×1) column
vector Γi, obtaining {Γ1, Γ2,… ΓM};
3. Compute the “average face” Ψ = (1/M) ∑ Γi,
subtract Ψ from each image and obtain the new vectors
{Φ1, Φ2,… ΦM};
4. Build the (n²×M) matrix A = [Φ1, Φ2,… ΦM] and
compute the (n²×n²) covariance matrix C = (1/M) ∑ Φi Φiᵀ = AAᵀ;
5. Compute the M eigenvalues λi (i=1,…,M) of the matrix AᵀA
and then the eigenvectors of AAᵀ;
6. Sort the eigenvalues of C in descending order;
7. Choose a number N of eigenvalues, starting from the
biggest, so as to represent 95% of their sum η, and keep
them; the other n²−N eigenvalues are not considered;
8. Represent the images dataset as a linear combination
of the N eigenvectors defined at step 7.
        </p>
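        <p>As a rough illustration, steps 1-8 can be sketched in Python with NumPy; the dataset, sizes, and variable names below are stand-in assumptions, not the paper's actual data:</p>

```python
import numpy as np

# Sketch of steps 1-8 (assumed toy data: M random "images" stand in
# for a real, aligned face dataset).
rng = np.random.default_rng(0)
M, n = 20, 8                                  # M images of n x n grayscale pixels
images = rng.random((M, n, n))

# Step 2: flatten each image into an (n^2 x 1) column vector Gamma_i
Gamma = images.reshape(M, n * n).T            # shape (n^2, M)

# Step 3: "average face" Psi, and centered vectors Phi_i
Psi = Gamma.mean(axis=1, keepdims=True)
A = Gamma - Psi                               # columns are the Phi_i

# Steps 4-5: the eigenvalues of the small (M x M) matrix A^T A equal
# the nonzero eigenvalues of the covariance C = A A^T
small = A.T @ A
eigvals, eigvecs_small = np.linalg.eigh(small)

# Step 6: sort the eigenvalues in descending order
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs_small = eigvecs_small[:, order]

# Step 7: keep the N largest eigenvalues covering 95% of their sum
ratio = np.cumsum(eigvals) / eigvals.sum()
N = int(np.searchsorted(ratio, 0.95) + 1)

# Eigenvectors of A A^T ("eigenfaces") via u_i = A v_i, then normalize
eigenfaces = A @ eigvecs_small[:, :N]
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)

# Step 8: each centered image as a linear combination of the N eigenfaces
weights = eigenfaces.T @ A                    # shape (N, M)
```

The trick of diagonalizing the small AᵀA instead of the huge n²×n² covariance is the one used in [7].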
      </sec>
      <sec id="sec-1-2">
        <title>Further steps are defined in [7], [12], [13] for face recognition, which is out of the scope of this paper</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>III. PROPOSAL</title>
      <p>Intuitively, if we want to measure the completeness1 of
an image dataset, we try to answer questions like:
(a) how many similar images are there?
(b) how strong is the similarity of some images?
Here are the proposed measures for (a) and (b):</p>
      <p>A. As some dimensions are eliminated in step 7
above2, we can measure N/M, the “PCA space dimension
against dataset space dimension”.
B. The more a dataset is orthogonal, the less its images
are similar to each other; as a measure of this, we can
consider the product of the N eigenvalues λ1·λ2·…·λN, that is
also the “determinant of the reduced eigenvalues matrix”,
which in turn is the volume of the hyper-parallelepiped that
this matrix represents.</p>
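      <p>A minimal sketch of measures (A) and (B), assuming the sorted eigenvalues from step 6 are available (the example values below are invented for illustration):</p>

```python
import numpy as np

# Assumed example eigenvalues, already sorted in descending order;
# in practice they come from the PCA steps above.
eigvals = np.array([5.0, 3.0, 1.5, 0.25, 0.25])
M = eigvals.size

# Step 7: keep the N largest eigenvalues covering 95% of their sum
ratio = np.cumsum(eigvals) / eigvals.sum()
N = int(np.searchsorted(ratio, 0.95) + 1)

# Measure (A): PCA space dimension against dataset space dimension
measure_a = N / M

# Measure (B): determinant of the reduced eigenvalues matrix, i.e. the
# product of the N kept eigenvalues (the volume of the hyper-parallelepiped)
measure_b = float(np.prod(eigvals[:N]))
```
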
      <p>To sum up, with this proposal we reframe the issue of
finding an effective data quality (completeness)
measurement function as a well-known geometrical
calculation.</p>
      <sec id="sec-2-1">
        <title>IV. FURTHER STUDIES</title>
        <p>The steps 1-8 above were proven to be effective in face
recognition and are potentially applicable to any dataset. To
do this, the vectorization of step 2 above shall be applied to
any attribute(s) of the dataset: as images were vectorized
pixel by pixel3, a similar operation could be performed e.g.
for char strings, taking into account that possible different
lengths require a further step of normalization.
Further studies are needed to apply the method also to
rotated and translated images, which is the most frequent case
in A.I. applications (fig.3).</p>
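        <p>The string-vectorization idea could be sketched as follows; the code-point encoding and the zero-padding normalization are assumptions for illustration, not a method prescribed here:</p>

```python
import numpy as np

# Sketch of step 2 applied to char strings instead of images:
# encode each character as its code point, and normalize different
# lengths by zero-padding to the longest string (assumed choice).
def vectorize(strings):
    n = max(len(s) for s in strings)
    cols = [[ord(c) for c in s] + [0] * (n - len(s)) for s in strings]
    return np.array(cols, dtype=float).T      # shape (n, M), one column per string

V = vectorize(["anna", "bob", "carol"])       # M = 3 strings, padded to length 5
```
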
        <p>
          Figure 3 Dataset trial MPEG-CDVA (Compact Descriptors for
Video Analysis) – [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
        </p>
        <p>
          For the case of an “unsupervised learning”, care should be
taken in generating the appropriate (i.e. minimum) number
of M images, as M appears to depend on the kind of dataset
(e.g. we expect different M values for rotated or else
non-rotated images, for personal names or else tags, …); so, in
general, having M&gt;N does not mean that the space is
complete, in other words, that every image can be
represented; e.g. there could be a new M+1th face that cannot
be an acceptable4 linear combination of the existing N images
(e.g. a bald face is missing from the dataset of fig.2). If in
this case the new M+1th image is added to the training
dataset, that corresponds to a “reinforced learning”.
Therefore, some distinction should be made between a
machine with an “unsupervised learning” and a machine with
a “reinforced learning” when evaluating the measurement
values (A) and (B) over a training dataset.
        </p>
        <p>
          1 The data quality characteristic “completeness” is intended;
see [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
2 N&lt;M [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]
3 The method is agnostic with respect to the meaning of the images (no
feature extraction and semantic categorization); this is a great
simplification that allows the method to be applied to other kinds of
dataset
        </p>
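        <p>The “acceptable linear combination” check of footnote 4 could be sketched as a projection residual; the threshold, the toy vectors, and the function name are assumed for illustration:</p>

```python
import numpy as np

# Project a new (M+1)th vector onto the N kept eigenvectors and look at
# the residual Euclidean distance; the threshold is an assumed parameter.
def is_representable(eigenfaces, psi, new_image, threshold=1.0):
    phi = new_image - psi                     # center the new vector
    weights = eigenfaces.T @ phi              # projection onto the N eigenvectors
    reconstruction = eigenfaces @ weights
    distance = np.linalg.norm(phi - reconstruction)
    return bool(distance <= threshold)        # 0 distance means an exact copy

# Toy space: 2 orthonormal "eigenfaces" in R^3, zero average face
U = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
psi = np.zeros(3)
in_span = is_representable(U, psi, np.array([0.5, 0.2, 0.0]))   # lies in the span
outside = is_representable(U, psi, np.array([0.0, 0.0, 5.0]))   # far outside it
```
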
        <p>As bias is critical for the quality of a learning dataset, the
measures (A) and (B) are suggested also for bias5
measurement, when bias is defined as the modification of an
ideal, fully orthogonal and normalized dataset6.</p>
      </sec>
      <sec id="sec-2-2">
        <title>V. CONCLUSION</title>
        <p>
          The measures (A) and (B) appear to belong to the data
quality completeness characteristic [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Applying the
process described in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the measures (A) and (B) can be
defined as ISO/IEC 25000-conforming measures; they can also be
considered in the SC7 WG6 and SC42 work in progress on A.I.
        </p>
        <p>
          4 i.e. the projection of the M+1th image in the space of faces has
a Euclidean distance from the other faces below a
threshold; in the case the M+1th image is a copy, there is one
image in the M-dataset from which the Euclidean distance is 0
5 By “bias” it is intended the prevalence of some values in an
attribute (e.g. the “male” value in a “gender” attribute)
6 Further measures could refer to Hilbert space frame theory
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] ISO/IEC 25010:
          <year>2011</year>
          ,
          <article-title>Systems and Software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - System and software quality models</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] ISO/IEC 25012:
          <year>2008</year>
          ,
          <article-title>Systems and Software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Data quality model</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] ISO/IEC 25020:
          <year>2019</year>
          ,
          <article-title>Systems and Software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Quality measurement framework</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] ISO/IEC 25022:
          <year>2016</year>
          ,
          <article-title>Systems and Software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Measurement of quality in use.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] ISO/IEC 25023:
          <year>2016</year>
          ,
          <article-title>Systems and Software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Measurement of system and software product quality</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] ISO/IEC 25024:
          <year>2015</year>
          ,
          <article-title>Systems and Software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Measurement of data quality</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Turk</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pentland</surname>
            ,
            <given-names>A.P.</given-names>
          </string-name>
          :
          <article-title>Face recognition using eigenfaces</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition</source>
          ,
          <year>1991</year>
          .
          <source>Proceedings CVPR'91</source>
          . IEEE Computer Society Conference on. pp.
          <fpage>586</fpage>
          -
          <lpage>591</lpage>
          . IEEE (
          <year>1991</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Natale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trenta</surname>
          </string-name>
          :
          <article-title>Examples of practical use of ISO/IEC 25000</article-title>
          .
          <source>Proceedings APSEC IWESQ 2019 (CEUR-WS.org, ISSN 1613-0073)</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ling-Yu</given-names>
            <surname>Duan</surname>
          </string-name>
          , Vijay Chandrasekhar, Shiqi Wang, Yihang Lou, Jie Lin, Yan Bai, Tiejun Huang, Alex Chichung Kot and
          <string-name>
            <given-names>Wen</given-names>
            <surname>Gao</surname>
          </string-name>
          :
          <article-title>Compact Descriptors for Video Analysis: the Emerging MPEG Standard</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] ISO/IEC 15938-15:
          <year>2019</year>
          ,
          <article-title>Information technology - Multimedia content description interface - Part 15: Compact descriptors for video analysis</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Strohminger</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gray</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chituc</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hener</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schein</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heagins</surname>
            <given-names>T.B.</given-names>
          </string-name>
          :
          <article-title>The MR2: A multi-racial, mega-resolution database of facial stimuli</article-title>
          .
          <source>Behavior Research Methods</source>
          pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Sirovich</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirby</surname>
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Low-dimensional procedure for the characterization of human faces</article-title>
          .
          <source>JOSA A</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <fpage>519</fpage>
          -
          <lpage>524</lpage>
          (
          <year>1987</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Bagli</surname>
            <given-names>M. C.</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Autovettori e riconoscimento facciale</article-title>
          , Università di Bologna, https://amslaurea.unibo.it/12063/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] https://www.fokus.fraunhofer.de/en/fame/workingareas/ai</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>