<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How May Deep Learning Testing Inform Model Generalizability? The Case of Image Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giammaria Giordano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valeria Pontillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giusy Annunziata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Cimino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filomena Ferrucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Palomba</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Software Engineering (SeSa) Lab - Department of Computer Science, University of Salerno</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Artificial intelligence (AI) has become increasingly popular and is used in various fields, particularly image recognition. Several studies use images to train models for self-driving cars, security monitoring systems, signal recognition, etc. However, the approach taken to design and evaluate AI models can significantly affect the resulting performance of the models during operation. Hence, applying a rigorous approach to the design and evaluation of AI models may become crucial: this is the ultimate goal of the research field of Software Engineering for Artificial Intelligence. While the current literature on image recognition proposed AI pipelines achieving good performance, it is still unclear how they would work in a real environment, where additional social and environmental factors come into play. In this paper, we propose a preliminary investigation into the role of input testing as an early indicator of the real-world performance of deep learning models in the context of image recognition. By taking the well-known Fashion-MNIST dataset into account, we first design a Convolutional Neural Network able to recognize images, in an effort of replicating the work done in previous studies and establishing a baseline. Then, we propose the use of input testing to simulate real-case conditions. Our preliminary results show that the devised CNN can lead to precision, recall, F-Measure, and accuracy close to 90%, hence confirming the results of previous experimentation in the field. Nonetheless, when input testing is applied, the performance of the model drastically drops (reaching ≈ 30%), possibly highlighting the need for revisiting image recognition models.</p>
      </abstract>
      <kwd-group>
        <kwd>Empirical Software Engineering</kwd>
        <kwd>Software Engineering for Artificial Intelligence</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Software Engineering for Artificial Intelligence (SE4AI) refers to the use of software engineering principles to manage complex Artificial Intelligence (AI) models in order to rigorously test and ensure their scalability, interoperability, and maintenance over time [1]. Hence, the principles of SE4AI could be applied with the aim of developing effective, efficient, reliable, and sustainable AI models. In the last years, researchers and practitioners have been focusing on object recognition and image classification, developing a large amount of AI systems with good performance [2, 3]. The reason behind this choice is related to the availability of large datasets of images, e.g., the Fashion-MNIST or MNIST datasets, that can be applied in various studies spanning different fields, e.g., from healthcare to self-driving cars [4, 5].</p>
        <p>Unfortunately, all that glitters is not gold: despite the promising results obtained in previous studies, the applicability of these models in a real-world scenario still seems to be quite limited today due to external conditions, e.g., environmental factors, which can render the systems unsuitable. As an example, Beede et al. [6] investigated the prediction correctness of deep learning models for diabetic eye disease with a strong performance in in-vitro experiments. Their results indicated poor performance due to socio-environmental factors that impacted the in-vivo experimentation. This study suggests that an improved assessment of these models would inform the design of effective solutions that may reach good performance when employed in production.</p>
        <p>For this reason, this paper proposes a preliminary investigation into the ecological validity [7] of AI models proposed in the context of image recognition, namely, we aim to understand how generalizable the experimental results previously presented would be in a real-case scenario. More specifically, starting from the Fashion-MNIST dataset, we first built a Convolutional Neural Network (CNN) using software engineering principles, hence conducting in-vitro experimentation in an effort of corroborating previous results and establishing a baseline. Then, we apply input testing [8], with the aim to understand to what extent the training set data fit the AI model, i.e., altering the inputs of the model to simulate an in-vivo experimentation [9].</p>
        <p>SATToSE’23: 15th Seminar Series on Advanced Techniques &amp; Tools for Software Evolution, June 12–14, 2023, Fisciano, Italy. Corresponding author: G. Giordano. giordano@unisa.it (G. Giordano); vpontillo@unisa.it (V. Pontillo); gannunziata@unisa.it (G. Annunziata); a.cimino10@studenti.unisa.it (A. Cimino); ferrucci@unisa.it (F. Ferrucci); fpalomba@unisa.it (F. Palomba).</p>
        <p>On the one hand, our preliminary findings corroborate
the image recognition performance reported in literature
when considering the in-vitro experimentation: indeed,
the engineered CNN developed reached levels of
precision, recall, F-Measure, and accuracy close to 90%. On
the other hand, we discover that the performance of the
same model drastically drops when input testing is
applied, hence suggesting that (1) the currently available
models would not properly work in practice and (2) input
testing may provide insights to machine learning
engineers on the generalizability of the model in practice,
hence possibly informing their design actions.</p>
        <p>Structure of the paper. Section 2 overviews the background and the state of the art by pointing out the main differences between our work and the literature. Section 3 overviews the research questions driving our study and the research method, while Section 4 discusses our preliminary results. Finally, Section 5 summarizes the highlights of this work and outlines our future work.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <p>This section describes the background and the related work that are the foundations of our proposed approach.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Background</title>
          <p>Most of the research conducted on image recognition relied on the so-called Fashion-MNIST dataset.1 This is the reason why the research presented in the remainder of this paper focuses on understanding the performance of a deep learning solution on this dataset. In particular, Fashion-MNIST is a clothes dataset based on the assortment on Zalando’s website, proposed by Xiao et al. [10]. It is considered a benchmark dataset containing images with the following characteristics: (1) all instances are normalized to a dimension of 28x28 pixels; (2) the images are preprocessed and converted into grayscale; and (3) each pixel holds a value ranging from 0 to 255 based on the color intensity. The dataset contains over 70,000 examples of t-shirts, dresses, and so on, split into two sets: the training set, which contains 60,000 images, and the test set, with 10,000 instances. In addition, the dataset is divided into 10 classes, one for each clothes category, e.g., t-shirts, trousers, and pullovers. Figure 1 shows some images from the dataset.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <sec id="sec-2-2-1">
          <title>2.2. Related Work</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <p>In the context of object detection and image classification [11, 12], the Fashion-MNIST dataset appears among the top 10 most used datasets for several purposes, e.g., the investigation of privacy issues [13]. Xiao et al. [10], the authors of the Fashion-MNIST dataset, compared several classifiers, e.g., Decision Tree and Extra Tree Classifier, and they achieved performance on average around 80% in terms of accuracy.</p>
        <p>In 2021, Leithardt [14] performed a comparison between different classification methods, e.g., Support Vector Classification, Linear Support Vector Classification, and various Convolutional Neural Network approaches. The best classification model was CNN-dropout-3, with accuracy above 99%, while the worst model was Gaussian Naive Bayes, with accuracy around 51%. Saquib and Zahra [15] proposed an improvement of the Adam algorithm [16] named Mean-ADAM, meant to reduce the oscillation of the weights (usually considered the general problem that makes the accuracy fall) and to outperform all other adaptive gradient methods until final training. A stochastic optimization algorithm is used, by which the variance of the weights might increase during the optimization; therefore, there is a progressive reduction of the external weight that improves the accuracy, generalizability, and dataset invariance. Their results showed good accuracy (reaching ≈ 90%) for several neural networks, e.g., ResNet, VGGNet, and Inception V1.</p>
        <p>Greeshma et al. [17] presented the classification of the Fashion-MNIST dataset using a Multiclass Support Vector Machine (SVM); their results showed an accuracy above 86%. Similarly, Xhaferra et al. [12] used deep learning models in e-commerce to solve problems related to clothing recognition. The authors developed a neural network with an accuracy of 93.1%. Bhatnagar et al. [18] proposed three different convolutional neural network architectures using batch normalization and residual skip connections, reaching 90% accuracy. Finally, Kayed et al. [19] built upon previous studies and improved CNN performance by leveraging a LeNet-5 architecture. In this way, the authors achieved 98% accuracy.</p>
        <p>While experimenting with multiple shallow and deep learning solutions, most of the studies discussed above reported the models based on Convolutional Neural Networks (CNNs) as the best solutions fitting the problem of image recognition. This aspect informed the design of our experiment, which indeed investigates the in-vitro and in-vivo performance of a CNN model.</p>
        <p>1. The Fashion-MNIST dataset: https://github.com/zalandoresearch/fashion-mnist</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.3. Limitations of the State of the Art</title>
        <p>By analyzing the state of the art, we highlighted a number of challenges for the Software Engineering for Artificial Intelligence (SE4AI) research community. The interested reader might have a full overview of the current challenges in the field through the systematic literature review conducted by Giordano et al. [13].</p>
        <p>First, we observed that previous work assessed the proposed approaches only through in-vitro experimentation, hence investigating the performance of machine and deep learning models in terms of performance indicators computed when running them against datasets using validation strategies such as percentage split or cross-fold validation. On the contrary, to the best of our knowledge, there is no study that attempted to provide indications of the ecological validity of the models.</p>
        <p>In addition, most studies only experimented with the accuracy metric [20], namely the total amount of correct predictions made by a model. However, the use of accuracy can cause multiple biases. In the first place, accuracy does not consider the distribution of the training and test sets. Suppose the training data is significantly different from the test data. In that case, the accuracy metric can be biased due to the learning effect, where the model memorizes the training data instead of learning the true underlying data model. In the second place, although accuracy is one of the most analyzed metrics for understanding the effectiveness of an AI model, it is not an appropriate measure for unbalanced datasets, since it does not distinguish between the numbers of correctly classified examples of different classes, leading to erroneous conclusions [21]. For this reason, it may be more appropriate to consider other evaluation metrics, such as F-Measure and recall, to assess the AI models.</p>
        <p>In this work, we aim at addressing the two limitations above. We indeed devised a baseline CNN model to classify images that we first assessed through multiple performance indicators. Afterward, we experimented with input testing to investigate the potential ecological validity of the model in a real-case scenario.</p>
      </sec>
      <sec id="sec-2-5">
        <title>3. Research Method</title>
        <p>The ultimate goal of this study was to apply input testing methods to verify the behavior of a deep neural network model built in the context of image recognition, with the purpose of analyzing how the model would potentially work in a real-world scenario. The perspective is of both researchers and practitioners; the former are interested in assessing the current state of the art, hence understanding how software engineering practices can assist the development of AI solutions. The latter are interested in evaluating the capabilities of AI models in a real-context scenario. Based on the previous considerations, we ask:</p>
        <p>RQ1. What is the performance of an engineered Convolutional Neural Network when applied for the task of image recognition?</p>
        <p>RQ2. To what extent does the application of input testing methods impact the performance of an engineered Convolutional Neural Network when applied for the task of image recognition?</p>
        <p>Figure 2 shows the research method applied to answer our research questions (Fashion-MNIST, data augmentation, filtering, the CNN, and the custom F-MNIST variants used to answer RQ2). Specifically, to address RQ1, we developed a Convolutional Neural Network (CNN) using Scikit-learn2 and applied it on the Fashion-MNIST dataset. We focused on this dataset because previous studies have shown that Fashion-MNIST is considered one of the most used datasets in image recognition [13]. The popularity of the Fashion-MNIST dataset depends on its features, reported in Section 2.1. We decided to re-evaluate a CNN to assess the state of the art and to avoid possible biases due to different environments, configurations, and library versions. We trained the algorithm applying data augmentation, i.e., a technique that increases the data available by modifying the initial images with filters that change the color palette, permitting us to increase the original dataset size from 60,000 entries to 300,000. Finally, we divided the dataset into 85% for the training set and 15% for the test set. To understand the performance of our model, we evaluated the approach with a number of state-of-the-art metrics, i.e., precision, recall, F-Measure, and accuracy [22].</p>
        <p>Once we had established a baseline, we proceeded with RQ2, where we focused on the potential behavior of the CNN in a real-world context. Specifically, we applied input testing methods [8] to analyze the training data used to train the model, with the aim of identifying potential issues in the training set data. Hence, we created customized instances of the Fashion-MNIST dataset by introducing different noises on the test set data to simulate a real-world scenario, e.g., rain or fog. For our preliminary evaluation, we applied a cut filtering on the Fashion-MNIST dataset to simulate the scenario in which images are not perfectly aligned to the center. Figure 3 shows an example of the application of this filter on the Fashion-MNIST dataset: for each clothes category, the cut was made vertically in the center of the image, so the garment is not fully visible. The application of this filter is useful for simulating low-visibility conditions, e.g., traffic signs that are not fully visible in the context of self-driving cars. We then re-assessed the approach in terms of precision, recall, F-Measure, and accuracy [22].</p>
        <p>2. https://scikit-learn.org/stable/modules/neural_networks_supervised.html</p>
      </sec>
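        <p>The 5x augmentation step described above (growing 60,000 images to 300,000) can be sketched in a few lines of numpy. The paper only states that filters altering the intensity palette were used; the four specific transforms below (brightening, darkening, inversion, additive noise) are illustrative assumptions, not the authors' exact filters.</p>

```python
import numpy as np

def augment_5x(images: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return five palette variants per image, as a hedged sketch of the
    60,000 -> 300,000 augmentation step (the exact filters are unpublished)."""
    rng = np.random.default_rng(seed)
    imgs = images.astype(np.int16)  # widen dtype to avoid uint8 wrap-around
    variants = [
        imgs,                                                        # original
        np.clip(imgs + 40, 0, 255),                                  # brighter palette
        np.clip(imgs - 40, 0, 255),                                  # darker palette
        255 - imgs,                                                  # inverted palette
        np.clip(imgs + rng.integers(-20, 21, imgs.shape), 0, 255),   # intensity noise
    ]
    return np.concatenate(variants).astype(np.uint8)

# Example: 6 stand-in 28x28 images become 30.
batch = np.zeros((6, 28, 28), dtype=np.uint8)
assert augment_5x(batch).shape == (30, 28, 28)
```

        <p>Any set of intensity-preserving transforms with the same fan-out would reproduce the reported dataset growth; only the 5x factor is taken from the paper.</p>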
    </sec>
    <sec id="sec-3">
      <title>4. Preliminary Results</title>
      <sec id="sec-3-1">
        <title>The following sections describe the preliminary results achieved to address the two research questions.</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4.1. RQ1 - Replicating Previous Experiments</title>
      <p>The results of our replication show that the model achieves good performance (above 90%), confirming the results already shown in the literature [11, 12, 14]. Analyzing the other metrics, we can also see that the performance is always very positive (again above 90%) for each epoch and batch size considered. To conclude, our replication found results similar to those reported in previous experiments, hence confirming that a CNN approach can effectively recognize images when applied against the Fashion-MNIST dataset.</p>
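      <p>The evaluation above relies on standard metric definitions. A minimal numpy sketch of accuracy and macro-averaged precision, recall, and F-Measure follows; computing the F-Measure from the macro-averaged precision and recall is one common variant, since the paper does not specify the exact averaging used.</p>

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int = 10) -> dict:
    """Accuracy plus macro-averaged precision, recall, and F-Measure,
    computed per class from true/false positives and false negatives."""
    accuracy = float(np.mean(y_true == y_pred))
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = float(np.mean(precisions))
    recall = float(np.mean(recalls))
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Sanity check: a perfect prediction scores 1.0 on every metric.
perfect = evaluate(np.arange(10), np.arange(10))
assert perfect["accuracy"] == 1.0 and perfect["f_measure"] == 1.0
```

      <p>The same numbers can be obtained from scikit-learn's metrics module with macro averaging; the hand-rolled version is shown only to make the definitions explicit.</p>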
      <p>Key findings of RQ1.</p>
      <sec id="sec-4-1">
        <p>Our replication study corroborates previous findings in the field of image recognition through AI. The performance of the CNN model is over 90% in terms of accuracy, F-Measure, recall, and precision.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4.2. RQ2 - On the Impact of Input Testing</title>
      <p>Table 2 reports the results for the CNN model after the application of the cut filtering. In this case, we can observe a severe decrease in all metrics, especially for the F-Measure, which reached no more than 27% against the previous 93%.</p>
      <p>In addition, the other metrics, i.e., precision, recall, and accuracy, do not reach values above 40%. The only classes in which precision achieves good performance (above 70%) are sandal, sneaker, and bag; this could happen because, although these elements were cut in half, they still preserved features that make them distinguishable. These results suggest that when the model cannot consider the entire image of a garment, it may suffer a large loss of information, e.g., on the shape, which leads to lower performance. While further analysis is required to understand how usable and generalizable deep learning models are in a real-world context, our findings suggest that (1) existing models would not properly work in conditions where the images are not perfectly passed as input; and (2) input testing seems to be a valid instrument to establish the performance of AI models, possibly informing machine learning engineers and data scientists on the need for taking further actions.</p>
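      <p>The cut filtering whose impact is discussed above can be sketched in a few lines of numpy. Blanking the right half of each image and using zero as the blanking value are illustrative assumptions; the paper describes a vertical cut at the center but does not publish the filter implementation.</p>

```python
import numpy as np

def cut_filter(images: np.ndarray) -> np.ndarray:
    """Cut each 28x28 image vertically at the center so the garment is no
    longer fully visible (hedged sketch: which half is removed and the
    blanking value are assumptions)."""
    cut = images.copy()
    w = images.shape[-1]
    cut[..., w // 2:] = 0  # blank everything right of the vertical center line
    return cut

# Example: a uniform stand-in image keeps its left half and loses its right half.
imgs = np.full((2, 28, 28), 200, dtype=np.uint8)
cut = cut_filter(imgs)
assert (cut[..., :14] == 200).all() and (cut[..., 14:] == 0).all()
```

      <p>Feeding such filtered copies of the test set to the trained model, and re-running the same metric computation as in RQ1, is what produces the drop reported above.</p>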
      <p>Key findings of RQ2.</p>
      <p>Our preliminary results indicated that the application of input testing methods causes the performance of the CNN to decrease by up to 60 percentage points with respect to what is reported in the literature. The overall performance ranged, indeed, between 19% and 33% in terms of precision, recall, F-Measure, and accuracy.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>This paper provided a preliminary analysis of how existing deep learning solutions work when they are experimented in seemingly real conditions through the application of input testing methods. We first replicated the design of a previously defined CNN model in the context of image recognition, finding similar performance as the one reported in the literature. Afterward, we re-assessed the performance of the model after the application of input testing methods, discovering a notable drop in all the performance indicators measured; only the sandal, sneaker, and bag classes kept precision above 70%.</p>
      <p>Figure 5: Confusion matrix of the model after applying the cut filter to the Fashion-MNIST dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <p>Fabio is partially funded by the Swiss National Science Foundation through SNF Project No. PZ00P2_186090.</p>
        <p>This work has been partially supported by the Qual-AI, Fringe, and EMELIOT national research projects, which have been funded by the MUR under the PRIN 2022, PRIN PNRR 2022, and PRIN 2020 programs, respectively (Contracts 2022B3BP5S, P2022553SL, 2020W3A5FY).</p>
        <p>The reported results might open some discussion on the validation procedures to adopt when experimenting with AI solutions, possibly paving the way to new methodologies and standards to address the performance of those models. At the same time, our findings suggest that the research conducted in the field of image recognition through AI might be worth revisiting to properly understand the actual soundness of those techniques in practice.</p>
        <p>Our future research agenda includes an extension of this work, in which we aim to assess the performance of CNN-based models when considering a larger variety of input testing methods and in-vivo scenarios. Furthermore, we aim to experiment with additional use cases, like the models employed in the context of self-driving cars, security assessment, and others.</p>
        <p>[17] K. Greeshma, K. Sreekumar, Fashion-MNIST classification based on HOG feature descriptor using SVM, International Journal of Innovative Technology and Exploring Engineering 8 (2019) 960–962.</p>
        <p>[18] S. Bhatnagar, D. Ghosal, M. H. Kolekar, Classification of fashion article images using convolutional neural networks, in: 2017 Fourth International Conference on Image Information Processing (ICIIP), IEEE, 2017, pp. 1–6.</p>
        <p>[19] M. Kayed, A. Anter, H. Mohamed, Classification of garments from Fashion-MNIST dataset using CNN LeNet-5 architecture, in: 2020 International Conference on Innovative Trends in Communication and Computer Engineering (ITCE), IEEE, 2020, pp. 238–243.</p>
        <p>[20] Y. Lu, X. Huang, K. Zhang, S. Maharjan, Y. Zhang, Communication-efficient federated learning and permissioned blockchain for digital twin edge networks, IEEE Internet of Things Journal 8 (2021) 2276–2288. doi:10.1109/JIOT.2020.3015772.</p>
        <p>[21] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (2012) 463–484. doi:10.1109/TSMCC.2011.2161285.</p>
        <p>[22] R. Baeza-Yates, B. Ribeiro-Neto, et al., Modern Information Retrieval, volume 463, ACM Press, New York, 1999.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>