<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How May Deep Learning Testing Inform Model Generalizability? The Case of Image Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giammaria Giordano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valeria Pontillo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giusy Annunziata</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Cimino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filomena Ferrucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Palomba</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Software Engineering (SeSa) Lab - Department of Computer Science, University of Salerno</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Artificial intelligence (AI) has become increasingly popular and is used in various fields, particularly image recognition. Several studies use images to train models for self-driving cars, security monitoring systems, signal recognition, etc. However, the approach taken to design and evaluate AI models can significantly affect the resulting performance of the models during operation. Hence, applying a rigorous approach to the design and evaluation of AI models may become crucial: this is the ultimate goal of the research field of Software Engineering for Artificial Intelligence. While the current literature on image recognition proposed AI pipelines achieving good performance, it is still unclear how they would work in a real environment, where additional social and environmental factors come into play. In this paper, we propose a preliminary investigation into the role of input testing as an early indicator of the real-world performance of deep learning models in the context of image recognition. By taking the well-known Fashion-MNIST dataset into account, we first design a Convolutional Neural Network able to recognize images, in an effort of replicating the work done in previous studies and establishing a baseline. Then, we propose the use of input testing to simulate real-case conditions. Our preliminary results show that the devised CNN can lead to precision, recall, F-Measure, and accuracy close to 90%, hence confirming the results of previous experimentation in the field. Nonetheless, when input testing is applied, the performance of the model drastically drops (reaching ≈ 30%), possibly highlighting the need for revisiting image recognition models.</p>
      </abstract>
      <kwd-group>
        <kwd>Empirical Software Engineering</kwd>
        <kwd>Software Engineering for Artificial Intelligence</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Software Engineering for Artificial Intelligence (SE4AI) refers to the use of software engineering principles to manage complex Artificial Intelligence (AI) models in order to rigorously test and ensure their scalability, interoperability, and maintenance over time [1]. Hence, the principles of SE4AI could be applied with the aim of developing effective, efficient, reliable, and sustainable AI models. In the last years, researchers and practitioners have been focusing on object recognition and image classification, developing a large amount of AI systems with good performance [2, 3]. The reason behind this choice is related to the availability of large datasets of images, e.g., the Fashion-MNIST or MNIST datasets, that can be applied in various studies spanning different fields, e.g., from healthcare to self-driving cars [4, 5].</p>
        <p>Unfortunately, all that glitters is not gold: despite the promising results obtained in previous studies, the applicability of these models in a real-world scenario still seems to be quite limited today due to external conditions, e.g., environmental factors, which can render the systems unsuitable. As an example, Beede et al. [6] investigated the prediction correctness of deep learning models for diabetic eye disease with a strong performance in in-vitro experiments. Their results indicated poor performance due to socio-environmental factors that impacted the in-vivo experimentation. This study suggests that an improved assessment of these models would inform the design of effective solutions that may reach good performance when employed in production.</p>
        <p>For this reason, this paper proposes a preliminary investigation into the ecological validity [7] of AI models proposed in the context of image recognition, namely, we aim to understand how generalizable the experimental results previously presented would be in a real-case scenario. More specifically, starting from the Fashion-MNIST dataset, we first built a Convolutional Neural Network (CNN) using software engineering principles, hence conducting in-vitro experimentation in an effort of corroborating previous results and establishing a baseline. Then, we apply input testing [8], with the aim to understand to what extent the training set data fit the AI model, i.e., altering the inputs of the model to simulate an in-vivo experimentation [9].</p>
        <p>SATToSE’23: 15th Seminar Series on Advanced Techniques &amp; Tools for Software Evolution, June 12–14, 2023, Fisciano, Italy. Corresponding author: G. Giordano. giordano@unisa.it (G. Giordano); vpontillo@unisa.it (V. Pontillo); gannunziata@unisa.it (G. Annunziata); a.cimino10@studenti.unisa.it (A. Cimino); ferrucci@unisa.it (F. Ferrucci); fpalomba@unisa.it (F. Palomba).</p>
        <p>On the one hand, our preliminary findings corroborate
the image recognition performance reported in literature
when considering the in-vitro experimentation: indeed,
the engineered CNN developed reached levels of
precision, recall, F-Measure, and accuracy close to 90%. On
the other hand, we discover that the performance of the
same model drastically drops when input testing is
applied, hence suggesting that (1) the currently available
models would not properly work in practice and (2) input
testing may provide insights to machine learning
engineers on the generalizability of the model in practice,
hence possibly informing their design actions.</p>
        <p>Structure of the paper. Section 2 overviews the background and the state of the art by pointing out the main differences between our work and the literature. Section 3 overviews the research questions driving our study and the research method, while Section 4 discusses our preliminary results. Finally, Section 5 summarizes the highlights of this work and outlines our future work.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <p>This section describes the background and the related work that are the foundations of our proposed approach.</p>
        <sec id="sec-2-1-1">
          <title>2.1. Background</title>
          <p>Most of the research conducted on image recognition relied on the so-called Fashion-MNIST dataset.1 This is the reason why the research presented in the remainder of this paper focuses on understanding the performance of a deep learning solution on this dataset. In particular, Fashion-MNIST is a clothes dataset based on the assortment on Zalando’s website, proposed by Xiao et al. [10]. It is considered a benchmark dataset containing images with the following characteristics: (1) all instances are normalized to a dimension of 28x28 pixels; (2) the images are preprocessed and converted into grayscale; and (3) each pixel holds a value ranging from 0 to 255 based on the color intensity. The dataset contains over 70,000 examples of t-shirts, dresses, and so on, split into two sets: the training set, which contains 60,000 images, and the test set, with 10,000 instances. In addition, the dataset is divided into 10 classes, one for each clothes category, e.g., t-shirts, trousers, and pullovers. Figure 1 shows some images from the dataset.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <sec id="sec-2-2-1">
          <title>2.2. Related Work</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <p>In the context of object detection and image classification [11, 12], the Fashion-MNIST dataset appears among the top 10 most used datasets for several purposes, e.g., the investigation of privacy issues [13]. Xiao et al. [10], the authors of the Fashion-MNIST dataset, compared several classifiers, e.g., Decision Tree and Extra Tree Classifier, and they achieved performance on average around 80% in terms of accuracy.</p>
        <p>In 2021, Leithardt [14] performed a comparison between different classification methods, e.g., Support Vector Classification, Linear Support Vector Classification, and various Convolutional Neural Network approaches. The best classification model was CNN-dropout-3, with accuracy above 99%, while the worst model was Gaussian Naive Bayes, with accuracy around 51%. Saquib and Zahra [15] proposed an improvement of the Adam algorithm [16] named Mean-ADAM, meant to reduce the oscillation of the weights (usually considered the general problem that makes the accuracy fall) and to outperform all other adaptive gradient methods until final training. A stochastic optimization algorithm is used, by which the variance of the weights might increase during the optimization; therefore, there is a progressive reduction of the external weight that improves the accuracy, generalizability, and dataset invariance. Their results showed good accuracy (reaching ≈ 90%) for several neural networks, e.g., ResNet, VGGNet, and Inception V1.</p>
        <p>Greeshma et al. [17] presented the classification of the Fashion-MNIST dataset using a Multiclass Support Vector Machine (SVM); their results showed an accuracy above 86%. Similarly, Xhaferra et al. [12] used deep learning models in e-commerce to solve problems related to clothing recognition. The authors developed a neural network with an accuracy of 93.1%. Bhatnagar et al. [18] proposed three different convolutional neural network architectures using batch normalization and residual skip connections, reaching 90% accuracy. Finally, Kayed et al. [19] built upon previous studies and improved CNN performance by leveraging a LeNet-5 architecture. In this way, the authors achieved 98% accuracy.</p>
        <p>While experimenting with multiple shallow and deep learning solutions, most of the studies discussed above reported the models based on Convolutional Neural Networks (CNNs) as the best solutions fitting the problem of image recognition. This aspect informed the design of our experiment, which indeed investigates the in-vitro and in-vivo performance of a CNN model.</p>
        <p>1. The Fashion-MNIST dataset: https://github.com/zalandoresearch/fashion-mnist</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.3. Limitations of the State of the Art</title>
        <p>By analyzing the state of the art, we highlighted a number of challenges for the Software Engineering for Artificial Intelligence (SE4AI) research community. The interested reader might have a full overview of the current challenges in the field through the systematic literature review conducted by Giordano et al. [13].</p>
        <p>First, we observed that previous work assessed the proposed approaches only through in-vitro experimentation, hence investigating the performance of machine and deep learning models in terms of performance indicators computed when running them against datasets using validation strategies such as percentage split or cross-fold validation. On the contrary, to the best of our knowledge, there is no study that attempted to provide indications of the ecological validity of the models.</p>
        <p>In addition, most studies only experimented with the accuracy metric [20], namely the total amount of correct predictions made by a model. However, the use of accuracy can cause multiple biases. In the first place, accuracy does not consider the distribution of the training and test sets. Suppose the training data is significantly different from the test data. In that case, the accuracy metric can be biased due to the learning effect, where the model memorizes the training data instead of learning the true underlying data model. In the second place, although accuracy is one of the most analyzed metrics for understanding the effectiveness of an AI model, it is not an appropriate measure for unbalanced datasets, since it does not distinguish between the numbers of correctly classified examples of different classes, leading to erroneous conclusions [21]. For this reason, it may be more appropriate to consider other evaluation metrics, such as F-Measure and recall, to assess the AI models.</p>
        <p>In this work, we aim at addressing the two limitations above. We indeed devised a baseline CNN model to classify images that we first assessed through multiple performance indicators. Afterward, we experimented with input testing to investigate the potential ecological validity of the model in a real-case scenario.</p>
      </sec>
      <sec id="sec-2-5">
        <title>3. Research Method</title>
        <p>The ultimate goal of this study was to apply input testing methods to verify the behavior of a deep neural network model built in the context of image recognition, with the purpose of analyzing how the model would potentially work in a real-world scenario. The perspective is of both researchers and practitioners; the former are interested in assessing the current state of the art, hence understanding how software engineering practices can assist the development of AI solutions. The latter are interested in evaluating the capabilities of AI models in a real-context scenario. Based on the previous considerations, we ask:</p>
        <p>RQ1. What is the performance of an engineered Convolutional Neural Network when applied for the task of image recognition?</p>
        <p>RQ2. To what extent does the application of input testing methods impact the performance of an engineered Convolutional Neural Network when applied for the task of image recognition?</p>
        <p>Figure 2 shows the research method applied to answer our research questions (Fashion-MNIST, data augmentation, filtering, the CNN, and the custom F-MNIST variants used to answer RQ2). Specifically, to address RQ1, we developed a Convolutional Neural Network (CNN) using Scikit-learn2 and applied it on the Fashion-MNIST dataset. We focused on this dataset because previous studies have shown that Fashion-MNIST is considered one of the most used datasets in image recognition [13]. The popularity of the Fashion-MNIST dataset depends on its features, reported in Section 2.1. We decided to re-evaluate a CNN to assess the state of the art and to avoid possible biases due to different environments, configurations, and library versions. We trained the algorithm applying data augmentation, i.e., a technique that increases the data available by modifying the initial images with filters that change the color palette, permitting us to increase the original dataset size from 60,000 entries to 300,000. Finally, we divided the dataset into 85% for the training set and 15% for the test set. To understand the performance of our model, we evaluated the approach with a number of state-of-the-art metrics, i.e., precision, recall, F-Measure, and accuracy [22].</p>
        <p>Once we had established a baseline, we proceeded with RQ2, where we focused on the potential behavior of the CNN in a real-world context. Specifically, we applied input testing methods [8] to analyze the training data used to train the model, with the aim of identifying potential issues in the training set data. Hence, we created customized instances of the Fashion-MNIST dataset by introducing different noises on the test set data to simulate a real-world scenario, e.g., rain or fog. For our preliminary evaluation, we applied a cut filtering on the Fashion-MNIST dataset to simulate the scenario in which images are not perfectly aligned to the center. Figure 3 shows an example of the application of this filter on the Fashion-MNIST dataset: for each clothes category, the cut was made vertically in the center of the image, so the garment is not fully visible. The application of this filter is useful for simulating low-visibility conditions, e.g., traffic signs that are not fully visible in the context of self-driving cars. We then re-assessed the approach in terms of precision, recall, F-Measure, and accuracy [22].</p>
        <p>2. https://scikit-learn.org/stable/modules/neural_networks_supervised.html</p>
      </sec>
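        <p>The 5x augmentation step described above (growing 60,000 images to 300,000) can be sketched in a few lines of numpy. The paper only states that filters altering the intensity palette were used; the four specific transforms below (brightening, darkening, inversion, additive noise) are illustrative assumptions, not the authors' exact filters.</p>

```python
import numpy as np

def augment_5x(images: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return five palette variants per image, as a hedged sketch of the
    60,000 -> 300,000 augmentation step (the exact filters are unpublished)."""
    rng = np.random.default_rng(seed)
    imgs = images.astype(np.int16)  # widen dtype to avoid uint8 wrap-around
    variants = [
        imgs,                                                        # original
        np.clip(imgs + 40, 0, 255),                                  # brighter palette
        np.clip(imgs - 40, 0, 255),                                  # darker palette
        255 - imgs,                                                  # inverted palette
        np.clip(imgs + rng.integers(-20, 21, imgs.shape), 0, 255),   # intensity noise
    ]
    return np.concatenate(variants).astype(np.uint8)

# Example: 6 stand-in 28x28 images become 30.
batch = np.zeros((6, 28, 28), dtype=np.uint8)
assert augment_5x(batch).shape == (30, 28, 28)
```

        <p>Any set of intensity-preserving transforms with the same fan-out would reproduce the reported dataset growth; only the 5x factor is taken from the paper.</p>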
    </sec>
    <sec id="sec-3">
      <title>4. Preliminary Results</title>
      <sec id="sec-3-1">
        <title>The following sections describe the preliminary results achieved to address the two research questions.</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4.1. RQ1 - Replicating Previous Experiments</title>
      <p>The results of our replication show that the model achieves good performance (above 90%), confirming the results already shown in the literature [11, 12, 14]. Analyzing the other metrics, we can also see that the performance is always very positive (again above 90%) for each epoch and batch size considered. To conclude, our replication found results similar to those reported in previous experiments, hence confirming that a CNN approach can effectively recognize images when applied against the Fashion-MNIST dataset.</p>
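      <p>The evaluation above relies on standard metric definitions. A minimal numpy sketch of accuracy and macro-averaged precision, recall, and F-Measure follows; computing the F-Measure from the macro-averaged precision and recall is one common variant, since the paper does not specify the exact averaging used.</p>

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int = 10) -> dict:
    """Accuracy plus macro-averaged precision, recall, and F-Measure,
    computed per class from true/false positives and false negatives."""
    accuracy = float(np.mean(y_true == y_pred))
    precisions, recalls = [], []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = float(np.mean(precisions))
    recall = float(np.mean(recalls))
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Sanity check: a perfect prediction scores 1.0 on every metric.
perfect = evaluate(np.arange(10), np.arange(10))
assert perfect["accuracy"] == 1.0 and perfect["f_measure"] == 1.0
```

      <p>The same numbers can be obtained from scikit-learn's metrics module with macro averaging; the hand-rolled version is shown only to make the definitions explicit.</p>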
      <p>Key findings of RQ1.</p>
      <sec id="sec-4-1">
        <p>Our replication study corroborates previous findings in the field of image recognition through AI. The performance of the CNN model is over 90% in terms of accuracy, F-Measure, recall, and precision.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4.2. RQ2 - On the Impact of Input Testing</title>
      <p>Table 2 reports the results for the CNN model after the application of the cut filtering. In this case, we can observe a severe decrease in all metrics, especially for the F-Measure, which reached no more than 27% against the previous 93%.</p>
      <p>In addition, the other metrics, i.e., precision, recall, and accuracy, do not reach values above 40%. The only classes in which precision achieves good performance (above 70%) are sandal, sneaker, and bag; this could happen because, although these elements were cut in half, they still preserved features that make them distinguishable. These results suggest that when the model cannot consider the entire image of a garment, it may suffer a large loss of information, e.g., on the shape, which leads to lower performance. While further analysis is required to understand how usable and generalizable deep learning models are in a real-world context, our findings suggest that (1) existing models would not properly work in conditions where the images are not perfectly passed as input; and (2) input testing seems to be a valid instrument to establish the performance of AI models, possibly informing machine learning engineers and data scientists on the need for taking further actions.</p>
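      <p>The cut filtering whose impact is discussed above can be sketched in a few lines of numpy. Blanking the right half of each image and using zero as the blanking value are illustrative assumptions; the paper describes a vertical cut at the center but does not publish the filter implementation.</p>

```python
import numpy as np

def cut_filter(images: np.ndarray) -> np.ndarray:
    """Cut each 28x28 image vertically at the center so the garment is no
    longer fully visible (hedged sketch: which half is removed and the
    blanking value are assumptions)."""
    cut = images.copy()
    w = images.shape[-1]
    cut[..., w // 2:] = 0  # blank everything right of the vertical center line
    return cut

# Example: a uniform stand-in image keeps its left half and loses its right half.
imgs = np.full((2, 28, 28), 200, dtype=np.uint8)
cut = cut_filter(imgs)
assert (cut[..., :14] == 200).all() and (cut[..., 14:] == 0).all()
```

      <p>Feeding such filtered copies of the test set to the trained model, and re-running the same metric computation as in RQ1, is what produces the drop reported above.</p>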
      <p>Key findings of RQ2.</p>
      <p>Our preliminary results indicated that the application of input testing methods causes the performance of the CNN to decrease by up to 60 percentage points with respect to what is reported in the literature. The overall performance ranged, indeed, between 19% and 33% in terms of precision, recall, F-Measure, and accuracy.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>This paper provided a preliminary analysis of how existing deep learning solutions work when they are experimented in seemingly real conditions through the application of input testing methods. We first replicated the design of a previously defined CNN model in the context of image recognition, finding similar performance as the one reported in the literature. Afterward, we re-assessed the performance of the model after the application of input testing methods, discovering a notable drop in all the performance indicators measured; only the sandal, sneaker, and bag classes kept precision above 70%.</p>
      <p>Figure 5: Confusion matrix of the model after applying the cut filter to the Fashion-MNIST dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <sec id="sec-7-1">
        <p>Fabio is partially funded by the Swiss National Science Foundation through SNF Project No. PZ00P2_186090.</p>
        <p>This work has been partially supported by the Qual-AI, Fringe, and EMELIOT national research projects, which have been funded by the MUR under the PRIN 2022, PRIN PNRR 2022, and PRIN 2020 programs, respectively (Contracts 2022B3BP5S, P2022553SL, 2020W3A5FY).</p>
        <p>The reported results might open some discussion on the validation procedures to adopt when experimenting with AI solutions, possibly paving the way to new methodologies and standards to address the performance of those models. At the same time, our findings suggest that the research conducted in the field of image recognition through AI might be worth revisiting to properly understand the actual soundness of those techniques in practice.</p>
        <p>Our future research agenda includes an extension of this work, in which we aim to assess the performance of CNN-based models when considering a larger variety of input testing methods and in-vivo scenarios. Furthermore, we aim to experiment with additional use cases, like the models employed in the context of self-driving cars, security assessment, and others.</p>
        <p>[17] K. Greeshma, K. Sreekumar, Fashion-MNIST classification based on HOG feature descriptor using SVM, International Journal of Innovative Technology and Exploring Engineering 8 (2019) 960–962.</p>
        <p>[18] S. Bhatnagar, D. Ghosal, M. H. Kolekar, Classification of fashion article images using convolutional neural networks, in: 2017 Fourth International Conference on Image Information Processing (ICIIP), IEEE, 2017, pp. 1–6.</p>
        <p>[19] M. Kayed, A. Anter, H. Mohamed, Classification of garments from Fashion-MNIST dataset using CNN LeNet-5 architecture, in: 2020 International Conference on Innovative Trends in Communication and Computer Engineering (ITCE), IEEE, 2020, pp. 238–243.</p>
        <p>[20] Y. Lu, X. Huang, K. Zhang, S. Maharjan, Y. Zhang, Communication-efficient federated learning and permissioned blockchain for digital twin edge networks, IEEE Internet of Things Journal 8 (2021) 2276–2288. doi:10.1109/JIOT.2020.3015772.</p>
        <p>[21] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42 (2012) 463–484. doi:10.1109/TSMCC.2011.2161285.</p>
        <p>[22] R. Baeza-Yates, B. Ribeiro-Neto, et al., Modern Information Retrieval, volume 463, ACM Press, New York, 1999.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>