<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Euro-Mediterranean Workshop on Artificial Intelligence and Smart Systems, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>XG-ViT: Explainable and Generalizable Vision Transformer for Benchmark Image Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sonia Bouzidi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Imen Jdey</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fadoua Drira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ReGIM-Lab. REsearch Group in Intelligent Machines (LR11ES48)</institution>
          ,
          <addr-line>ENIS, Sfax</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>15</volume>
      <issue>2024</issue>
      <fpage>15</fpage>
      <lpage>18</lpage>
      <abstract>
<p>In the ever-evolving fashion industry, sustainability has become a primary focus, driving companies to make significant changes to their business models. Corporate social responsibility is now crucial, emphasizing ethical practices and eco-friendly material sourcing throughout the supply chain. A key part of this collective effort involves the adoption of real-time image classification technologies. This paper introduces the XG-ViT method, an innovative approach that employs a customized Vision Transformer (ViT) for real-time clothing recognition, specifically utilizing the Fashion MNIST dataset. To ensure the ViT model's reliability, we use the Grad-CAM algorithm, which highlights the pixel areas that are most important during predictions in the final attention layer of our XG-ViT model. Our experimental results demonstrate the state-of-the-art performance of the XG-ViT method on the Fashion MNIST benchmark for real-time image classification. Notable metrics include an accuracy of 92.83%, precision of 92.87%, a loss of 21.13%, an F1 score of 92.63%, and a recall of 92.65%. These outcomes validate the effectiveness of the XG-ViT method in meeting the demands of real-time image classification tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Real-time image classification</kwd>
        <kwd>ViT</kwd>
        <kwd>Grad-CAM</kwd>
        <kwd>Fashion MNIST Benchmark</kwd>
        <kwd>XG-ViT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the current fashion industry landscape, a significant transformation in business strategies is occurring,
driven largely by an increased focus on sustainability [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Companies are progressively shifting their
operations to align with sustainable practices, emphasizing corporate social responsibility throughout
their supply chains [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This strategic shift involves responsible sourcing, prioritizing eco-friendly
materials, and adhering to ethical labor standards [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The push towards a sustainable future highlights
the importance of integrating advanced technologies, such as real-time image classification, which
can act as a catalyst for industry-wide change [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Utilizing such technology enables businesses to
enhance supply chain efficiency, minimize waste, and increase transparency [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Concurrently, the adoption of deep learning has emerged as a viable alternative to traditional machine
learning for clothing recognition [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], offering a strategic approach to effectively categorize garments [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Deep learning’s superior accuracy helps reduce the likelihood of customer returns and dissatisfaction
by supporting more informed purchasing decisions [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Although convolutional neural networks
(CNNs) are widely used for clothing classification, their localized processing and high computational
costs limit their practicality for real-time applications [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Vision Transformers (ViTs), which use
self-attention mechanisms for a global understanding of context, provide a promising alternative,
offering both computational efficiency and suitability for real-time tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, enhancing
transparency and interpretability in ViT models remains crucial. Class Activation Mapping (CAM)
techniques, particularly Grad-CAM, visually highlight the key areas influencing model predictions [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Despite the strong performance of ViTs, they are prone to overfitting, which can degrade accuracy and
generalization, especially when trained on smaller datasets [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Overfitting leads to poor performance
on unseen data, undermining the accuracy of clothing classification models.
      </p>
      <p>To address these challenges, this research introduces the XG-ViT methodology, which combines the
ViT model with the Grad-CAM algorithm. This integration allows for the visualization of specific pixel
areas emphasized during predictions. Furthermore, the study employs k-fold cross-validation (K-FCV)
to ensure a comprehensive representation of each class’s characteristics and to mitigate potential
overfitting issues. The paper is structured as follows: Section 2 explores the vision transformer model,
the Grad-CAM algorithm, the overfitting problem, and the K-FCV technique. Section 3 provides a
literature review on image classification using the Fashion-MNIST dataset. Section 4 discusses the
dataset and the proposed XG-ViT method. Experimental results and discussions are presented in Section
5. Finally, Section 6 concludes the study and suggests future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>In this section, we will delve into the key elements of our novel approach: the Vision Transformer, the
Grad-CAM algorithm, the fundamental issue of overfitting, and the K-fold cross-validation method. We
will also emphasize how each of these components is pertinent to our contribution.</p>
      <sec id="sec-2-1">
        <title>2.1. Vision Transformer</title>
        <p>
          ViT (Vision Transformer) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] is a recent breakthrough in the area of computer vision. Its history traces back to 2017, when the
underlying Transformer architecture was initially designed for natural language processing (NLP) [16].
In 2020, its application expanded to computer vision tasks, marking the advent of the "vision
transformer" [17]. By 2021, the ViT had surpassed Convolutional Neural Networks (CNNs) in terms of
performance and efficiency, especially in image classification [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The ViT stands out for its ability to
capture complex patterns in images thanks to its attention-based approach, thus offering an effective
alternative to traditional architectures based on convolutions. These advances have positioned ViT as a
promising method for image classification, opening up new perspectives in the field of deep learning
applied to vision.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Gradient-weighted Class Activation Mapping (Grad-CAM)</title>
        <p>Grad-CAM is widely used in image classification tasks to provide a visual interpretation of how deep
neural networks make decisions. It achieves this by highlighting significant regions in an input image
that influence its final classification [18]. This method produces a class activation map by capturing
gradients associated with the target class in the feature maps of the final convolutional layer [19]. By
identifying critical regions, Grad-CAM offers insights into the attention mechanisms of the network,
thereby enhancing interpretability and transparency in deep learning models [19]. For an illustrative
example, please refer to Figure 2.</p>
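        <p>To make this mechanism concrete, the following is a minimal sketch of the Grad-CAM computation described above for a Keras classifier, assuming a named final convolutional layer; the model, layer name, and target class are placeholders. Applying it to the final attention layer of a ViT, as in XG-ViT, would additionally require reshaping the patch tokens back into a 2D grid.</p>
        <preformat>
import numpy as np
import tensorflow as tf

def grad_cam(model, image, target_class, layer_name):
    """Grad-CAM sketch: weight a layer's feature maps by the gradient of
    the target-class score, sum over channels, ReLU, then normalize."""
    # Sub-model exposing both the chosen layer's feature maps and the logits.
    grad_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=[model.get_layer(layer_name).output, model.output])

    with tf.GradientTape() as tape:
        feature_maps, logits = grad_model(image[np.newaxis, ...])
        class_score = logits[:, target_class]

    # Gradients of the class score with respect to the feature maps.
    grads = tape.gradient(class_score, feature_maps)
    # Channel weights: spatial average of the gradients (one weight per channel).
    weights = tf.reduce_mean(grads, axis=(1, 2))
    # Weighted combination of feature maps, kept positive with ReLU.
    cam = tf.nn.relu(tf.einsum("bhwc,bc->bhw", feature_maps, weights))
    # Normalize to [0, 1] so the map can be overlaid on the input as a heatmap.
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()[0]
        </preformat>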
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Overfitting Problem and K-fold Cross Validation</title>
        <p>Overfitting presents a common issue in attention models, where an excessive grasp of training data
details impedes effective generalization [20]. Despite achieving high accuracy on training data, overfit
models struggle with new data due to interpreting random variations as essential concepts. Addressing
this concern, the widely adopted K-fold Cross Validation (K-FCV) technique in machine learning,
particularly beneficial with limited datasets, partitions the dataset into k subsets [21]. The model
undergoes iterative training k times, using k-1 folds for training and one fold for validation in each
iteration, providing a comprehensive evaluation and minimizing the risk of overfitting to specific
observations [21].</p>
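        <p>The procedure can be summarized in a short sketch using scikit-learn's KFold, assuming build_model is a placeholder factory that returns a freshly compiled Keras model whose first compiled metric is accuracy.</p>
        <preformat>
import numpy as np
from sklearn.model_selection import KFold

def k_fold_evaluate(build_model, X, y, k=5, epochs=25):
    """Train k models, each on k-1 folds, validate on the held-out fold,
    and report the mean and spread of validation accuracy across folds."""
    scores = []
    for train_idx, val_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = build_model()  # fresh, compiled model per fold (placeholder factory)
        model.fit(X[train_idx], y[train_idx], epochs=epochs, verbose=0)
        results = model.evaluate(X[val_idx], y[val_idx], verbose=0)
        scores.append(results[1])  # index 1 assumes accuracy is the first compiled metric
    return float(np.mean(scores)), float(np.std(scores))
        </preformat>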
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Literature review</title>
      <p>Image recognition methods have advanced significantly in recent years, especially on the Fashion MNIST
dataset. As Table 3 summarizes, these studies have aimed to increase classification
accuracy using different deep learning architectures. Below, we summarize significant contributions and place
our work within this developing field.</p>
      <p>A convolutional neural network method for image classification was presented by Kadam et al.
[22] and tested on the Fashion-MNIST dataset. Their method attained an accuracy of 93.5%
by experimenting with multiple architectures and fine-tuning several
hyperparameters, including activation functions, optimizers, and dropout rates, showing that their
method is effective at categorizing more complex images.</p>
      <p>A Multiple Convolutional Neural Network (MCNN15) with 15 convolutional layers was proposed by
Nocentini et al. [23] to improve the accuracy of apparel image classification on the Fashion-MNIST
dataset. Their strategy concentrated on solving problems related to garment manipulation in the
context of service robotics for senior care. They assessed several neural network models and, using the
Fashion-MNIST dataset, obtained a classification accuracy of 94.04%.</p>
      <p>Mukherjee et al. [24] introduced a new deep learning framework called OCFormer (One-Class
Transformer Network for Image Classification), leveraging ViT. Their approach achieved an accuracy
of 92.71%.</p>
      <p>Chhabra et al. [25] introduced PatchRot, a self-supervised technique designed specifically for vision
transformers. By rotating images and image patches and training the network to predict rotation angles,
PatchRot effectively learns both global and local features, achieving an accuracy of 92.6%.</p>
      <p>Chhabra et al. [26] introduced PatchSwap, a regularization technique that involves swapping patches
between two pictures to create new inputs for transformer regularization. Its straightforward approach
facilitates easy extension to semi-supervised environments with minimal effort, achieving an accuracy
of 92.6%.</p>
      <p>Sun et al. [27] introduced MADPL-net (Multi-layer Attention Dictionary Pair Learning Network), an
integrated model that combines convolutional neural network learning schemes, deep encoder learning,
and attention dictionary pair learning (ADicL) into a cohesive framework. Their approach achieved an
accuracy of 91.24%.</p>
      <p>A ViT model, optimized with transformer blocks and self-attention mechanisms, was presented by
Abd Alaziz et al. [28] for the classification of fashion images. The model’s eficacy across many CNN
architectures was demonstrated by its 95.25% accuracy, 95.20% precision, 95.25% recall, and 95.20%
F1-score on the Fashion-MNIST dataset.</p>
      <p>Li et al. [29] introduced MLPEPS (Multi-Layered PEPS), a tensor network model designed for image
classification. MLPEPS employs PEPS to extract features layer by layer from pictures, leveraging these
features in Hilbert space to capture pixel correlations while preserving structural information. Their
approach achieved a classification accuracy of 90.44%.</p>
      <p>Selecting a Vision Transformer for real-time garment recognition is well justified, given the
growing adoption of ViTs in image classification. ViTs are well suited to detecting
complex features and garment designs in real-time scenarios because they capture global patterns in images. Moreover,
ViTs can outperform CNN-based models in terms of accuracy and computational efficiency,
which is important for tasks demanding quick and accurate classification.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section offers a comprehensive explanation of our methodology, which includes the Fashion MNIST
benchmark and details our approach involving the integration of ViT with the Grad-CAM method and
k-FCV technique.</p>
      <sec id="sec-4-1">
        <title>4.1. Fashion Mnist dataset</title>
        <p>Several datasets are used for image recognition in the fashion sector, including Fashion MNIST [30],
DeepFashion [31], Fashion IQ [32], FGVCx Fashion [33] , iMaterialist [34], and ModaNet [35]. The
number of research articles published over the years is the primary criterion for determining which dataset
is most prevalent, allowing these datasets to be compared with respect to image classification tasks.
Adhering to the guidelines of Keele et al. [36], the approach employed for this analysis
comprises manual searches conducted on digital resources, including IEEEXplore Digital Library,
SpringerLink, Digital Library, ACM Digital Library, Wiley Online Library, and Science Direct. Based on
the information gathered and displayed in Figure 1, "Fashion-MNIST" is the most often used dataset,
accounting for nearly 59.9% of research projects.</p>
        <p>
          The Fashion MNIST dataset, created by Zalando, contains 70,000 grayscale images showcasing
different fashion items in 10 distinct categories [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This collection is divided into a training set of 60,000
and a test set of 10,000 pictures, as shown in Table 1. Each image has dimensions of 28x28 pixels and is
labeled according to its respective clothing category.
        </p>
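        <p>As an illustration, this split can be loaded directly through the Keras dataset API; resizing to the 72 × 72 input used by XG-ViT is shown here as an assumed preprocessing step (it could equally be done inside the model).</p>
        <preformat>
import tensorflow as tf

# Fashion MNIST ships with Keras: 60,000 training and 10,000 test images,
# 28x28 grayscale, each labeled with one of 10 clothing categories.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Scale pixels to [0, 1] and add a channel axis.
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

# Resize to the 72x72 resolution used as XG-ViT input (sketch only).
x_train = tf.image.resize(x_train, (72, 72))
x_test = tf.image.resize(x_test, (72, 72))

print(x_train.shape, x_test.shape)  # (60000, 72, 72, 1) (10000, 72, 72, 1)
        </preformat>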
      </sec>
      <sec id="sec-4-2">
        <title>4.2. XG-ViT method and Experimental Setup</title>
        <p>In this section, we present our proposed XG-ViT architecture and detail the implementation
process by describing each step of the model’s design.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. XG-ViT method:</title>
          <p>Building upon the Fashion MNIST benchmark, this study elaborates on our methodology, focusing on
key elements. We utilize the ViT approach for classification, aiming to enhance interpretability and
pinpoint crucial image features using the Grad-CAM algorithm. Additionally, to address overfitting and
ensure representative evaluation across diverse classes, we employ K-FCV.</p>
          <p>Our XG-ViT architecture depicted in Figure 2 processes input images of size 72 × 72. Initially, pictures
are divided into patches, with XG-ViT accommodating variable patch counts. These patches are treated
akin to word embeddings in natural language processing, undergoing transformer-based processing.
The resulting patch embeddings are then linearly projected into a consistent model dimension.</p>
          <p>Linear Transformation of Patches to Vectors: Patches are transformed into vectors through a
learnable linear projection:</p>
          <p>z_linear = [x_1 E; x_2 E; . . . ; x_N E] + E_pos_linear,   (1)
where x_i represents each patch, multiplied by a learned projection matrix E, with E_pos_linear
incorporating positional information.</p>
          <p>Adding Position Tokens: Position tokens are integrated with the class token and the transformed patch
vectors:</p>
          <p>z_0 = [x_class; z_linear] + E_pos,   (2)
where z_0 combines the class token and the linearly projected patches with positional information.</p>
          <p>Encoder Layer with Grad-CAM Integration: The encoder processes z_0 through multiple blocks,
each containing Multi-Head Self Attention (MHSA) and Multi-Layer Perceptron (MLP) components.</p>
          <p>MHSA involves scaled dot-product attention:</p>
          <p>Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V,   (3)
followed by:</p>
          <p>MHSA(Q, K, V) = concat(Atten_1, . . . , Atten_h) W_O,   (4)
where W_Q, W_K, W_V are learned weights, h denotes the number of attention heads, and W_O is the output projection matrix.</p>
          <p>Classification Layer: The final encoder layer output, enhanced by Grad-CAM, feeds into a classifier:
y = LayerNormalization(z_L^0),   (5)
where y represents the model output, ensuring both accuracy and visual insights from Grad-CAM.</p>
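          <p>The following Keras sketch illustrates the pipeline formalized in Eqs. (1)-(5): patch extraction, learnable linear projection with positional embeddings, stacked encoder blocks of MHSA and MLP, and a layer-normalized classification head. The image size, patch size, head count, and layer count follow Section 4.2.2, while the projection dimension is an assumed value and the class token of Eq. (2) is simplified to global average pooling; this is an illustrative reconstruction under those assumptions, not the authors' exact implementation.</p>
          <preformat>
import tensorflow as tf
from tensorflow.keras import layers

IMAGE_SIZE, PATCH_SIZE, PROJ_DIM = 72, 6, 64     # 144 patches of 6x6 pixels; PROJ_DIM assumed
NUM_HEADS, NUM_LAYERS, NUM_CLASSES = 5, 8, 10    # heads and layers as in Table 2
NUM_PATCHES = (IMAGE_SIZE // PATCH_SIZE) ** 2


class PatchEmbedding(layers.Layer):
    """Eqs. (1)-(2): split the image into patches, project each patch
    linearly, and add a learnable positional embedding."""
    def __init__(self):
        super().__init__()
        self.projection = layers.Dense(PROJ_DIM)
        self.position = layers.Embedding(NUM_PATCHES, PROJ_DIM)

    def call(self, images):
        patches = tf.image.extract_patches(
            images=images,
            sizes=[1, PATCH_SIZE, PATCH_SIZE, 1],
            strides=[1, PATCH_SIZE, PATCH_SIZE, 1],
            rates=[1, 1, 1, 1], padding="VALID")
        patches = tf.reshape(patches, (-1, NUM_PATCHES, PATCH_SIZE * PATCH_SIZE))
        positions = tf.range(NUM_PATCHES)
        return self.projection(patches) + self.position(positions)


def build_vit():
    inputs = layers.Input(shape=(IMAGE_SIZE, IMAGE_SIZE, 1))
    x = PatchEmbedding()(inputs)

    # Eqs. (3)-(4): stacked encoder blocks of MHSA + MLP with residual connections.
    for _ in range(NUM_LAYERS):
        h = layers.LayerNormalization(epsilon=1e-6)(x)
        h = layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=PROJ_DIM)(h, h)
        x = layers.Add()([x, h])
        h = layers.LayerNormalization(epsilon=1e-6)(x)
        h = layers.Dense(2 * PROJ_DIM, activation="gelu")(h)
        h = layers.Dense(PROJ_DIM)(h)
        x = layers.Add()([x, h])

    # Eq. (5): layer-normalized representation feeding the classification head
    # (global average pooling stands in for the class token in this sketch).
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(NUM_CLASSES)(x)
    return tf.keras.Model(inputs, outputs)
          </preformat>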
          <p>We integrated the k-FCV technique into our methodology, dividing Fashion MNIST into k subsets
for comprehensive validation:</p>
          <p>In our approach, the Fashion MNIST dataset is divided into k equal subsets. The K-fold
cross-validation iterates k times, using one subset for testing and the rest for training per
iteration. Performance metrics from all folds are aggregated to assess model generalization,
ensuring robust evaluation across diverse dataset subsets.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Experimental Setup</title>
          <p>Both ViT and XG-ViT implementations were conducted using Python 3.x and TensorFlow on Google
Colab. A Tesla K80 GPU was utilized to accelerate model training and results generation.
Hyperparameters used in our experiments are detailed in Table 2. The input pictures were resized to 72 × 72 pixels,
with a patch size of 6 for the input sequence. A batch size of 256 was chosen to optimize accuracy.
The initial learning rate was set to 0.001 after extensive evaluation. The AdamW optimizer with a
momentum of 0.9 was selected based on its superior performance in comparative experiments. Our
model architecture includes 5 attention heads, 8 transformer layers, and was trained for 25 epochs.
A weight decay of 0.0001 was applied to regularize the model. We employed 5-FCV to ensure robust
evaluation and enhance generalization of the model across different subsets of the dataset.</p>
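          <p>A minimal sketch of this training configuration, reusing build_vit() and the preprocessed arrays from the earlier sketches, is given below; the AdamW import path varies with the TensorFlow version (older releases expose it through tensorflow_addons), so its location here is an assumption.</p>
          <preformat>
import tensorflow as tf

# Hyperparameters from Table 2 / Section 4.2.2.
LEARNING_RATE = 0.001
WEIGHT_DECAY = 0.0001
BATCH_SIZE = 256
EPOCHS = 25

model = build_vit()  # from the previous sketch

# AdamW (tf.keras.optimizers.AdamW in recent TF releases); beta_1=0.9
# plays the role of the "momentum of 0.9" mentioned above.
optimizer = tf.keras.optimizers.AdamW(
    learning_rate=LEARNING_RATE, weight_decay=WEIGHT_DECAY, beta_1=0.9)

model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[
        "accuracy",
        tf.keras.metrics.SparseTopKCategoricalAccuracy(5, name="top5_accuracy"),
    ])

history = model.fit(
    x_train, y_train,
    batch_size=BATCH_SIZE, epochs=EPOCHS,
    validation_data=(x_test, y_test))
          </preformat>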
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>This section presents the outcomes of our methodology, highlighting the efficacy of our ViT-based
approach, integration of K-FCV, and Grad-CAM algorithm. We discuss the achieved results in detail
and analyze the strengths and challenges encountered during our experiments.</p>
      <sec id="sec-5-1">
        <title>5.1. Performance Evaluation Metrics</title>
        <p>Our XG-ViT method’s performance was evaluated using several key metrics to assess its effectiveness
in image classification tasks. These metrics provide a comprehensive view of model performance across
different aspects:
• Precision: Precision measures the accuracy of positive predictions. It is calculated as
Precision = TP / (TP + FP),   (6)
where TP (True Positive) and FP (False Positive) represent correct and incorrect positive
predictions, respectively.
• Recall: Recall measures the proportion of correctly predicted positive instances out of all actual
positive instances. It is defined as
Recall = TP / (TP + FN),   (7)
where FN (False Negative) represents positive instances incorrectly classified as negative.
• F1 Score: The F1 score is the harmonic mean of precision and recall, providing a balance between
the two metrics:
F1 = 2 × (Precision × Recall) / (Precision + Recall).   (8)
• Accuracy: Accuracy indicates the proportion of correctly classified instances among all instances:
Accuracy = (TP + TN) / (TP + TN + FP + FN),
where TN (True Negative) represents correctly classified negative instances.
• Top-5 Accuracy: This metric measures the percentage of samples for which the correct label is
among the top 5 predicted labels. It is computed as
Top-5 Accuracy = C_5 / N,
where C_5 is the number of samples whose correct label appears among the top 5 predictions, and N is the
total number of samples.
• Loss: Loss quantifies the difference between predicted values and actual labels in the dataset,
providing a measure of model performance.</p>
        <p>These metrics collectively assess the performance of our XG-ViT model, providing insights into its
classification accuracy, predictive power, and robustness.</p>
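        <p>These metrics can be reproduced from model predictions with scikit-learn, as in the sketch below; macro averaging over the 10 classes is an assumption, since the averaging scheme is not stated.</p>
        <preformat>
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_metrics(model, x_test, y_test):
    """Compute the metrics of Eqs. (6)-(8) plus accuracy and top-5 accuracy
    from model logits (macro averaging assumed over the 10 classes)."""
    logits = model.predict(x_test, verbose=0)
    y_pred = np.argmax(logits, axis=1)

    top5 = np.argsort(logits, axis=1)[:, -5:]  # the 5 highest-scoring labels per sample
    top5_acc = np.mean([y in row for y, row in zip(y_test, top5)])

    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="macro"),
        "recall": recall_score(y_test, y_pred, average="macro"),
        "f1": f1_score(y_test, y_pred, average="macro"),
        "top5_accuracy": top5_acc,
    }
        </preformat>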
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experimental Results</title>
        <p>Our experimental results demonstrate the performance of the XG-ViT model across various
evaluation metrics. The model achieved a mean accuracy of 91.30%, precision of 90.64%, recall of 90.56%,
training time of 31:56 min, test top-5 accuracy of 99.92%, loss of 22.86%, and F1 score of 90.51%. These
metrics collectively indicate the model’s robust performance in classifying the Fashion MNIST dataset, as
summarized in Table 4.</p>
        <p>In analyzing the ViT results, we noted a significant disparity between training and testing accuracies.
While the training accuracy reached 96.54%, the testing accuracy was 91.30%. This discrepancy suggests
that ViT trained on the Fashion MNIST dataset struggles with generalization to new data, indicating
potential overfitting. To validate our model’s generalizability, we employed a 5-fold cross-validation
method. This method effectively mitigates overfitting and enhances the robust utilization of the Fashion
MNIST dataset. Through iterative evaluations with varying k values, we consistently found optimal
results when k ≥ 5. Figure 3 illustrates the progression of model evaluation across different k values
up to 5. Moreover, this underscores the effectiveness of leveraging the Fashion MNIST dataset to its
fullest extent, resulting in improved performance for image classification tasks.</p>
        <p>As illustrated in Table 4, the application of k-FCV notably enhanced the performance of the ViT model.
We observed significant improvements across key metrics: accuracy improved to 92.83%, precision to
92.87%, recall to 92.65%, training time reduced to 20 minutes and 39 seconds, top-5 accuracy reached
99.94%, and the F1 score increased to 92.63%. Moreover, the loss value decreased substantially to 21.13%.
As the number of epochs increased, we observed improved performance. However, this also resulted
in a significant increase in processing time. Therefore, it is essential to find a balance to enhance
performance while reducing execution time.</p>
        <p>Following the implementation of k-FCV, we conducted an ablation experiment by integrating the
Grad-CAM algorithm with ViT and k-FCV. Interestingly, our experiment revealed no significant variations
in the results. This experiment confirmed that the inclusion of Grad-CAM did not substantially affect
the performance of our method. Instead, its primary role was to enhance the interpretability of our
model’s decision-making process. The integration of Grad-CAM has proven instrumental in elucidating
the critical regions within input pictures that influence the final classification of our XG-ViT method. By
leveraging Grad-CAM, we visually inspect these pivotal regions, providing a transparent portrayal of
their impact on the model’s decisions. This integration enhances transparency and instills confidence
in the reliability of our model’s predictions, as demonstrated in Figure 4.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Discussion</title>
        <p>Our study firmly establishes the superiority of the ViT model over CNNs. The self-attention mechanism
in ViTs enhances their ability to capture semantic relationships across long distances, thereby significantly
improving task performance. Importantly, existing research often lacks information on the execution
times of CNNs, which is crucial for computationally efficient applications. CNNs, with their localized
computation nature, may experience prolonged execution times when applied to complex architectures.
In contrast, ViTs’ proficiency in handling global information suggests their potential to mitigate these
challenges, opening up new possibilities for resource-intensive image processing applications.</p>
        <p>In important aspects beyond accuracy, our proposed XG-ViT model performs better not only than
our baseline ViT (Table 4) but also than a number of approaches found in the literature (Table 3). Even
though some techniques are more accurate than others, XG-ViT offers better explainability thanks to
Grad-CAM, which gives users a visual glimpse into the model’s decision-making process and makes
the results easier to understand and trust. Furthermore, in order to present a complete picture of its
performance, XG-ViT makes use of a wide range of assessment metrics, including accuracy, precision,
recall, F1 score, loss, training time, and test top-5 accuracy. This diversity guarantees that the model is
assessed from several angles, such as its capacity to reduce false positives, identify pertinent examples,
strike a balance between precision and recall, and monitor training efficiency and prediction mistakes.
By lowering overfitting and enhancing generalizability, K-fold cross-validation enhances the model’s
robustness and guarantees dependability across various data subsets. Thus, XG-ViT shows itself to be a
versatile and adaptive model that works well for real-world apparel detection tasks where robustness,
explainability, and consistent performance are crucial.</p>
        <p>We experimented on two additional datasets, Deep Fashion [37], which has 68,000 images (58,000
for training and 10,000 for testing) across 50 categories, and FGVCx Fashion [17], which has 55,000
runway images (50,000 for training and 5,000 for testing) across 50 categories, in order to evaluate the
generalizability of our method. These datasets are particularly difficult for classification tasks because
they include high-resolution photographs with complicated backgrounds and substantial variation in
stance, lighting, and apparel.</p>
        <p>In comparison to the other two datasets, our results, which are summarized in Table 4, show a
discernible performance gain on the Fashion MNIST dataset. There are several reasons for these
variations in performance. First, the Fashion MNIST dataset features pictures of distinct apparel
items against consistent backdrops, which probably made our method’s classification task easier.
Secondly, the Fashion MNIST dataset shows a lower training time while being larger (70,000 pictures
total; 60,000 for training and 10,000 for testing). The reason for this is that the pictures were relatively
simple (lower resolution, 28x28 pixels), which allowed for faster processing, and the extraction of
visual information required a less complicated convolutional model. The performance might have also
been affected by the fact that the class distribution in Fashion MNIST seems to be more balanced than
in the other datasets. Finally, the presence of intricate elements like patterns or accessories in these
photos, along with variations in annotation quality or selection biases in the Deep Fashion and FGVCx
Fashion datasets, may have added to the longer training times and poorer performance. We can better
understand why our method performed differently across datasets by looking at these characteristics,
which increases our confidence in its capacity to generalize.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In conclusion, this research highlights the effectiveness of employing the ViT model with the Grad-CAM
method and K-FCV method for fashion MNIST classification, particularly in the realm of sustainable
fashion. Our study demonstrates the capability of our approach to achieve high accuracy in image
classification tasks. The Grad-CAM provides valuable insights by highlighting crucial regions in images
that influence classification decisions, while K-FCV ensures robust model generalization and helps
mitigate overfitting. Our experiments underscore the approach’s resilience to hyperparameter tuning,
showcasing its effectiveness in handling diverse classes and intricate image features. Looking ahead,
we plan to explore the applicability of our method across various domains to ensure its scalability
and broader utility in future research endeavors. In particular, we aim to apply this methodology in
healthcare applications, especially for skin disease recognition tasks, where explainability and
reliable predictions are critically important.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgment</title>
      <p>This work was supported by the Ministry of Higher Education and Scientific Research of Tunisia
through the ReGIM-Lab. REsearch Group in Intelligent Machines (LR11ES48). The authors gratefully
acknowledge this support.</p>
    </sec>
    <sec id="sec-9">
      <title>References (continued)</title>
      <p>[16] S. Jamil, M. Jalil Piran, O.-J. Kwon, A comprehensive survey of transformers for computer vision, Drones 7 (2023) 287.</p>
      <p>[17] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, U. Evci, K. Xu, R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol, et al., Meta-dataset: A dataset of datasets for learning to learn from few examples, arXiv preprint arXiv:1903.03096 (2019).</p>
      <p>[18] H. Jiang, J. Xu, R. Shi, K. Yang, D. Zhang, M. Gao, H. Ma, W. Qian, A multi-label deep learning model with interpretable grad-cam for diabetic retinopathy classification, in: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine &amp; Biology Society (EMBC), IEEE, 2020, pp. 1560–1563.</p>
      <p>[19] S. Kumar, A. A. Abdelhamid, Z. Tarek, Visualizing the unseen: exploring grad-cam for interpreting convolutional image classifiers, J. Full Length Artic 4 (2023) 34–42.</p>
      <p>[20] J. Li, F. Gao, S. Lin, M. Guo, Y. Li, H. Liu, S. Qin, Q. Wen, Quantum k-fold cross-validation for nearest neighbor classification algorithm, Physica A: Statistical Mechanics and its Applications 611 (2023) 128435.</p>
      <p>[21] B. Anandan, M. Manikandan, Machine learning approach with various regression models for predicting the ultimate tensile strength of the friction stir welded aa 2050-t8 joints by the k-fold cross-validation method, Materials Today Communications 34 (2023) 105286.</p>
      <p>[22] S. S. Kadam, A. C. Adamuthe, A. B. Patil, Cnn model for image classification on mnist and fashion-mnist dataset, Journal of Scientific Research 64 (2020) 374–384.</p>
      <p>[23] O. Nocentini, J. Kim, M. Z. Bashir, F. Cavallo, Image classification using multiple convolutional neural networks on the fashion-mnist dataset, Sensors 22 (2022) 9544.</p>
      <p>[24] P. Mukherjee, C. K. Roy, S. K. Roy, Ocformer: One-class transformer network for image classification, arXiv preprint arXiv:2204.11449 (2022).</p>
      <p>[25] S. Chhabra, P. B. Dutta, H. Venkateswara, B. Li, Patchrot: A self-supervised technique for training vision transformers, arXiv preprint arXiv:2210.15722 (2022).</p>
      <p>[26] S. Chhabra, H. Venkateswara, B. Li, Patchswap: A regularization technique for vision transformers, in: BMVC, 2022, p. 996.</p>
      <p>[27] Y. Sun, G. Shi, W. Dong, X. Xie, Madpl-net: Multi-layer attention dictionary pair learning network for image classification, Journal of Visual Communication and Image Representation 90 (2023) 103728.</p>
      <p>[28] H. M. Abd Alaziz, H. Elmannai, H. Saleh, M. Hadjouni, A. M. Anter, A. Koura, M. Kayed, Enhancing fashion classification with vision transformer (vit) and developing recommendation fashion systems using dinova2, Electronics 12 (2023) 4263.</p>
      <p>[29] L. Li, H. Lai, Multi-layered projected entangled pair states for image classification, Sustainability 15 (2023) 5120.</p>
      <p>[30] J. Xin, T. J. Yi, V. P. Yi, P. J. Yu, Z. A. A. Salam, Convolutional neural network for fashion images classification (fashion-mnist), Journal of Applied Technology and Innovation 7 (2023) 11.</p>
      <p>[31] H. An, K. Y. Lee, Y. Choi, M. Park, Conceptual framework of hybrid style in fashion image datasets for machine learning, Fashion and Textiles 10 (2023) 18.</p>
      <p>[32] H. Wu, Y. Gao, X. Guo, Z. Al-Halah, S. Rennie, K. Grauman, R. Feris, Fashion iq: A new dataset towards retrieving images by natural language feedback, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11307–11317.</p>
      <p>[33] S. X. Hu, D. Li, J. Stühmer, M. Kim, T. M. Hospedales, Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 9068–9077.</p>
      <p>[34] S. Guo, W. Huang, X. Zhang, P. Srikhanta, Y. Cui, Y. Li, H. Adam, M. R. Scott, S. Belongie, The imaterialist fashion attribute dataset, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.</p>
      <p>[35] X. Wang, Towards color compatibility in fashion using machine learning, 2019.</p>
      <p>[36] S. Keele, et al., Guidelines for performing systematic literature reviews in software engineering, Technical Report, ver. 2.3, EBSE, 2007.</p>
      <p>[37] Z. Liu, P. Luo, S. Qiu, X. Wang, X. Tang, Deepfashion: Powering robust clothes recognition and retrieval with rich annotations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1096–1104.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Thorisdottir</surname>
          </string-name>
          , L. Johannsdottir,
          <article-title>Corporate social responsibility influencing sustainability within the fashion industry. a systematic review</article-title>
          ,
          <source>Sustainability</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>9167</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bartkute</surname>
          </string-name>
          ̇,
          <string-name>
            <given-names>D.</given-names>
            <surname>Streimikiene</surname>
          </string-name>
          , T. Kačerauskas,
          <article-title>Between fast and sustainable fashion: The attitude of young lithuanian designers to the circular economy</article-title>
          ,
          <source>Sustainability</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>9986</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurian</surname>
          </string-name>
          ,
          <article-title>Sustainable packaging: A study on consumer perception on sustainable packaging options in e-commerce industry</article-title>
          ,
          <source>Natural Volatiles &amp; Essential Oils</source>
          <volume>8</volume>
          (
          <year>2021</year>
          )
          <fpage>10547</fpage>
          -
          <lpage>10559</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          , G. Beliakov,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>A brief survey of machine learning and deep learning techniques for e-commerce research</article-title>
          ,
          <source>Journal of Theoretical and Applied Electronic Commerce Research</source>
          <volume>18</volume>
          (
          <year>2023</year>
          )
          <fpage>2188</fpage>
          -
          <lpage>2216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bouzidi</surname>
          </string-name>
          , G. Hcini,
          <string-name>
            <given-names>I.</given-names>
            <surname>Jdey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Drira</surname>
          </string-name>
          ,
          <article-title>Convolutional neural networks and vision transformers for fashion mnist classification: A literature review</article-title>
          ,
          <source>arXiv preprint arXiv:2406.03478</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Retracted: Classification and identification of garment images based on deep learning</article-title>
          ,
          <source>Journal of Intelligent &amp; Fuzzy Systems</source>
          <volume>44</volume>
          (
          <year>2023</year>
          )
          <fpage>4223</fpage>
          -
          <lpage>4232</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Jdey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toumi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khenchaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dhibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouhlel</surname>
          </string-name>
          ,
          <article-title>Fuzzy fusion system for radar target recognition</article-title>
          ,
          <source>International Journal of Computer Applications &amp; Information Technology</source>
          <volume>1</volume>
          (
          <year>2012</year>
          )
          <fpage>136</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bouzidi</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Jdey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alimi</surname>
          </string-name>
          ,
          <article-title>A vision transformer approach with l2 regularization for sustainable fashion classification</article-title>
          ,
          <source>Available at SSRN</source>
          <volume>4686032</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hcini</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Jdey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ltifi</surname>
          </string-name>
          ,
          <article-title>Hyperparameter optimization in customized convolutional neural network for blood cells classification</article-title>
          ,
          <source>J. Theor. Appl. Inf. Technol</source>
          <volume>99</volume>
          (
          <year>2021</year>
          )
          <fpage>5425</fpage>
          -
          <lpage>5435</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>I. Jdey</surname>
          </string-name>
          ,
          <article-title>Trusted smart irrigation system based on fuzzy iot and blockchain</article-title>
          , in: International Conference on Service-Oriented Computing, Springer,
          <year>2022</year>
          , pp.
          <fpage>154</fpage>
          -
          <lpage>165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Real-time fashion-guided clothing semantic parsing: A lightweight multi-scale inception neural network and benchmark</article-title>
          .,
          <source>in: AAAI Workshops</source>
          , volume
          <volume>1</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El-Nouby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <article-title>Levit: a vision transformer in convnet's clothing for faster inference</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>12259</fpage>
          -
          <lpage>12269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.</given-names>
            <surname>Katar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Yildirim</surname>
          </string-name>
          ,
          <article-title>An explainable vision transformer model based white blood cells classification and localization</article-title>
          ,
          <source>Diagnostics</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>2459</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.-B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Fv-vit:
          <article-title>Vision transformer for finger vein recognition</article-title>
          ,
          <source>IEEE Access 11</source>
          (
          <year>2023</year>
          )
          <fpage>75451</fpage>
          -
          <lpage>75461</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>O. N.</given-names>
            <surname>Manzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ahmadabadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kashiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Shokouhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ayatollahi</surname>
          </string-name>
          ,
          <article-title>Medvit: a robust vision transformer for generalized medical image classification</article-title>
          ,
          <source>Computers in biology and medicine 157</source>
          (
          <year>2023</year>
          )
          <fpage>106791</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>