Data Filtering for a Sustainable Model Training

Francesco Scala1,2,*, Sergio Flesca1 and Luigi Pontieri2

1 Dept. Computer Engineering, Modeling, Electronics, and Systems Engineering (DIMES), University of Calabria, 87036 Rende (CS), Italy
2 Institute of High Performance Computing and Networking (ICAR-CNR), Via P. Bucci, 87036 Rende (CS), Italy


Abstract
The remarkable capabilities of deep neural networks (DNNs) in addressing intricate problems are accompanied by a notable environmental toll. Training these networks demands immense energy consumption, owing to the vast volumes of data needed, the sizeable models employed, and the prolonged training durations. In light of the principles of Green-AI, which emphasize reducing the ecological footprint of AI technologies, this is a pressing concern. In response, we introduce DFSMT, an approach tailored to selecting a subset of labeled data for training, thereby aligning with Green-AI objectives. Our methodology leverages Active Learning (AL) techniques, which systematically identify and select batches of the most informative data instances for model training. Through an iterative application of diverse AL strategies, we curate a labeled data subset that preserves enough information to maintain model quality standards. Empirical results underscore the effectiveness of our approach, demonstrating substantial reductions in labeled data requirements without significantly compromising model performance. This achievement carries particular significance in the context of Green-AI, providing a pathway to mitigate the environmental impact of AI training processes.

Keywords
Active Learning, Green-AI, Data Selection, Energy Efficiency, Sustainability




                                1. Introduction
                                Artificial Intelligence (AI) has undergone significant growth in recent years, bringing about
                                transformative changes in various industries and offering innovative solutions to intricate
                                problems. Its impact spans sectors ranging from healthcare and finance to manufacturing
                                and retail, reshaping both our lifestyles and professional environments. Nevertheless, this
                                expansive development has introduced challenges, particularly in terms of increased energy
                                consumption and, consequently, carbon emissions. Moreover, this issue is projected to escalate
                                significantly, as highlighted in [1]. The training phase of AI models, with its substantial
                                demands for data and computing power, is a primary contributor to this energy-intensive
                                process [2]. Effectively training high-performing AI models necessitates vast amounts of data
                                and considerable computing power, resulting in a notable increase in energy consumption.
                                The carbon emissions linked to AI predominantly stem from the electricity utilized during the
                                training phase of these models. Since electricity predominantly originates from non-renewable
energy sources, such as coal and natural gas, training AI models significantly contributes to

                                SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
                                *
                                 Corresponding author.
                                $ francesco.scala@icar.cnr.it (F. Scala); sergio.flesca@unical.it (S. Flesca); luigi.pontieri@icar.cnr.it (L. Pontieri)
                                 0009-0007-5224-0910 (F. Scala); 0000-0002-4164-940X (S. Flesca); 0000-0003-4513-0362 (L. Pontieri)
                                          © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




global warming. Indeed, despite advancements, non-renewable energy sources still dominate the energy production landscape [3].
   The aim of mitigating this effect on global warming has pushed the research community to work on Green-AI, whose goal is to reduce the environmental impact of AI by promoting the development of efficient and sustainable models and algorithms. Green-AI focuses on several key areas:

    • Reducing energy consumption: Developing models and algorithms that require less
      energy for training and use;
    • Using renewable energy: Powering AI training and use with renewable energy, such as
      solar and wind power;
    • Developing efficient hardware: Designing hardware specifically for AI that is more
      energy efficient;
    • Recycling and reuse: Promoting the recycling and reuse of hardware components used
      for AI.

   In this paper, we investigate the issue of reducing energy consumption during the training phase of AI models. Various methodologies have been introduced to tackle this challenge, including MdBR [4] for regression on static data, n-gram counting [5] for machine translation, and the enhanced OPF method by Chouvatut et al. [6], which minimizes training set size for classifiers with minimal accuracy loss. Furthermore, clustering techniques have been employed to eliminate irrelevant training samples.
   In this work, we investigate the possibility of leveraging Active Learning (AL) [7, 8, 9, 10, 11, 12] to reduce the volume of data required for training AI models by carefully selecting the most informative data points within the dataset, thus reducing the energy demands of the training phase.
   AL techniques are designed to find the most informative data for model training; they were born out of the recognition that data labeling is one of the most resource-intensive and time-consuming steps in AI model development. AL selects data points for labeling, typically by a human expert annotator, to maximize learning efficiency and minimize the overall data labeling cost. Various approaches have been defined for this purpose. For instance, Least Confidence Sampling (LCS) [8] prioritizes items with the lowest confidence in their predicted label, while LAL-IGrad and its enhancements [10, 11] exploit gradient variation within artificial neural networks to estimate instance relevance. Additionally, Ash et al. [12] proposed BAIT, a technique for selecting batches of samples by optimizing a bound on the Maximum Likelihood Estimator (MLE) error in terms of the Fisher information.
   In this paper we propose DFSMT, a versatile technique that combines various AL methodologies to actively explore the data space within a pool-based framework, thus identifying the most informative data for the model. AL techniques iteratively select the most informative subset of labeled data to achieve acceptable model quality. To retain efficiency, the emphasis is on computationally lightweight techniques; otherwise, the selection process could become more resource-intensive than training the neural network itself. Experimental results demonstrate that the proposed technique can significantly reduce the amount of labeled data required for training AI models, while preserving high model quality. This outcome holds particular significance from the perspective of Green-AI, as our technique offers a notable reduction in the environmental impact of AI. It achieves this by significantly lowering the computational cost associated with training AI models. Rather than performing resource-intensive backpropagation over the entire dataset, the technique trains on a smaller, optimized subset obtained by exploiting AL techniques. This drastic reduction in energy and computational power usage aligns with a more environmentally friendly approach to AI model training.


2. Related Work
In recent years, the field of machine learning has witnessed a growing interest in data reduc-
tion techniques. This interest is motivated by various needs, including the optimization of
computational resources, the reduction of the environmental impact of artificial intelligence
(Green-AI), and the improvement of model generalization. In this context, our work falls within
the research line that aims to reduce the amount of data required for training machine learning
models while maintaining high model quality. Several studies have explored data reduction
approaches in different contexts.
   For example, the MdBR [4] (Multidimensional binned reduction) method focuses on regression
tasks and uses discretization and non-parametric reduction techniques to achieve significant
data reduction (over 99%) while maintaining or even improving model performance. However,
MdBR is limited to static data and cannot handle time series. In the field of machine translation,
Lewis et al. [5] proposed an n-gram counting approach that reduces the size of datasets by up
to 90%, without a significant loss of quality (measured by the BLEU score [13]). This method is
scalable to large datasets and offers advantages beyond data reduction, such as faster training
times and smaller model sizes.
   Koggalage et al. [14] proposed a strategy that uses clustering techniques to identify and remove irrelevant training samples that do not affect the decision boundary. This approach reduces the training set size without compromising classification accuracy, but it is specific to SVMs. Chouvatut et al. proposed the improved OPF (Optimum-Path Forest) method [6] to reduce the training set size for classifiers. This method is based on a graph-based algorithm and a segmented linear regression approach, and achieves a 7-21% reduction in the training set size while maintaining similar accuracy (with a 0.2-0.5% decrease). In some cases, the improved OPF even achieves exactly the same accuracy as the original OPF algorithm.
   Yang et al. [15] proposed a method called incremental adaptive deep model (IADM) that
addresses the challenges of training deep models on streaming data with evolving distributions.
It employs an adaptive attention mechanism to adjust model depth and utilizes an attention-
based Fisher information matrix to prevent catastrophic forgetting, enabling efficient and
accurate learning on incremental data.
   Our work differs from previous ones in the following aspects:

    • Combination of different active learning (AL) strategies: DFSMT uses a combination
      of AL techniques, potentially offering greater flexibility and adaptability compared to
      single-strategy approaches;
     • Focus on Green-AI: Our work explicitly emphasizes environmental impact reduction as
       a key aspect of data reduction, a unique focus in the current landscape;
     • Potentially broader applicability: Our approach aims for broader applicability, not
       limited to a specific task or data type.

   By highlighting these strengths and comparing our work to related studies, we can effectively
position our research within the current landscape of data reduction techniques and emphasize
its potential contributions to Green-AI and other research fields. Our proposal contributes to
this line of research by combining different active learning techniques to identify the most
informative data points iteratively. This approach has the potential to further reduce the amount
of labeled data required for training high-quality AI models, contributing to more efficient and
environmentally friendly AI development.


3. Proposed Approach
A classification problem consists in associating every instance taken from a predefined domain 𝒟 with a label selected from a fixed domain of labels ℒ. We assume the presence of a set of instance-label pairs 𝐿𝑆 ⊆ 𝒟 × ℒ, where for each pair ⟨𝑥, 𝑦⟩ ∈ 𝐿𝑆, 𝑥 is an instance in 𝒟 and 𝑦 is the label associated with 𝑥. Algorithm 1 shows the general schema of the proposed approach, named Data Filtering for a Sustainable Model Training (DFSMT), and Algorithm 2 shows how the selection is performed. DFSMT receives as input the dataset 𝐿𝑆, a neural network model NN, the number 𝑒𝑝𝑐ℎ of training epochs, the number 𝑠𝑡𝑒𝑝𝑠 of selection steps, the number 𝑝𝑠 of relevant instances selected at the start, the number 𝑝𝑛 of relevant instances to select at each step, and a set 𝐴𝑆 of AL techniques. SelectionAlgorithm receives as input the set 𝐿𝑆 of instances not yet selected from the dataset, the number 𝑘 of relevant instances to select, and the set 𝐴𝑆 of AL techniques.
   The DFSMT algorithm starts by selecting a number of instances and placing them in the training set 𝑇𝑆 for initial training. The model then learns iteratively: at each step, 𝑝𝑛 additional instances are added to 𝑇𝑆 using SelectionAlgorithm, which receives as input 𝐿𝑆, 𝐴𝑆, a set of statistics about the samples needed by the AL techniques (which may differ from one technique to another), and 𝑝𝑛. During each iteration, the model is updated/trained with both the new and the existing instances. Finally, the trained model is returned.
 Algorithm 1: DFSMT
  Data: 𝐿𝑆: dataset, NN: neural network model, 𝑒𝑝𝑐ℎ: number of epochs, 𝑠𝑡𝑒𝑝𝑠: number of steps,
         𝑝𝑠: number of relevant instances at start, 𝑝𝑛: number of relevant instances to select at each
         step, 𝐴𝑆: a set of AL techniques
1 𝑇𝑆 ← SelectionAlgorithm(𝐿𝑆, 𝑝𝑠, 𝐴𝑆)
2 Train NN on 𝑇𝑆 for 𝑒𝑝𝑐ℎ epochs
3 for 𝑖 = 1 . . . 𝑠𝑡𝑒𝑝𝑠 do
4     stats ← getStats(𝐿𝑆, NN, 𝐴𝑆)
5     𝑇𝑆 ← 𝑇𝑆 ∪ SelectionAlgorithm(𝐿𝑆, 𝐴𝑆, stats, 𝑝𝑛)
6     Train NN on 𝑇𝑆 for 𝑒𝑝𝑐ℎ epochs
7 return NN
   The core of the proposed approach is SelectionAlgorithm, which is responsible for selecting the instances to be used for training. This algorithm combines the active learning techniques contained in the set 𝐴𝑆. For each instance in 𝐿𝑆, the algorithm computes a relevance score with each technique and then combines them. Finally, the top-𝑘 instances with the highest combined scores are selected and returned. Clearly, the more techniques in 𝐴𝑆, the more accurate the selection should be, but at the expense of energy consumption and computation time.

    Algorithm 2: SelectionAlgorithm
      Data: 𝐿𝑆: not-selected instances in the dataset, 𝐴𝑆: a set of AL techniques, stats: a set of data
            statistics necessary for 𝐴𝑆, 𝑘: number of relevant instances to select.
    1 𝑆 ← []
    2 for 𝑖𝑛𝑠𝑡𝑎𝑛𝑐𝑒 ∈ 𝐿𝑆 do
    3     𝑠𝑐𝑜𝑟𝑒 ← 0
    4     for 𝑡𝑒𝑐ℎ𝑛𝑖𝑞𝑢𝑒 ∈ 𝐴𝑆 do
    5         𝑠𝑐𝑜𝑟𝑒 ← 𝑠𝑐𝑜𝑟𝑒 + 𝑓𝑡𝑒𝑐ℎ𝑛𝑖𝑞𝑢𝑒 (𝑖𝑛𝑠𝑡𝑎𝑛𝑐𝑒, 𝑠𝑡𝑎𝑡𝑠[𝑖𝑛𝑠𝑡𝑎𝑛𝑐𝑒])
    6     𝑆 ← 𝑆 ∪ {𝑠𝑐𝑜𝑟𝑒}
    7 𝑡𝑜𝑝𝐾 ← select the top-𝑘 instances from 𝐿𝑆 based on the scores in 𝑆
    8 𝐿𝑆 ← 𝐿𝑆 ∖ 𝑡𝑜𝑝𝐾
    9 return 𝑡𝑜𝑝𝐾


3.1. Computational reduction
Active learning (AL) offers a pathway to streamline AI model development while aligning
with the principles of Green-AI. The core concept lies in the strategic selection of the most
informative data samples from a larger labeled dataset. By training on this optimized subset,
AL techniques can reduce the overall computational costs associated with reaching a target
accuracy level. The potential for energy reduction is directly linked to the following factors:
        • Energy Cost per Data Point: The hardware used (CPUs, GPUs or TPUs) and the com-
          plexity of the neural network architecture dictate the energy expenditure on processing
          each data point during training. Optimizing algorithms for specific hardware can further
          reduce this cost;
        • Data Reduction Effectiveness: A core measure of AL effectiveness is its ability to
          drastically reduce the training set size while preserving model performance. The greater
          the reduction achievable, the higher the potential energy savings;
        • AL Complexity: Active learning techniques vary in computational overhead. Simpler
          methods like uncertainty sampling may have minimal cost, while more sophisticated
          approaches can introduce higher computation. Indeed, using a computationally intensive
          AL technique may render the proposed method ineffective, because the selection process
          can become more burdensome than the neural network's training;
        • Impact on Training Convergence: The interaction between data reduction and the
          model’s convergence behavior cannot be ignored. In some cases, a highly informative
          dataset might lead to fewer training iterations, amplifying savings. However, it’s also
          possible that more iterations might be required to converge, partially offsetting the energy
          gains.
   The significance of energy conservation has long been recognized [16, 17, 18], leading to
ongoing advancements in power consumption estimation methodologies. Alongside these
theoretical developments, practical tools for building energy consumption modeling have
emerged. For the purpose of calculating energy savings, we employed the following formula,
established in the work of Lannelongue et al. [19]:

                    𝐸 = 𝑡 × (𝑛𝑐 × 𝑃𝑐 × 𝑢𝑐 + 𝑛𝑚 × 𝑃𝑚 ) × 𝑃𝑈𝐸 × 0.001                            (1)
  Where:

     • 𝑡: the running time (hours);
     • 𝑛𝑐 : the number of cores;
     • 𝑛𝑚 : the size of memory available (gigabytes);
     • 𝑢𝑐 : the core usage factor (between 0 and 1);
     • 𝑃𝑐 : the power draw of a computing core (Watt);
     • 𝑃𝑚 : the power draw of the memory (Watt);
     • 𝑃𝑈𝐸: the power usage effectiveness (efficiency coefficient) of the data centre.
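
As an illustration, Equation (1) can be evaluated with a few lines of Python. The parameter values below are placeholders loosely inspired by the defaults of the Green Algorithms calculator [19], not measurements from our experiments.

```python
def energy_kwh(t_hours, n_cores, p_core_w, core_usage, mem_gb, p_mem_w_per_gb, pue):
    """Energy consumption (kWh) following Eq. (1):
    E = t * (n_c * P_c * u_c + n_m * P_m) * PUE * 0.001."""
    return t_hours * (n_cores * p_core_w * core_usage + mem_gb * p_mem_w_per_gb) * pue * 0.001

# Placeholder example: 0.15 h on 4 cores at 10 W each with 80% usage,
# 8 GB of RAM at 0.3725 W/GB, and a PUE of 1.67  ->  roughly 0.0088 kWh.
print(energy_kwh(0.15, 4, 10.0, 0.8, 8, 0.3725, 1.67))
```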


4. Experimental Evaluation
Data. We used the following datasets for the experimental evaluation:

    • MNIST [20]: which consists of 60000 instances representing 28x28 grayscale images,
      labeled using 10 mutually exclusive classes, with 6000 images per class. The dataset is
      organized into 60000 instances as the training set and 10000 instances as the test set. The
      latter contains exactly 1000 randomly selected images from each class, while the training
      set comprises five training batches, which contain 6000 images from each class;
    • Fashion-MNIST [21]: which consists of 60000 instances representing 28x28 grayscale
      images, labeled using 10 mutually exclusive classes, with 6000 images per class. The
      dataset is organized into 60000 instances as the training set and 10000 instances as the
      test set. The authors intend Fashion-MNIST to serve as a direct drop-in replacement for
      the original MNIST dataset for benchmarking machine learning algorithms. It shares the
      same image size and structure of training and testing splits.
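
Both benchmarks can be obtained, for example, through torchvision; the following is a minimal loading sketch assuming a standard PyTorch setup (the data path and batch size are arbitrary choices, not part of the paper's setup).

```python
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # 28x28 grayscale images scaled to [0, 1]

# 60000 training and 10000 test instances for each benchmark
mnist_train = datasets.MNIST("./data", train=True, download=True, transform=transform)
mnist_test = datasets.MNIST("./data", train=False, download=True, transform=transform)
fmnist_train = datasets.FashionMNIST("./data", train=True, download=True, transform=transform)
fmnist_test = datasets.FashionMNIST("./data", train=False, download=True, transform=transform)

train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=64, shuffle=True)
```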

Baseline methods. We compared the performance of DFSMT with a classical training approach
that uses all the data available in the dataset. This allowed us to evaluate how our technique
reduces the amount of data required to achieve comparable performance to classical training,
measured in terms of model accuracy. As AL technique we utilized LCS, due to its light computational weight. However, this does not preclude the use of other techniques or their combination. More precisely, given an instance 𝑥 and a classification model 𝜃, the LCS method measures the uncertainty of 𝑥 w.r.t. 𝜃 as 𝜑(𝑥) = (1 − 𝑃𝜃 (𝑦 * |𝑥)) × 𝑚/(𝑚 − 1), where 𝑃𝜃 (𝑦 * |𝑥) denotes the probability that the model 𝜃 assigns to the label 𝑦 * for the instance 𝑥, 𝑦 * is the label for which 𝜃 yields the maximum probability on 𝑥 (i.e., 𝑦 * = arg max𝑦 𝑃𝜃 (𝑦|𝑥)), and 𝑚 is the cardinality of the set of labels. Note that the uncertainty function ranges in [0, 1], where 1 is the most uncertain score.
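
A minimal sketch of this uncertainty score, assuming the model outputs a vector of class probabilities (e.g., softmax outputs), could look as follows.

```python
import numpy as np

def lcs_uncertainty(probs):
    """Least Confidence Sampling score: (1 - P(y*|x)) * m / (m - 1),
    where y* is the most probable label and m the number of classes.
    probs is a 1-D array of class probabilities for a single instance."""
    m = len(probs)
    return (1.0 - float(np.max(probs))) * m / (m - 1)

# Example: a fairly confident 10-class prediction yields a low uncertainty (about 0.167)
print(lcs_uncertainty(np.array([0.85] + [0.15 / 9] * 9)))
```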
Settings and assessment criteria. To evaluate the effectiveness of DFSMT, we conducted
experiments on two standard image datasets just described. For each dataset, we used the
following neural networks:

       • MNIST: A CNN with two convolutional layers (10 and 20 filters, respectively), followed
         by a dropout layer and two fully connected layers (50 and 10 neurons); a sketch of this
         network is given after this list;
       • Fashion-MNIST: This CNN architecture starts with two convolutional layers, each using
         3x3 filters for local pattern extraction. Batch normalization speeds up training, and ReLU
         activations provide non-linearity. Max pooling reduces dimensionality. Fully connected
         layers then interpret the features, with dropout preventing overfitting. The final 10-neuron
         output layer corresponds to the 10-class classification task.
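
For reference, a possible PyTorch rendition of the MNIST network above is sketched below; kernel sizes, pooling, and activation placement are assumptions, since the text only specifies the number of filters and neurons.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MnistCNN(nn.Module):
    """Two convolutional layers (10 and 20 filters), dropout,
    and two fully connected layers (50 and 10 neurons)."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)   # 1x28x28 -> 10x24x24
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)  # 10x12x12 -> 20x8x8
        self.drop = nn.Dropout2d()
        self.fc1 = nn.Linear(20 * 4 * 4, 50)           # 20 feature maps of 4x4 after pooling
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.drop(self.conv2(x)), 2))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        return F.log_softmax(self.fc2(x), dim=1)       # paired with nll_loss during training
```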

   The stochastic gradient descent (SGD) [22] optimization algorithm was used to optimize the model parameters of the neural network for MNIST, chosen for its efficiency and reliability in a variety of machine learning problems. For Fashion-MNIST, the Adam [23] optimization algorithm was selected instead, due to its faster convergence and adaptability to complex datasets.
   For MNIST, the negative log-likelihood (nll_loss) loss function was used. This function is specific to multi-class classification and measures how closely the model predictions align with the ground truth labels. For Fashion-MNIST, which is also a multi-class classification problem, the cross-entropy loss (CrossEntropyLoss) function was used. This function measures the distance between two probability distributions and has been shown to be effective for classification problems with a high number of classes.
   Classical training uses the entire dataset to train the model in a single phase of 100 training epochs. This approach can be computationally expensive and require significant training time, especially for large datasets and models. Incremental training, on the other hand, adopts an iterative approach. Initially, a small subset of the dataset (1000 samples) is used to train the model; subsequently, the model is updated incrementally with new data acquired iteratively (1000 samples per step) over 10 incremental steps, each consisting of 10 training epochs. This approach can significantly reduce the training time, the energy consumption, and the amount of data required, while maintaining high model accuracy.
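
Tying the earlier sketches together, the incremental setting just described would correspond roughly to the following call; it reuses the hypothetical dfsmt, MnistCNN, lcs_uncertainty, and mnist_train sketches from above, while get_stats and train_fn remain placeholder callables, so this is illustrative wiring rather than the actual experimental code.

```python
# Hypothetical wiring of the sketches above to the incremental setting:
# 1000 initial samples, 1000 samples added per step, 10 steps, 10 epochs per step.
model = dfsmt(
    dataset=mnist_train,                    # labeled pool LS
    model=MnistCNN(),
    epochs=10,
    steps=10,
    p_s=1000,
    p_n=1000,
    al_score_fns=[lambda i, stats: lcs_uncertainty(stats[i])],  # LCS as the only AL technique
    get_stats=get_stats,                    # placeholder: per-instance predicted class probabilities
    train_fn=train_fn,                      # placeholder: SGD training with nll_loss
)
```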
   We analyzed how the behavior of DFSMT changes when varying the amount of data selected at each training step with the MNIST dataset 1 . Table 1 summarizes this analysis and Figure 1 plots the results. The table includes the amount of data selected at each step of the process, the final amount of data used at the end of training, the model's accuracy, the average CPU utilization (note that values exceeding 100% indicate multi-core usage), the processing time in milliseconds, the energy consumption (expressed in kWh) calculated using Equation 1, and a metric relating accuracy to energy efficiency (efficiency ratio) calculated as 𝑎𝑐𝑐𝑢𝑟𝑎𝑐𝑦/𝑒𝑛𝑒𝑟𝑔𝑦.
   Then we analyzed the accuracy and loss curves during both classical and incremental training.
This allowed us to monitor the model’s learning in both cases, comparing its evolution with
1
    Experiments were carried out on an Intel Core i5-8259U CPU @ 2.30GHz with 8GB of RAM and an Intel Iris Plus
    Graphics 655 GPU.
 instances per step   tot. instances   accuracy   CPU    time (ms)   energy (kWh)   efficiency ratio
         500               6000         73.18%    108%    512463        0.011           6558.32
        1000              11000         86.47%    112%    519593        0.011           7550.01
        1500              16000         89.76%    112%    559421        0.012           7268.01
        2000              21000         92.75%    114%    594989        0.013           7025.61
        2500              26000         93.58%    114%    644746        0.014           6541.62
        3000              31000         94.41%    112%    708402        0.016           6045.24
Table 1
This table shows the relationship between the number of samples selected per training step and the
corresponding changes in computer usage parameters.


[Figure 1 appears here: three plots against the number of used instances: (a) energy vs. instances, (b) accuracy vs. instances, (c) effectiveness (energy consumption relative to accuracy).]

Figure 1: Graph (a) demonstrates how energy requirements increase alongside the number of instances,
graph (b) similarly illustrates the rise in accuracy with instance quantity, and graph (c) visualizes the
relationship between these two trends.


full and reduced data sets. Accuracy is the primary metric for evaluating a model’s ability
to correctly classify images. The loss measures the model’s error in predicting labels. By
monitoring the loss during training, we can evaluate the model’s ability to learn from the data
and improve its predictions.
Results. The analysis focuses on three key aspects: computational savings, accuracy and
loss, comparing the performance of DFSMT with classical training on two datasets of varying
complexity: MNIST and Fashion-MNIST. As observed in Table 1, increasing the number of
training instances naturally leads to higher accuracy and energy consumption. Our experiments
aimed to identify the optimal parameters for maximizing the accuracy-energy consumption
relationship. We determined that the "n instances per step" parameter is the primary influencing
factor, with 1000 instances yielding the best results. Consequently, we used this parameter for
our comparative analysis against classical training. While classical training achieved slightly
higher accuracy (96.58% vs. 94.41%), its energy consumption was significantly greater (0.027
kWh vs. 0.011 kWh). This translates to a superior efficiency ratio of DFSMT of 7550.01
compared to 3642.04 with classical training. Figure 1 clearly demonstrates the differing growth
patterns of accuracy and energy consumption. While accuracy increases logarithmically, energy
consumption follows a different trajectory. This highlights the inherent trade-off between these
two metrics, emphasizing the need to carefully select parameters for the most efficient model
training.
   DFSMT demonstrated remarkable potential on the Fashion-MNIST dataset. It achieved a
significantly higher efficiency ratio (3737.75 vs. 663.19 with classical training) and drastically
reduced energy consumption (0.024 kWh vs. 0.134 kWh) while maintaining comparable accuracy
(89.21% vs. 89.62%). These results, obtained under the same settings used for MNIST, underscore
DFSMT's advantages. By comparing the accuracy trends during classical and incremental training, we
observed:

    • Classical Training: Accuracy increased gradually with the number of epochs, reaching
      a plateau towards the end of training;
    • Incremental Training: DFSMT exhibits a faster learning rate (i.e., steeper upward
      trajectory) than classical training on MNIST as the number of training examples increases.
      On Fashion-MNIST, this difference is less pronounced.

   Our analysis of accuracy and loss validates DFSMT’s ability to reduce energy consumption
in machine learning training. Even with less data, incremental training achieved comparable
accuracy to classical training, demonstrating its potential as a more efficient and sustainable
approach.

[Figure 2 appears here: accuracy vs. training epochs for classical ("Simple") and incremental (DFSMT) training on (a) MNIST and (b) Fashion-MNIST.]

Figure 2: These two graphs (one for each dataset used) compare the accuracy of classical training and
of our proposed incremental approach as the number of training epochs varies.


   Our analysis reveals that classical training converges to the optimum faster than DFSMT, as evidenced by both the loss curves and the accuracy trends. While DFSMT's loss curve initially shows slightly less stability due to the smaller amount of training data, it eventually stabilizes as the number of training instances increases.
   DFSMT stands out for its significant computational savings compared to classical training. The advantage becomes more pronounced as the size of the dataset increases.


5. Conclusion
Based on the conducted analysis, we can confidently state that DFSMT represents an efficient
and performant machine learning method for handling large datasets. The algorithm offers
significant computational savings compared to classical training, without notably sacrificing
model accuracy. The computational efficiency of DFSMT makes it a promising solution for
machine learning on resource-constrained devices, and also in the context of Green-AI, which
is becoming increasingly important due to the climate crisis. Moreover, its ability to handle
[Figure 3 appears here: loss vs. training epochs for classical ("Simple") and incremental (DFSMT) training on (a) MNIST and (b) Fashion-MNIST.]

Figure 3: These two graphs (one for each dataset used) compare the loss of classical training and of
our proposed incremental approach as the number of training epochs varies.

large datasets opens up new possibilities for the use of machine learning models in a variety of applications, with a positive impact on the efficiency and sustainability of such systems. Building upon these findings, our future research will focus on refining DFSMT by leveraging dataset-specific information, such as the labels of the instances (in contrast to a standard AL setting), and by applying optimizations to the selected data in order to keep the training subset balanced. These enhancements aim to further elevate the performance and versatility of DFSMT, fostering its broader adoption across diverse domains and reinforcing its role in advancing both efficiency and sustainability in machine learning practices.


Acknowledgement
This work was partly supported by project FAIR - Future AI Research - Spoke 9 (Directorial
Decree no. 1243, August 2nd, 2022; PE 0000013; CUP B53C22003630006), under the NRRP
(National Recovery and Resilience Plan) MUR program (Mission 4, Component 2 Investment
1.3) funded by the European Union – NextGenerationEU.


References
 [1] A. de Vries, The growing energy footprint of artificial intelligence, Joule 7 (2023) 2191–2194.
     URL: https://www.sciencedirect.com/science/article/pii/S2542435123003653. doi:https:
     //doi.org/10.1016/j.joule.2023.09.004.
 [2] E. Strubell, A. Ganesh, A. McCallum, Energy and policy considerations for deep learning in
     NLP, in: A. Korhonen, D. R. Traum, L. Màrquez (Eds.), Proceedings of the 57th Conference
     of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August
     2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 3645–
     3650. URL: https://doi.org/10.18653/v1/p19-1355. doi:10.18653/V1/P19-1355.
 [3] S. Flesca, F. Scala, E. Vocaturo, F. Zumpano, On forecasting non-renewable energy pro-
     duction with uncertainty quantification: a case study of the italian energy market, Expert
     Systems with Applications 200 (2022). URL: http://www.sciencedirect.com/science/article/
     pii/S0957417422003670. doi:http://doi.org/10.1016/j.eswa.2022.116936.
 [4] J. Wibbeke, P. Teimourzadeh Baboli, S. Rohjans, Optimal data reduction of training data in
     machine learning-based modelling: A multidimensional bin packing approach, Energies
     15 (2022). URL: https://www.mdpi.com/1996-1073/15/9/3092. doi:10.3390/en15093092.
 [5] W. Lewis, S. Eetemadi, Dramatically reducing training data size through vocabulary
     saturation, in: Proceedings of the Eighth Workshop on Statistical Machine Translation,
     WMT@ACL 2013, August 8-9, 2013, Sofia, Bulgaria, The Association for Computer Lin-
     guistics, 2013, pp. 281–291. URL: https://aclanthology.org/W13-2235/.
 [6] V. Chouvatut, W. Jindaluang, E. Boonchieng, Training set size reduction in large dataset
     problems, in: 2015 International Computer Science and Engineering Conference (ICSEC),
     2015, pp. 1–5. doi:10.1109/ICSEC.2015.7401435.
 [7] B. Settles, Active Learning Literature Survey, Technical Report, University of Wisconsin-
     Madison Department of Computer Sciences, 2009.
 [8] B. Settles, M. Craven, An analysis of active learning strategies for sequence labeling tasks,
     in: Proc. of the 2008 Conference on Empirical Methods in Natural Language Processing,
     2008, pp. 1070–1079.
 [9] S. Kee, E. del Castillo, G. Runger, Query-by-committee improvement with diversity and
     density in batch active learning, Information Sciences 454-455 (2018) 401–418. URL:
     https://www.sciencedirect.com/science/article/pii/S0020025518303700. doi:https://doi.
     org/10.1016/j.ins.2018.05.014.
[10] S. Flesca, D. Mandaglio, F. Scala, A. Tagarelli, A meta-active learning approach exploiting
     instance importance, Expert Systems with Applications 247 (2024) 123320. URL: https:
     //www.sciencedirect.com/science/article/pii/S0957417424001854. doi:https://doi.org/
     10.1016/j.eswa.2024.123320.
[11] S. Flesca, D. Mandaglio, F. Scala, A. Tagarelli, Learning to active learn by gradient vari-
     ation based on instance importance, in: 2022 26th International Conference on Pattern
     Recognition (ICPR), 2022, pp. 2224–2230. doi:10.1109/ICPR56361.2022.9956039.
[12] J. T. Ash, S. Goel, A. Krishnamurthy, S. M. Kakade, Gone fishing: Neural active learning
     with fisher embeddings, in: M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, J. W.
     Vaughan (Eds.), Advances in Neural Information Processing Systems 34: Annual Con-
     ference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14,
     2021, virtual, 2021, pp. 8927–8939. URL: https://proceedings.neurips.cc/paper/2021/hash/
     4afe044911ed2c247005912512ace23b-Abstract.html.
[13] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation
     of machine translation, in: Proceedings of the 40th annual meeting on association for
     computational linguistics, Association for Computational Linguistics, 2002, pp. 311–318.
[14] R. Koggalage, S. K. Halgamuge, Reducing the number of training samples for fast sup-
     port vector machine classification, 2004. URL: https://api.semanticscholar.org/CorpusID:
     6688904.
[15] Y. Yang, D.-W. Zhou, D.-C. Zhan, H. Xiong, Y. Jiang, Adaptive deep models for incremental
     learning: Considering capacity scalability and sustainability, in: Proceedings of the 25th
     ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD
     ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 74–82. URL:
     https://doi.org/10.1145/3292500.3330865. doi:10.1145/3292500.3330865.
[16] E. García-Martín, C. F. Rodrigues, G. Riley, H. Grahn, Estimation of energy consumption
     in machine learning, Journal of Parallel and Distributed Computing 134 (2019) 75–88.
     URL: https://www.sciencedirect.com/science/article/pii/S0743731518308773. doi:https:
     //doi.org/10.1016/j.jpdc.2019.07.007.
[17] D. A. Patterson, J. Gonzalez, Q. V. Le, C. Liang, L.-M. Munguía, D. Rothchild, D. R. So, M. Tex-
     ier, J. Dean, Carbon emissions and large neural network training, ArXiv abs/2104.10350
     (2021). URL: https://api.semanticscholar.org/CorpusID:233324338.
[18] J. Xu, W. Zhou, Z. Fu, H. Zhou, L. Li, A survey on green deep learning, ArXiv abs/2111.05193
     (2021). URL: https://api.semanticscholar.org/CorpusID:243861089.
[19] L. Lannelongue, J. Grealey, M. Inouye, Green algorithms: Quantifying the carbon footprint
     of computation, Advanced Science 8 (2021) 2100707.
[20] L. Deng, The mnist database of handwritten digit images for machine learning research,
     IEEE Signal Processing Magazine 29 (2012) 141–142.
[21] H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking
     machine learning algorithms, CoRR abs/1708.07747 (2017). URL: http://arxiv.org/abs/1708.
     07747. arXiv:1708.07747.
[22] S. Ruder, An overview of gradient descent optimization algorithms, 2017.
     arXiv:1609.04747.
[23] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, 2017. arXiv:1412.6980.