<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Layer-wise Quantization in Green-Aware AI</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Farwa Ikram</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dipanwita Thakur</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonella Guzzo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giancarlo Fortino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Calabria</institution>
          ,
          <addr-line>Arcavacata di Rende</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Deep neural networks (DNNs) have become integral to many artificial intelligence (AI) applications due to their superior performance across a wide range of tasks. However, their deployment on edge devices is limited by high computational and energy demands, which also increase carbon emissions. Quantization, particularly layer-wise quantization, has emerged as a promising technique to mitigate these limitations by reducing numerical precision and computational overhead. In this paper, we present a comprehensive study of the effects of several layer-wise quantization strategies on the energy efficiency and environmental impact of DNNs, with a specific focus on green-aware AI. We examine uniform, ascending, and descending quantization schemes to understand how per-layer bit-width assignments can be optimized for minimal energy usage while preserving model accuracy. Our analysis includes a quantitative assessment of energy consumption and the associated CO<sub>2</sub> emissions. Experiments conducted on standard benchmarks, with ResNet18 trained on CIFAR-10 and CIFAR-100, demonstrate that our method achieves up to a 45% reduction in energy consumption and memory usage while keeping accuracy competitive, offering a practical solution for sustainable AI deployment.</p>
      </abstract>
      <kwd-group>
        <kwd>Green AI</kwd>
        <kwd>Layer-wise quantization</kwd>
        <kwd>Energy Efficiency</kwd>
        <kwd>Carbon Emission</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Deep neural networks (DNNs) have revolutionized the field of machine learning, enabling breakthroughs
in computer vision [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], natural language processing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and autonomous systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Despite their
success, the computational intensity and memory footprint of DNNs have raised significant concerns,
especially for deployment on resource-constrained edge devices such as mobile phones, embedded
systems, and IoT devices [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Moreover, the environmental cost of training and deploying large-scale
DNNs, including energy consumption and CO<sub>2</sub> emissions, has drawn increasing scrutiny from both
academia and industry in light of the Sustainable Development Goals (SDGs) defined by the United Nations
[
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
      <p>
        Quantization is a well-established technique for reducing the complexity of DNNs by lowering the
precision of weights and activations [
        <xref ref-type="bibr" rid="ref2 ref8">8, 2</xref>
        ]. While traditional quantization approaches typically apply
a uniform bit-width across all layers [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], recent research suggests that a more nuanced, layer-wise
strategy may better balance accuracy and efficiency. This paper explores the design and impact of several
layer-wise quantization schemes, with a particular emphasis on green-aware computing, i.e., reducing
the environmental impact of AI systems without sacrificing performance.
      </p>
      <p>We investigate three layer-wise quantization strategies: uniform quantization, ascending quantization
(lower to higher bit-width from shallow to deeper layers), and descending quantization (higher to
lower bit-width). Through empirical analysis and benchmarking, we evaluate their effects on model
performance, memory usage, energy consumption, and CO<sub>2</sub> emissions. Our work provides new insights
into how intelligent bit-width allocation can yield more sustainable AI solutions.</p>
      <sec id="sec-1-1">
        <title>1.1. Motivation</title>
        <p>
          The motivation of this work stems from the growing need to reconcile the performance of deep learning
models with environmental and deployment constraints. Many real-world applications require the
deployment of DNNs on edge devices, which are limited in terms of computational power, memory,
and battery life. Reducing the energy and memory demands of models is essential for feasible and
responsive edge deployment. Training and deploying DNNs have a non-trivial environmental footprint.
Studies show that model inference at scale contributes significantly to carbon emissions. Developing
quantization strategies that minimize energy use can directly contribute to reducing this impact. While
uniform quantization offers simplicity, it may not be optimal. Different layers in a network have
varying sensitivity to quantization, suggesting that intelligent, layer-specific strategies could yield
better trade-offs between efficiency and accuracy [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Previous work has not sufficiently explored the
impact of different layer-wise quantization strategies on green metrics such as energy consumption and
CO<sub>2</sub> emissions. This study fills that gap by providing a systematic comparison and practical insights.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Contribution</title>
        <p>This paper makes the following key contributions:
          <list list-type="bullet">
            <list-item><p>We systematically evaluate three distinct layer-wise quantization approaches, namely uniform, ascending, and descending, analyzing their trade-offs in terms of energy efficiency, memory usage, and classification accuracy. We also analyze the energy and environmental impact of layer-wise quantization strategies in DNNs, focusing on reducing CO<sub>2</sub> emissions while maintaining model performance.</p></list-item>
            <list-item><p>We present a quantitative assessment of energy savings and corresponding reductions in CO<sub>2</sub> emissions resulting from each quantization strategy, enabling a clearer understanding of their environmental benefits.</p></list-item>
            <list-item><p>Extensive experiments on two different flavors of ResNet18 using CIFAR-10 and CIFAR-100 demonstrate that our proposed strategies can reduce energy consumption and memory footprint by up to 45% while keeping accuracy competitive.</p></list-item>
          </list>
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>To achieve the objectives of Green-Aware AI, extensive work has been conducted on layer-wise
quantization, which can reduce communication and computing overhead. This method controls
precision per layer, which allows fine-tuning of the trade-offs between memory access, size, and
inference cost, and can significantly reduce the environmental impact of deep learning models
with little or no decline in performance. In [21], the authors proposed a layer-wise quantization method
that assigns different bit-widths to layers based on their importance, using two scoring metrics, one
measuring changes in input-output representations and another based on weight distribution. Applied with
techniques like GPTQ and Quanto, this approach preserved over 90% accuracy even when compressing
models like LLama-13B to an average of 3.25 bits, outperforming uniform quantization and moderate
pruning in both classification and reasoning tasks. In [17], a lightweight multilayer perceptron (MLP)
model was introduced for speech emotion recognition (SER) with a layer-wise adaptive quantization
(LAQ) scheme to compress the model while maintaining accuracy. Their approach assigns
different bit-widths to layers using scores based on parameter proportion, entropy, and weight variance.
In another work [13], Gluska and Grobman proposed a method to identify which layers of a neural
network contribute the most to quantization errors by quantizing one layer at a time. Their analysis
showed that performance loss is often caused by a small number of layers. Using this insight, they
applied targeted fixes such as clipping only the most problematic layer, which improved the accuracy
of ResNet26 from 74.28% to 75.91%, outperforming global quantization techniques.</p>
      <table-wrap>
        <table>
          <thead>
            <tr>
              <th>Model</th>
              <th>Method</th>
              <th>Bit Allocation Strategy</th>
              <th>Layer-wise Quantization Basis</th>
              <th>Results</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>VGG-16, ResNet18</td><td>RL-based Mixed-Precision Quantization</td><td>Descending</td><td>Parameter count per layer</td><td>11× compression, 93.5% accuracy vs 93.6%</td></tr>
            <tr><td>LLama2</td><td>Quantization Error Propagation (QEP)</td><td>Descending</td><td>Error sensitivity across layers</td><td>Accuracy improved from 67.9% to 70.1% with 2-bit GPTQ</td></tr>
            <tr><td>LLama-13B</td><td>Layer-wise Quantization via Importance Scores</td><td>Mixed precision per layer</td><td>Layer importance via input-output changes and weight distribution</td><td>Over 90% accuracy at 3.25 avg bits</td></tr>
            <tr><td>ResNeXt26</td><td>Layer-wise Error Analysis</td><td>Fixed</td><td>Empirical error sensitivity (quantize one layer at a time)</td><td>Accuracy improved from 74.28% to 75.91%</td></tr>
            <tr><td>ImageNet</td><td>LTNN (Tensorized Decomposition)</td><td>Mixed</td><td>Tensor decomposition structure</td><td>Up to 64× compression, 13.95% top-5 error at 35.84×</td></tr>
            <tr><td>WGANs</td><td>Adaptive Layer-wise Quantization</td><td>Adaptive</td><td>Gradient noise model and convergence behavior</td><td>Up to 47% faster training as compared to Q-GenX</td></tr>
            <tr><td>VGG19, ResNet18/34</td><td>Hybrid Quantization and Pruning</td><td>Descending</td><td>Statistical metrics (entropy, variance, sparsity)</td><td>91% accuracy, avg. 1.08 bit-width, up to 81% sparsity</td></tr>
            <tr><td>TESS, EMODB, SAVEE</td><td>Layer-wise Adaptive Quantization for SER</td><td>Descending</td><td>Parameter proportion, entropy, weight variance</td><td>Up to 99.29% accuracy</td></tr>
            <tr><td>LLaMA models</td><td>SensiBoost, KurtBoost</td><td>Descending</td><td>Activation sensitivity and kurtosis</td><td>Up to 9% lower perplexity with only 2% more memory</td></tr>
            <tr><td>ResNet18</td><td>Edge-MPQ with RISC-V</td><td>Mixed precision</td><td>Hardware constraints and training sensitivity</td><td>47.67× speedup, 6.7% higher PTQ accuracy</td></tr>
            <tr><td>ImageNet (AlexNet)</td><td>Entropy-based layer-wise Quantization</td><td>Descending</td><td>Weight and activation entropy</td><td>90.28% accuracy, 45.64% top-1 (ImageNet)</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        Similarly, in [18], the authors proposed a layer-sensitive quantization strategy for large language models that selectively
assigns higher precision to layers identified as more sensitive to quantization errors. They developed
SensiBoost and KurtBoost to guide memory allocation at the layer level. These methods improved
quantization accuracy while staying within a modest memory budget. For instance, SensiBoost reduced
perplexity by up to 9% on LLaMA models with only a 2% increase in memory. In [19], the authors presented a novel
layer-wise mixed-precision quantization (MPQ) framework tailored for edge computing that combines both
efficient inference hardware and novel quantization search algorithms. Their design integrates versatile
inference units supporting INT2 to INT16 directly into a RISC-V processor pipeline, achieving a speedup
of up to 47.67× and an energy efficiency of 20.51 TOPS/W over the baseline RV64IMA core. The authors
in [20] presented an adaptive layer-wise quantization method that assigns bit-widths to each layer
based on its importance, measured using the entropy of weights and activations. To stabilize the activation
distribution, they applied L2 regularization during training, enabling more effective compression. Their
approach outperformed fixed-bit quantization methods in both accuracy and compression efficiency.
On CIFAR-10, they achieved 90.28% accuracy with 27.5× compression for weights, and 86.14% accuracy
with joint weight-activation quantization. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the authors proposed a reinforcement learning-based
framework to perform mixed-precision quantization across deep neural network layers. Instead of
applying a fixed bit-width uniformly across all layers, their method leverages the parameter count of
each layer to guide the bit allocation process. By using Deep Q-Learning, the system learns to assign
different bit-widths to each layer in a way that balances accuracy with model size reduction. For example,
VGG-16 on CIFAR-10 was compressed from 59.91MB to 5.57MB with only a 0.1% drop in accuracy (93.6%
to 93.5%). A novel approach called quantized optimistic dual averaging (QODA) was proposed in
[15] for distributed variational inequality problems. Their approach provides theoretical guarantees on
both quantization error and communication cost, and generalizes previous global quantization results.
When applied to training a WGAN on CIFAR-10 and CIFAR-100 across 4 GPUs, their method not only
provided better accuracy than the Q-GenX baseline but also accelerated the training process by up to 47%,
which shows its theoretical and empirical advantages in distributed optimization. A hybrid approach
of quantization and pruning, adapted to the statistical significance of each neural network
layer, was proposed in [16]. The method adaptively optimizes bit-width and pruning thresholds on a
per-layer basis using measures such as entropy, variance, and sparsity to minimize model size without
sacrificing accuracy. For VGG19, ResNet18, and ResNet34 on CIFAR-10, the method maintains accuracy
over 91%, achieves an average bit-width as low as 1.08, and induces parameter sparsity of up to 81%. In [14],
a compression technique known as LTNN was developed, which reshapes neural network weights into
high-dimensional tensors and applies low-rank tensor train decomposition in a layer-wise
training scheme to reduce model size while preserving accuracy. Additionally,
they suggested quantizing the tensor cores for further compression. On MNIST and CIFAR-10 benchmarks,
LTNN can outperform state-of-the-art methods with compression and 21.57 x 2.2 loss compression. On
ImageNet, they reported 8.8× compression with 13.15% top-5 error, which increased to 13.95% error at
35.84× compression with quantization. We summarize the proposed layer-wise quantization methods
in Table 4.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Quantization plays a crucial role in the efficient deployment of machine learning (ML) models and
has consequently become a prominent focus of current research. One widely recognized technique
for model compression that effectively reduces computational complexity and storage requirements is
model quantization [22]. This approach transforms high-precision floating-point numbers,
such as 32-bit representations, into lower-precision integers, including formats like 8-bit or 4-bit, and
even smaller sizes. While quantization provides significant benefits in terms of computational efficiency
and reduced storage demand, it also presents challenges related to precision loss [23]. The sensitivity of
the layers within a model to quantization can differ significantly, and conventional fixed-precision
quantization methods may adversely affect the performance of critical layers, thereby impacting the
model’s overall inference capabilities. To mitigate these issues, researchers have been developing
mixed-precision quantization techniques. These methods strategically allocate different bit lengths to
individual layers, striking a balance between resource utilization and model performance and ultimately
enhancing the effectiveness of the quantization process. We analyze the impact of fixed-bit layer-wise
quantization, as well as of assigning different levels of precision to individual layers of a neural network, which allows
for a more fine-grained trade-off between model accuracy and computational efficiency. A graphical
representation of this methodology is shown in Figure 1.</p>
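      <p>The mapping from floating-point values to low-bit integers described above can be illustrated with a minimal sketch. It assumes symmetric, per-tensor uniform quantization, which is one common choice and not necessarily the quantizer used in our experiments; the function names and the example weights are hypothetical.</p>
      <preformat>
```python
def quantize_uniform(values, bits):
    """Symmetric uniform quantization of floats to signed integers of `bits` bits.

    Returns (codes, scale) such that each value is approximately code * scale.
    Assumption: one scale per tensor, derived from the largest magnitude.
    """
    if bits not in range(2, 33):
        raise ValueError("bits must be between 2 and 32")
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def mean_abs_error(values, bits):
    """Average absolute difference between values and their quantized form."""
    codes, scale = quantize_uniform(values, bits)
    restored = dequantize(codes, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)


# Hypothetical weights, used only to illustrate the precision loss.
weights = [0.82, -0.41, 0.05, -0.97, 0.33, 0.61, -0.12, 0.0]
for bits in (8, 4, 2):
    print(bits, "bits, mean absolute error:", mean_abs_error(weights, bits))
```
      </preformat>
      <p>For these example weights the reconstruction error grows as the bit-width shrinks, which is the precision loss that motivates assigning bit-widths per layer rather than globally.</p>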
      <sec id="sec-3-1">
        <title>3.1. Layer-wise Quantization</title>
        <p>Layer-wise quantization is a model compression approach that quantizes different layers of a neural
network with different numerical precisions. Rather than applying an equal bit-width policy across
all layers, our solution uses selective quantization, in which different layers are quantized in a
fine-grained manner that reflects the importance of each layer’s contribution to the overall model’s
performance. Less important or more robust layers can usually be quantized to a smaller number of bits
(2-bit or 3-bit), while sensitive layers are kept at higher precision to avoid loss in accuracy.</p>
        <sec id="sec-3-1-1">
          <title>Algorithm 1 Layer-wise Quantization Strategy</title>
          <p>This technique optimizes the trade-off between computational cost and model quality by focusing quantization on the
most beneficial places. For example, in convolutional networks, the first or middle layers can tolerate
more aggressive quantization because they are naturally redundant in feature extraction,
while the later layers lie close to the decision boundary and therefore require higher precision for
classification accuracy. Quantization is applied layer by layer. In fixed-bit quantization,
all of the selected layers use the same bit-width. In addition to fixed-bit quantization, existing adaptive quantization techniques
employ ascending-trend quantization, where the quantization level rises as training progresses, and descending-trend
quantization, where the quantization level decreases as training progresses.
Algorithm 1 shows the steps of the layer-wise quantization method for a neural
network <italic>N</italic> consisting of <italic>L</italic> layers, arranged in topological order as <italic>l</italic><sub>1</sub>, <italic>l</italic><sub>2</sub>, . . . , <italic>l</italic><sub><italic>L</italic></sub>. Each layer typically
performs a standard operation such as convolution, fully connected computation, or pooling. In this
method, three specific layers, denoted as <italic>q</italic><sub>1</sub>, <italic>q</italic><sub>2</sub>, and <italic>q</italic><sub>3</sub>, are selected for quantization, while the
remaining layers are retained in full precision. Lines 5-9 explore three configurations: (1) ascending
quantization, where layers are quantized using increasing precision levels; (2) descending quantization,
where decreasing precision is applied across layers; and (3) fixed-bit quantization, in which all selected
layers are quantized using the same bit-width. Lines 13-14 apply quantization
to the weights and to the input/output activations of each quantized layer. Algorithm 1 continues to run until either it
converges or satisfies a predefined condition, such as achieving 95% accuracy.</p>
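          <p>The three configurations can be sketched as a bit-width assignment over the selected layers. This is an illustrative sketch and not a transcription of Algorithm 1: the helper name, the 32-bit full-precision default, and the choice of the lowest level for the fixed-bit case are assumptions, while the levels 2, 3, and 4 follow the bit-widths used in this study.</p>
          <preformat>
```python
FULL_PRECISION_BITS = 32  # assumption: unselected layers stay in 32-bit floating point


def assign_bit_widths(num_layers, selected, strategy, levels=(2, 3, 4)):
    """Return one bit-width per layer for the given strategy.

    `selected` lists the indices of the layers to quantize; they are
    processed from shallow to deep. All other layers keep full precision.
    """
    if len(selected) != len(levels):
        raise ValueError("need one bit level per selected layer")
    if strategy == "ascending":
        chosen = sorted(levels)                 # low bits first, high bits last
    elif strategy == "descending":
        chosen = sorted(levels, reverse=True)   # high bits first, low bits last
    elif strategy == "fixed":
        chosen = [min(levels)] * len(levels)    # assumption: use the lowest level everywhere
    else:
        raise ValueError("unknown strategy: " + strategy)
    plan = [FULL_PRECISION_BITS] * num_layers
    for index, bits in zip(sorted(selected), chosen):
        plan[index] = bits
    return plan


# Hypothetical 5-layer network in which layers 1, 2, and 3 are quantized.
for name in ("ascending", "descending", "fixed"):
    print(name, assign_bit_widths(5, [1, 2, 3], name))
```
          </preformat>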
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Customized ResNet18 for CIFAR-10</title>
        <p>
          ResNet18 [24] is one of the popular DNN-based models for image classification. The architecture of
the ResNet18 variant used for image classification, known as ResNet-CIFAR, is given in Table 2. The
ResNet-CIFAR architecture begins with an initial convolutional layer (Conv1) that applies a 3×3 convolution to
the input RGB image, increasing the channel depth from 3 to 16. This is followed by batch normalization
and a ReLU activation function, preserving the spatial resolution of 32×32. The first major stage, Layer1,
consists of multiple residual blocks (as specified by layers[0]) and operates at a constant resolution of
32×32, maintaining 16 output channels. Layer2 introduces spatial downsampling via a stride of 2 in the
first residual block, reducing the feature map size to 16×16 and increasing the number of channels to 32;
it contains layers[1] residual blocks. Similarly, Layer3 performs another downsampling step, reducing
the resolution to 8×8 while increasing the channel count to 64, with layers[2] residual blocks. Following
these stages, a global average pooling layer aggregates each 8×8 feature map into a single scalar value,
resulting in a 64-dimensional feature vector. Finally, a fully connected (FC) layer maps this vector to
the desired number of output classes, producing the final classification logits. Notably, each residual
layer group can employ a different quantization bit-width, allowing for layer-wise precision control
that balances accuracy and efficiency. The three layer groups, each consisting of several residual blocks, are quantized using
equal, ascending, and descending bit precision.
        </p>
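        <p>To make the effect of per-group bit-widths on storage concrete, the following sketch estimates the weight memory of the architecture described above. Several simplifications are assumed: three residual blocks per group (the values of layers[0], layers[1], and layers[2] are hypothetical here), two 3×3 convolutions per block, no shortcut projections or batch-normalization parameters, 10 output classes, and 32-bit storage for Conv1 and the FC layer. The numbers it produces are estimates under these assumptions, not measurements from our experiments.</p>
        <preformat>
```python
def group_conv_params(in_ch, out_ch, blocks):
    """Count 3x3 convolution weights in one residual group (two convs per block)."""
    first_block = 9 * in_ch * out_ch + 9 * out_ch * out_ch
    other_blocks = (blocks - 1) * 2 * 9 * out_ch * out_ch
    return first_block + other_blocks


def weight_memory_bytes(blocks_per_group, bits_per_group, full_bits=32, classes=10):
    """Estimate weight storage when each residual group uses its own bit-width."""
    conv1_params = 9 * 3 * 16            # 3x3 convolution, 3 input and 16 output channels
    fc_params = 64 * classes + classes   # weights plus biases of the final FC layer
    total_bits = (conv1_params + fc_params) * full_bits
    in_ch = 16
    for out_ch, blocks, bits in zip((16, 32, 64), blocks_per_group, bits_per_group):
        total_bits += group_conv_params(in_ch, out_ch, blocks) * bits
        in_ch = out_ch
    return total_bits / 8


blocks = [3, 3, 3]  # hypothetical block counts for Layer1, Layer2, Layer3
schemes = {
    "full precision": (32, 32, 32),
    "equal": (2, 2, 2),
    "ascending": (2, 3, 4),
    "descending": (4, 3, 2),
}
for name, bits in schemes.items():
    print(name, weight_memory_bytes(blocks, bits), "bytes")
```
        </preformat>
        <p>Under these assumptions Layer3 holds most of the convolutional weights, so the ascending scheme, which gives Layer3 the highest bit-width, stores more bits than the descending scheme.</p>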
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>We conducted layer-wise quantization on the CIFAR-10 and CIFAR-100 [25] datasets, training for 400
epochs with a learning rate of 0.1. CIFAR-10 and CIFAR-100 each consist of 60,000 RGB images of
size 32×32 pixels. However, CIFAR-10 is categorized into 10 classes, while CIFAR-100 includes a more
fine-grained classification across 100 distinct classes. This configuration was aimed at evaluating how
well a ResNet18 model performs under the layer-wise quantization technique in terms of
model efficiency, time, loss, and accuracy. The network was optimized using the SGD optimizer with
a momentum of 0.9. We also conducted experiments using the full-precision model to analyze its
energy consumption, enabling a comparative assessment of the efficiency gains achieved through
quantization. This quantization phase proved to be a key factor in decreasing the energy cost of the
model’s execution. We used the CodeCarbon software package [26] to track the energy use
and corresponding carbon emissions associated with the quantization technique. The experiments were
performed on an Intel(R) Xeon(R) W-2265 CPU at 3.50GHz, running a Linux operating system with 76
cores and supporting Metal 3, the latest graphics application programming interface (API) designed to
boost graphics rendering on the platform.</p>
      <p>[Figure 2: (a) Train Accuracy, (b) Train Loss, (c) Test Accuracy, (d) Test Loss]</p>
      <p>[Figure 3: (a) Train Accuracy, (b) Train Loss, (c) Test Accuracy, (d) Test Loss]</p>
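      <p>Energy trackers such as CodeCarbon estimate emissions by multiplying the measured energy use by the carbon intensity of the electricity supply. The sketch below shows only this arithmetic, together with the relative saving of a quantized run over a full-precision baseline. It does not call CodeCarbon, and the energy values and the carbon intensity are hypothetical placeholders rather than measurements from our experiments.</p>
      <preformat>
```python
def emissions_kg(energy_kwh, carbon_intensity_g_per_kwh):
    """CO2-equivalent emissions in kilograms for a given energy use."""
    return energy_kwh * carbon_intensity_g_per_kwh / 1000.0


def relative_saving(baseline, quantized):
    """Fractional reduction of `quantized` relative to `baseline`."""
    return (baseline - quantized) / baseline


# Hypothetical values, chosen only to illustrate the calculation.
CARBON_INTENSITY = 300.0   # grams of CO2-equivalent per kWh (assumed)
baseline_kwh = 1.0         # full-precision training run (assumed)
quantized_kwh = 0.6        # quantized training run (assumed)

baseline_kg = emissions_kg(baseline_kwh, CARBON_INTENSITY)
quantized_kg = emissions_kg(quantized_kwh, CARBON_INTENSITY)
print("baseline emissions (kg):", baseline_kg)
print("quantized emissions (kg):", quantized_kg)
print("energy saving (fraction):", relative_saving(baseline_kwh, quantized_kwh))
```
      </preformat>
      <p>Because the carbon intensity is a common factor, the relative saving in emissions equals the relative saving in energy whenever both runs draw from the same electricity supply.</p>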
      <sec id="sec-4-1">
        <title>4.1. Experimental Results</title>
        <p>Table 3 presents a comparative analysis of different quantization strategies on CIFAR-10 and CIFAR-100
using ResNet18. Full precision achieves the highest accuracy but incurs the highest energy consumption
and carbon emissions. Among quantized approaches, the ascending bit-width strategy consistently
outperforms descending and fixed-bit configurations in both accuracy and loss, indicating that allocating
higher precision to deeper layers improves performance. To ensure robustness, each experiment was
conducted for 400 rounds. Performance is reported as mean and standard deviation,
computed with standard statistical functions. These values reflect the variability observed within a
single training run, based on the measurements collected throughout the training process. In particular,
quantized methods significantly reduce energy and emissions, highlighting their potential for sustainable
deployment, while maintaining competitive accuracy, especially on the less complex CIFAR-10 dataset.
Figures 2 and 3 illustrate the training and testing accuracy and loss trends over communication rounds
for CIFAR-10 and CIFAR-100, respectively. In CIFAR-10 (Figure 2), all quantization schemes converge
stably, with full precision performing best, and ascending quantization closely following. Fixed-bit
and descending strategies show slightly degraded accuracy and slower convergence. In CIFAR-100
(Figure 3), the performance gap widens, indicating greater sensitivity to quantization due to increased
task complexity. Notably, ascending quantization again outperforms fixed and descending bit schemes,
supporting the hypothesis that allocating higher precision to deeper layers preserves semantic fidelity
and enhances learning stability.</p>
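        <p>As a small illustration of how a mean and standard deviation can be derived from the measurements of a single run, the sketch below uses Python's standard <monospace>statistics</monospace> module. The per-epoch accuracies are hypothetical, and the use of the sample standard deviation is an assumption.</p>
        <preformat>
```python
from statistics import mean, stdev


def summarize(values):
    """Return (mean, sample standard deviation) of a sequence of measurements."""
    return mean(values), stdev(values)


# Hypothetical per-epoch test accuracies (percent) from one training run.
epoch_accuracy = [88.4, 89.1, 90.2, 90.0, 90.6]
avg, spread = summarize(epoch_accuracy)
print("accuracy: %.2f +/- %.2f" % (avg, spread))
```
        </preformat>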
        <p>Overall, the results demonstrate that although quantization techniques such as ascending, descending,
and fixed-bit quantization may slightly reduce model accuracy, they provide significant advantages in
energy efficiency, reduced communication overhead, and faster convergence. These benefits make them
especially useful for neural network applications, where computational and resource efficiency are
essential. These findings show the trade-off between accuracy and environmental impact, illustrating
quantized training as a more energy-efficient alternative to full-precision
training approaches. Although they are less accurate, these methods are far more energy efficient, and
thus reasonable techniques to train resource-efficient models with low performance degradation.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitation</title>
      <p>Despite the benefits that layer-wise quantization offers, one of its major drawbacks is the loss
of numeric accuracy, which can cause errors, especially at very low bit-widths [18]. The errors
introduced in each layer accumulate, and by the end of the process the overall performance of the
model is degraded [27]. This degradation is significant in the case of deep networks. Moreover, due
to the complexity of the network structure and the presence of different types of data patterns,
layer-wise quantization techniques may not work effectively in some scenarios, which in turn results in
non-optimal quantization settings [11]. Finding the optimal bit-width configuration for each layer in a model
is often difficult and time-consuming, especially for large models. A brute-force search is frequently
impractical because the number of combinations is large [28]. Moreover, post-training quantization (PTQ) is typically faster than quantization-aware training (QAT),
although some layer-wise variants can be costly to compute. For instance, it is infeasible to
search for the best bit-width of each layer by brute force in large networks with many
layers [11]. Another problem in layer-wise quantization is over-fitting to the training dataset,
which might hinder the ability of the trained model to perform well on new, unseen
data. Furthermore, when considering different bit-widths per layer (mixed precision), the optimization
problem is further exacerbated by an even larger search space [29]. Even though layer-wise quantization
is a helpful technique for model compression, its possible drawbacks should also be considered, and a
balance should be maintained among loss of accuracy, error propagation, and optimization level.</p>
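      <p>The size of the search space mentioned above can be made concrete: with <italic>b</italic> candidate bit-widths and <italic>n</italic> quantized layers there are <italic>b</italic><sup><italic>n</italic></sup> possible configurations. The following sketch enumerates the small case of three layers and counts a larger one; the 18-layer figure is only an illustrative choice.</p>
      <preformat>
```python
from itertools import product


def count_configurations(num_layers, bit_choices):
    """Number of distinct per-layer bit-width assignments."""
    return len(bit_choices) ** num_layers


def enumerate_configurations(num_layers, bit_choices):
    """List every assignment explicitly; feasible only for small networks."""
    return list(product(bit_choices, repeat=num_layers))


bit_choices = (2, 3, 4)
small = enumerate_configurations(3, bit_choices)
print("3 layers:", len(small), "configurations, e.g.", small[0], "and", small[-1])
print("18 layers:", count_configurations(18, bit_choices), "configurations")
```
      </preformat>
      <p>Each configuration would need its own training or evaluation run, which is why exhaustive search quickly becomes impractical as depth grows.</p>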
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>In layer-wise quantization of ResNet18, different bit-widths are applied to different layers of the model
to reduce communication overhead or memory usage, especially in on-device inference. When focusing
on three specific layers, the choice of bit-width per layer can significantly impact model accuracy due to
quantization error, as well as communication cost, computational efficiency, and energy consumption. In our
work, these are layer1, layer2, and layer3, each corresponding to a group of residual blocks and to progressively
deeper levels of feature extraction. The effect of equal quantization with 2, 2, and 2 bits in the
three layers is uniform compression across all of them. An extremely low bit-width (2 bits)
leads to high quantization error, especially in deeper or more critical layers that typically require more
representational power. This may result in significant accuracy degradation, especially if the model is
sensitive to weight precision. However, the advantages are minimal communication and storage cost and
simplicity of implementation.</p>
      <p>According to Table 4, the features of layer 1 are edges and textures. Hence, layer 1 is less sensitive
to quantization, even at low bit-widths, and has a low risk of significant accuracy degradation
[30]. Layer 2 captures combinations of low-level features. Hence, it is moderately sensitive to quantization and
important for preserving spatial coherence and abstraction [31]. Layer 3 features are abstract object
parts that are critical for classification. These features are highly sensitive to quantization errors, and a
low bit-width here can severely hurt model performance. In equal quantization, layer 3 suffers the most,
with reduced capacity to capture high-level semantics [32]. Therefore, equal quantization gives the
best compression and minimal bandwidth and energy usage, but the expected accuracy is significantly lower.
Equal quantization can nevertheless be used in ultra-low-resource systems where accuracy is not the
primary concern. Ascending quantization assigns lower bit-widths to layer 1, which is the least critical, and higher bit-widths to
layer 3, which is the most critical. It balances compression and accuracy. Compared with equal quantization,
it slightly increases communication and computation cost. Its expected accuracy is the closest to
the full-precision baseline among the three. It is also more suitable for federated learning (FL), where both efficiency and
accuracy matter. Descending quantization prioritizes precision in layer 1 over layer 3. It preserves the
fidelity of low-level features. However, the layer 3 quantization error degrades the quality of the
final decision. Its accuracy is better than that of equal quantization but worse than that of ascending quantization. It is suitable for legacy systems
using partial inference offloading, but not ideal in most cases.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>In this study, we analyzed the impact of layer-wise quantization on the performance of ResNet18
by targeting three core layers, each representing progressively deeper levels of feature abstraction.
Through the application of equal, ascending, and descending quantization strategies with bit-widths of
2, 3, and 4, we demonstrated that the sensitivity to quantization is not uniform across layers. Specifically,
lower layers (e.g., layer 1), which extract low-level features such as edges and textures, exhibit higher
redundancy and tolerate aggressive quantization (e.g., 2-bit) with minimal accuracy degradation. In
contrast, deeper layers like layer 3, responsible for high-level semantic representations, are significantly
more sensitive to quantization errors. Our findings affirm that ascending quantization (e.g., 2, 3, 4
bits from shallow to deep layers) strikes the best trade-off between compression efficiency and model
performance. The insights derived from this work highlight the importance of precision allocation in
layer-wise quantization. By aligning quantization precision with the semantic importance of each layer,
one can optimize both resource utilization and inference accuracy, which is especially beneficial in
resource-constrained environments such as edge devices and federated learning setups.</p>
      <p>Future work will extend this study in several directions: dynamic quantization schemes, where bit-widths
are adjusted in real time based on layer-wise gradient statistics or activation distributions;
task-aware quantization, which tailors bit-widths to the sensitivity of layers for specific downstream
tasks (e.g., detection vs. classification); energy-aware quantization frameworks that integrate energy
profiling into the quantization decision process to further enhance the efficiency of on-device AI;
cross-layer optimization techniques, such as joint pruning and quantization, to holistically reduce
computational overhead while maintaining performance; and adaptive training strategies that co-train
bit-widths and weights to minimize quantization-induced degradation during training. By addressing
these future directions, the community can move closer to achieving scalable, high-performance neural
networks suitable for deployment in low-power, latency-sensitive applications.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work contributes to the basic research activities of the PNRR project FAIR - Future AI Research
(PE00000013), Spoke 9 - Green-aware AI, under the NRRP MUR program funded by
NextGenerationEU.</p>
      <p>[11] Y. Arai, Y. Ichikawa, Quantization error propagation: Revisiting layer-wise post-training quantization, arXiv preprint arXiv:2504.09629 (2025).</p>
      <p>[12] R.-G. Dumitru, V. Yadav, R. Maheshwary, P. I. Clotan, S. T. Madhusudhan, M. Surdeanu, Layerwise quantization: A pragmatic and effective method for quantizing LLMs, 2024. URL: https://openreview.net/forum?id=eJVrwDE086.</p>
      <p>[13] S. Gluska, M. Grobman, Exploring neural networks quantization via layer-wise quantization analysis, arXiv preprint arXiv:2012.08420 (2020).</p>
      <p>[14] H. Huang, H. Yu, LTNN: A layerwise tensorized compression of multilayer neural network, IEEE Transactions on Neural Networks and Learning Systems 30 (2018) 1497–1511.</p>
      <p>[15] A. D. Nguyen, I. Markov, A. Ramezani-Kebrya, K. Antonakopoulos, D. Alistarh, V. Cevher, Layerwise quantization for distributed variational inequalities, in: Workshop on Machine Learning and Compression, NeurIPS 2024, 2024.</p>
      <p>[16] T. Shinde, Adaptive quantization and pruning of deep neural networks via layer importance estimation, in: Workshop on Machine Learning and Compression, NeurIPS 2024, 2024. URL: https://openreview.net/forum?id=kf6x9RCvHf.</p>
      <p>[17] T. Shinde, R. Jain, A. K. Sharma, Lightweight neural networks for speech emotion recognition using layer-wise adaptive quantization, J. Name (2025).</p>
      <p>[18] F. Zhang, Y. Liu, W. Li, J. Lv, X. Wang, Q. Bai, Towards superior quantization accuracy: A layer-sensitive approach, arXiv preprint arXiv:2503.06518 (2025).</p>
      <p>[19] X. Zhao, R. Xu, Y. Gao, V. Verma, M. R. Stan, X. Guo, Edge-MPQ: Layer-wise mixed-precision quantization with tightly integrated versatile inference units for edge computing, IEEE Transactions on Computers (2024).</p>
      <p>[20] X. Zhu, W. Zhou, H. Li, Adaptive layerwise quantization for deep neural network compression, in: 2018 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2018, pp. 1–6.</p>
      <p>[21] R.-G. Dumitru, V. Yadav, R. Maheshwary, P.-I. Clotan, S. T. Madhusudhan, M. Surdeanu, Layer-wise quantization: A pragmatic and effective method for quantizing LLMs beyond integer bit-levels, 2024. arXiv:2406.17415.</p>
      <p>[22] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. van Baalen, T. Blankevoort, A white paper on neural network quantization, CoRR abs/2106.08295 (2021). arXiv:2106.08295.</p>
      <p>[23] C. Yu, S. Yang, F. Zhang, H. Ma, A. Wang, E.-P. Li, Improving quantization-aware training of low-precision network via block replacement on full-precision counterpart, 2024. arXiv:2412.15846.</p>
      <p>[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.</p>
      <p>[25] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report, University of Toronto, 2009.</p>
      <p>[26] B. Courty, V. Schmidt, S. Luccioni, Goyal-Kamal, MarionCoutarel, B. Feld, J. Lecourt, LiamConnell, A. Saboni, Inimaz, et al., mlco2/codecarbon: v2.4.1, 2024.</p>
      <p>[27] Z. Xu, S. Sharify, W. Yazar, T. Webb, X. Wang, Understanding the difficulty of low-precision post-training quantization of large language models, arXiv e-prints (2024) arXiv–2410.</p>
      <p>[28] D. Bablani, J. L. Mckinstry, S. K. Esser, R. Appuswamy, D. S. Modha, Efficient and effective methods for mixed precision neural network quantization for faster, energy-efficient inference, arXiv preprint arXiv:2301.13330 (2023).</p>
      <p>[29] L. Wei, Z. Ma, C. Yang, Q. Yao, Advances in the neural network quantization: A comprehensive review, Applied Sciences 14 (2024) 7445.</p>
      <p>[30] A. Zhou, A. Yao, Y. Guo, L. Xu, Y. Chen, Incremental network quantization: Towards lossless CNNs with low-precision weights, in: International Conference on Learning Representations, 2017.</p>
      <p>[31] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, D. S. Modha, Learned step size quantization, in: International Conference on Learning Representations, 2020.</p>
      <p>[32] R. Banner, Y. Nahshan, D. Soudry, Post training 4-bit quantization of convolutional networks for rapid-deployment, Curran Associates Inc., Red Hook, NY, USA, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          , in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>25</volume>
          ,
          Curran Associates, Inc.,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bojarski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Testa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dworakowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Firner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Flepp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Jackel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Monfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zieba</surname>
          </string-name>
          ,
          <article-title>End to end learning for self-driving cars</article-title>
          ,
          <source>CoRR</source>
          abs/1604.07316 (
          <year>2016</year>
          ). arXiv:1604.07316.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Emer</surname>
          </string-name>
          ,
          <article-title>Efficient processing of deep neural networks: A tutorial and survey</article-title>
          ,
          <source>Proceedings of the IEEE</source>
          <volume>105</volume>
          (
          <year>2017</year>
          )
          <fpage>2295</fpage>
          -
          <lpage>2329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Towards efficient edge AI: A review on DNN quantization</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>32</volume>
          (
          <year>2021</year>
          )
          <fpage>4106</fpage>
          -
          <lpage>4120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jobin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ienca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vayena</surname>
          </string-name>
          ,
          <article-title>The global landscape of AI ethics guidelines</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>389</fpage>
          -
          <lpage>399</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vinuesa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azizpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Leite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balaam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dignum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Domisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Felländer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Langhans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tegmark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fuso Nerini</surname>
          </string-name>
          ,
          <article-title>The role of artificial intelligence in achieving the sustainable development goals</article-title>
          ,
          <source>Nature Communications</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Hubara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Courbariaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Soudry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>El-Yaniv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Quantized neural networks: Training neural networks with low precision weights and activations</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>18</volume>
          (
          <year>2017</year>
          )
          <fpage>6869</fpage>
          -
          <lpage>6898</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <article-title>Mixed precision quantization based on information entropy</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>15</volume>
          (
          <year>2025</year>
          )
          <fpage>12974</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning-based layer-wise quantization for lightweight deep neural networks</article-title>
          ,
          in:
          <source>2020 IEEE International Conference on Image Processing (ICIP)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>3070</fpage>
          -
          <lpage>3074</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>