Layer-wise Quantization in Green-Aware AI

Farwa Ikram

Dipanwita Thakur

Antonella Guzzo

Giancarlo Fortino

The University of Calabria, Arcavacata di Rende, Italy

2026

Deep neural networks (DNNs) have become integral to many artificial intelligence (AI) applications due to their superior performance across a wide range of tasks. However, their deployment on edge devices is limited by high computational and energy demands, which also increase carbon emissions. Quantization, particularly layer-wise quantization, has emerged as a promising technique to mitigate these limitations by reducing numerical precision and computational overhead. In this paper, we present a comprehensive study of the effects of several layer-wise quantization strategies on the energy efficiency and environmental impact of DNNs, with a specific focus on green-aware AI. We examine uniform, ascending, and descending quantization schemes to understand how per-layer bit-width assignments can be optimized for minimal energy usage while preserving model accuracy. Our analysis includes a quantitative assessment of energy consumption and the associated CO2 emissions. Experiments on standard benchmarks, with ResNet18 trained on CIFAR-10 and CIFAR-100, demonstrate that our method achieves up to a 45% reduction in energy consumption and memory usage while keeping accuracy competitive, offering a practical solution for sustainable AI deployment.

Keywords: Green AI, Layer-wise quantization, Energy Efficiency, Carbon Emission

1. Introduction

Deep neural networks (DNNs) have revolutionized the field of machine learning, enabling breakthroughs in computer vision [1], natural language processing [2], and autonomous systems [3]. Despite their success, the computational intensity and memory footprint of DNNs have raised significant concerns, especially for deployment on resource-constrained edge devices such as mobile phones, embedded systems, and IoT devices [4, 5]. Moreover, the environmental cost of training and deploying large-scale DNNs, including energy consumption and CO2 emissions, has drawn increasing scrutiny from both academia and industry in light of the Sustainable Development Goals (SDGs) defined by the United Nations [6, 7].

Quantization is a well-established technique for reducing the complexity of DNNs by lowering the precision of weights and activations [8, 2]. While traditional quantization approaches typically apply a uniform bit-width across all layers [9], recent research suggests that a more nuanced, layer-wise strategy may better balance accuracy and efficiency. This paper explores the design and impact of several layer-wise quantization schemes, with a particular emphasis on green-aware computing, i.e., reducing the environmental impact of AI systems without sacrificing performance.

We investigate three layer-wise quantization strategies: uniform quantization, ascending quantization (lower to higher bit-width from shallow to deeper layers), and descending quantization (higher to lower bit-width from shallow to deeper layers). Through empirical analysis and benchmarking, we evaluate their effects on model performance, memory usage, energy consumption, and CO2 emissions. Our work provides new insights into how intelligent bit-width allocation can yield more sustainable AI solutions.

1.1. Motivation

The motivation for this work stems from the growing need to reconcile the performance of deep learning models with environmental and deployment constraints. Many real-world applications require deploying DNNs on edge devices, which are limited in computational power, memory, and battery life, so reducing the energy and memory demands of models is essential for feasible and responsive edge deployment. Training and deploying DNNs also have a non-trivial environmental footprint: studies show that model inference at scale contributes significantly to carbon emissions, and quantization strategies that minimize energy use can directly reduce this impact. While uniform quantization offers simplicity, it may not be optimal. Different layers in a network have varying sensitivity to quantization, suggesting that intelligent, layer-specific strategies could yield better trade-offs between efficiency and accuracy [9]. Previous work has not sufficiently explored the impact of different layer-wise quantization strategies on green metrics such as energy consumption and CO2 emissions. This study addresses that gap by providing a systematic comparison and practical insights.

1.2. Contribution

This paper makes the following key contributions:
• We systematically evaluate three distinct layer-wise quantization approaches, namely uniform, ascending, and descending, analyzing their trade-offs in terms of energy efficiency, memory usage, and classification accuracy. We also analyze the energy and environmental impact of layer-wise quantization strategies in DNNs, focusing on reducing CO2 emissions while maintaining model performance.
• We present a quantitative assessment of energy savings and corresponding reductions in CO2 emissions resulting from each quantization strategy, enabling a clearer understanding of their environmental benefits.
• Extensive experiments on two different flavors of ResNet18 using CIFAR-10 and CIFAR-100 demonstrate that our proposed strategies can reduce energy consumption and memory footprint by up to 45% while keeping accuracy competitive.

2. Related Work

To achieve the objectives of green-aware AI, extensive work has been conducted on layer-wise quantization, which can reduce communication and computing overhead. This method controls precision per layer, allowing fine-grained trade-offs between memory access, model size, and inference cost, and can significantly reduce the environmental impact of deep learning models with little or no decline in performance. In [21], the authors proposed a layer-wise quantization method that assigns different bit-widths to layers based on their importance, using two scoring metrics, one measuring changes in input-output representations and another based on weight distribution. Applied with techniques such as GPTQ and Quanto, this approach preserved over 90% accuracy even when compressing models such as LLama-13B to an average of 3.25 bits, outperforming uniform quantization and moderate pruning in both classification and reasoning tasks. In [17], a lightweight multilayer perceptron (MLP) model was introduced for speech emotion recognition (SER), together with a layer-wise adaptive quantization (LAQ) scheme that compresses the model while maintaining accuracy. Their approach assigns different bit-widths to layers according to scores based on parameter proportion, entropy, and weight variance. In another work [13], Gluska and Grobman proposed a method to identify which layers of a neural network contribute the most to quantization error by quantizing one layer at a time. Their analysis showed that performance loss is often caused by a small number of layers. Using this insight, they applied targeted fixes, such as clipping only the most problematic layer, which improved the accuracy of ResNet26 from 74.28% to 75.91%, outperforming global quantization techniques.

Model | Method | Bit Allocation Strategy | Layer-wise Quantization Basis | Results
VGG-16, ResNet18 | RL-based Mixed-Precision Quantization | Descending | Parameter count per layer | 11× compression, 93.5% accuracy vs 93.6%
LLama2 | Quantization Error Propagation (QEP) | Descending | Error sensitivity across layers | Accuracy improved from 67.9% to 70.1% with 2-bit GPTQ
LLama-13B | Layer-wise Quantization via Importance Scores | Mixed precision per layer | Layer importance via input-output changes and weight distribution | Over 90% accuracy at 3.25 avg bits
ResNeXt26 | Layer-wise Error Analysis | Fixed | Empirical error sensitivity (quantize one layer at a time) | Accuracy improved from 74.28% to 75.91%
ImageNet | LTNN (Tensorized Decomposition) | Mixed | Tensor decomposition structure | Up to 64× compression, 13.95% top-5 error at 35.84×
WGANs | Adaptive Layer-wise Quantization | Adaptive | Gradient noise model and convergence behavior | Up to 47% faster training as compared to Q-GenX
VGG19, ResNet18/34 | Hybrid Quantization and Pruning | Descending | Statistical metrics (entropy, variance, sparsity) | 91% accuracy, avg. 1.08 bit-width, up to 81% sparsity
TESS, EMODB, SAVEE | Layer-wise Adaptive Quantization for SER | Descending | Parameter proportion, entropy, weight variance | Up to 99.29% accuracy
LLaMA models | SensiBoost, KurtBoost | Descending | Activation sensitivity and kurtosis | Up to 9% lower perplexity with only 2% more memory
ResNet18 | Edge-MPQ with RISC-V | Mixed precision | Hardware constraints and training sensitivity | 47.67× speedup, 6.7% higher PTQ accuracy
ImageNet (AlexNet) | Entropy-based layer-wise Quantization | Descending | Weight and activation entropy | 90.28% accuracy, 45.64% top-1 (ImageNet)

Similarly, the authors of [18] proposed a layer-sensitive quantization strategy for large language models that selectively assigns higher precision to layers identified as more sensitive to quantization errors. They developed SensiBoost and KurtBoost to guide memory allocation at the layer level. These methods improved quantization accuracy while staying within a modest memory budget. For instance, SensiBoost reduced perplexity by up to 9% on LLaMA models with only a 2% increase in memory. In [19], the authors presented a novel hardware-aware, layer-wise mixed-precision quantization (MPQ) framework tailored for edge computing, combining efficient inference hardware with novel quantization search algorithms. Their design integrates versatile inference units supporting INT2 to INT16 directly into a RISC-V processor pipeline, achieving a speedup of up to 47.67× and an energy efficiency of 20.51 TOPS/W over the baseline RV64IMA core. The authors in [20] presented an adaptive layer-wise quantization method that assigns a bit-width to each layer based on its importance, measured using the entropy of weights and activations. To stabilize the activation distribution, they applied L2 regularization during training, enabling more effective compression. Their approach outperformed fixed-bit quantization methods in both accuracy and compression efficiency. On CIFAR-10, they achieved 90.28% accuracy with 27.5× compression for weights, and 86.14% accuracy with joint weight-activation quantization.

In [10], the authors proposed a reinforcement learning-based framework to perform mixed-precision quantization across deep neural network layers. Instead of applying a fixed bit-width uniformly across all layers, their method uses the parameter count of each layer to guide the bit allocation process. Using Deep Q-Learning, the system learns to assign different bit-widths to each layer in a way that balances accuracy with model size reduction. For example, VGG-16 on CIFAR-10 was compressed from 59.91MB to 5.57MB with only a 0.1% drop in accuracy (93.6% to 93.5%). A novel approach called quantized optimistic dual averaging (QODA) was proposed in [15] for distributed variational inequality problems. The approach provides theoretical guarantees on both quantization error and communication cost, and generalizes previous global quantization results. When applied to training WGAN on CIFAR-10 and CIFAR-100 across 4 GPUs, the method not only achieved better accuracy than the Q-GenX baseline but also accelerated training by up to 47%, demonstrating both theoretical and empirical advantages in distributed optimization.

A hybrid approach that combines quantization and pruning, adapted to the statistical significance of each neural network layer, was proposed in [16]. The method adaptively optimizes bit-widths and pruning thresholds on a per-layer basis using measures such as entropy, variance, and sparsity to minimize model size without sacrificing accuracy. For VGG19, ResNet18, and ResNet34 on CIFAR-10, the method maintains accuracy above 91%, reaches an average bit-width as small as 1.08, and can induce parameter sparsity of up to 81%. In [14], a compression technique known as LTNN was developed, which reshapes neural network weights into high-dimensional tensors and applies low-rank tensor-train decomposition in a layer-wise training procedure to reduce model size while preserving accuracy. Additionally, the authors suggested quantizing the tensor cores for further compression. On MNIST and CIFAR-10 benchmarks, LTNN can outperform state-of-the-art methods with compression and 21.57 x 2.2 loss compression. On ImageNet, they reported 8.8× compression with 13.15% top-5 error, which increased to 13.95% error at 35.84× compression with quantization. We summarize the proposed layer-wise quantization methods in Table 4.

3. Methodology

Quantization plays a crucial role in the efficient deployment of machine learning (ML) models and has consequently become a prominent focus of current research. Model quantization is a widely recognized compression technique that effectively reduces computational complexity and storage requirements [22]. It transforms high-precision floating-point numbers, such as 32-bit representations, into lower-precision integers, for example 8-bit or 4-bit formats, or even smaller ones. While quantization provides significant benefits in computational efficiency and storage demand, it also presents challenges related to precision loss [23]. The sensitivity of the layers within a model to quantization can differ significantly, and conventional fixed-precision quantization methods may adversely affect the performance of critical layers, thereby degrading the model's overall inference capabilities. To mitigate these issues, researchers have been developing mixed-precision quantization techniques. These methods strategically allocate different bit lengths to individual layers, striking a balance between resource utilization and model performance. We analyze the impact of fixed-bit layer-wise quantization, as well as of assigning different levels of precision to individual layers of a neural network, which allows a more fine-grained trade-off between model accuracy and computational efficiency. A graphical representation of the methodology is shown in Figure 1.
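To make the float-to-integer mapping concrete, the following minimal sketch applies symmetric uniform quantization to a short list of weights at several bit-widths. The function names, the per-tensor max-abs scaling, and the sample weights are illustrative assumptions and not necessarily the quantizer used in this study.

```python
def quantize_symmetric(values, num_bits):
    """Map floats to signed integer codes in [-(2^(b-1) - 1), 2^(b-1) - 1]."""
    if num_bits < 2:
        raise ValueError("num_bits must be at least 2 for a signed grid")
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # Assumption: one scale per tensor, derived from the largest magnitude.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from integer codes."""
    return [c * scale for c in codes]


weights = [0.82, -0.41, 0.05, -0.97, 0.33]  # hypothetical weights

for bits in (8, 4, 2):
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"{bits}-bit codes={codes} max_abs_error={max_err:.4f}")
```

Lower bit-widths leave fewer representable levels, so the reconstruction error grows as the bit-width shrinks, which is the precision loss discussed above.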

3.1. Layer-wise Quantization

Layer-wise quantization is a model compression approach that quantizes different layers of a neural network with different numerical precisions. Rather than applying an equal bit-width policy across all layers, our solution uses selective quantization, in which layers are quantized in a fine-grained manner according to each layer's contribution to the overall model performance. Less important or more robust layers can usually be quantized to a small number of bits (2-bit or 3-bit), while sensitive layers are kept at higher precision to avoid loss in accuracy. This technique optimizes the trade-off between computational cost and model quality by focusing quantization where it is most beneficial. For example, in convolutional networks, the first or middle layers can tolerate more aggressive quantization because their feature extraction is naturally redundant, while the later layers lie close to the decision boundary and require higher precision for classification accuracy.

Quantization is conducted layer by layer. In fixed-bit quantization, all of the selected layers use the same bit-width. In addition to fixed-bit quantization, existing adaptive quantization techniques employ ascending-trend quantization, where the quantization level rises as training progresses, and descending-trend quantization, where the quantization level decreases as training progresses.

Algorithm 1: Layer-wise Quantization Strategy

Algorithm 1 shows the steps of the layer-wise quantization method for a neural network consisting of $L$ layers arranged in topological order as $l_1, l_2, \ldots, l_L$. Each layer typically performs a standard operation such as convolution, fully connected computation, or pooling. In this method, three specific layers, denoted as $q_1$, $q_2$, and $q_3$, are selected for quantization, while the remaining layers are retained in full precision. Lines 5-9 explore three configurations: (1) ascending quantization, where layers are quantized using increasing precision levels; (2) descending quantization, where decreasing precision is applied across layers; and (3) fixed-bit quantization, in which all selected layers are quantized using the same bit-width. Lines 13-14 apply quantization to the weights and to the input/output activations of each quantized layer. Algorithm 1 runs until it either converges or satisfies a predefined condition, such as achieving 95% accuracy.
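The three configurations described above can be sketched as a small bit-width assignment routine. This is an illustrative sketch rather than a transcription of Algorithm 1: the function names and the layer names are hypothetical, while the bit levels 2, 3, and 4 and the 2-bit fixed setting follow the values discussed later in the paper.

```python
def assign_bit_widths(selected_layers, strategy, bit_levels=(2, 3, 4), fixed_bits=2):
    """Return {layer_name: bit_width} for layers listed from shallow to deep."""
    if strategy == "fixed":
        # Every selected layer shares the same bit-width.
        return {name: fixed_bits for name in selected_layers}
    if len(selected_layers) != len(bit_levels):
        raise ValueError("need exactly one bit level per selected layer")
    ordered = sorted(bit_levels)
    if strategy == "descending":
        ordered = ordered[::-1]  # highest precision goes to the shallowest layer
    elif strategy != "ascending":
        raise ValueError(f"unknown strategy: {strategy}")
    return dict(zip(selected_layers, ordered))


def precision_plan(all_layers, selected_layers, strategy, full_precision_bits=32):
    """Quantize only the selected layers; keep all others in full precision."""
    quantized = assign_bit_widths(selected_layers, strategy)
    return {name: quantized.get(name, full_precision_bits) for name in all_layers}


# Hypothetical layer names, listed in topological order.
all_layers = ["conv1", "layer1", "layer2", "layer3", "fc"]
selected = ["layer1", "layer2", "layer3"]

for strategy in ("ascending", "descending", "fixed"):
    print(strategy, precision_plan(all_layers, selected, strategy))
```

Keeping the schedule separate from the quantizer itself makes it straightforward to swap strategies without touching the training loop.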

3.2. Customized ResNet18 for CIFAR-10

ResNet18 [24] is a popular DNN model for image classification. The architecture used in this work, known as ResNet-CIFAR, is summarized in Table 2. The ResNet-CIFAR architecture begins with an initial convolutional layer (Conv1) that applies a 3×3 convolution to the input RGB image, increasing the channel depth from 3 to 16. This is followed by batch normalization and a ReLU activation function, preserving the spatial resolution of 32×32. The first major stage, Layer1, consists of multiple residual blocks (as specified by layers[0]) and operates at a constant resolution of 32×32, maintaining 16 output channels. Layer2 introduces spatial downsampling via a stride of 2 in the first residual block, reducing the feature map size to 16×16 and increasing the number of channels to 32; it contains layers[1] residual blocks. Similarly, Layer3 performs another downsampling step, reducing the resolution to 8×8 while increasing the channel count to 64, with layers[2] residual blocks. Following these stages, a global average pooling layer aggregates each 8×8 feature map into a single scalar value, resulting in a 64-dimensional feature vector. Finally, a fully connected (FC) layer maps this vector to the desired number of output classes, producing the final classification logits. Notably, each residual layer group can employ a different quantization bit-width, allowing layer-wise precision control that balances accuracy and efficiency. The three layer groups, each consisting of several residual blocks, are quantized using equal, ascending, and descending bit precision.
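The resolutions and channel counts quoted above can be checked with a short shape trace. The sketch below uses the standard convolution output-size formula and assumes a padding of 1 for every 3×3 convolution, which is not stated in the text; the function names are hypothetical.

```python
def conv_out_size(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_resnet_cifar(input_size=32, num_classes=10):
    """Return (stage, channels, spatial_size) tuples for the stages in the text."""
    trace = []
    size = conv_out_size(input_size)  # Conv1: 3x3 convolution, stride 1
    trace.append(("conv1", 16, size))
    stages = (("layer1", 16, 1), ("layer2", 32, 2), ("layer3", 64, 2))
    for name, channels, stride in stages:
        # Only the first block of a stage applies the stride; later blocks keep the size.
        size = conv_out_size(size, stride=stride)
        trace.append((name, channels, size))
    trace.append(("avgpool", 64, 1))        # global average pooling: 64-dim vector
    trace.append(("fc", num_classes, 1))    # classification logits
    return trace


for stage, channels, size in trace_resnet_cifar():
    print(f"{stage:>8}: {channels:>3} channels, {size}x{size}")
```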

4. Experimental Setup

We conducted layer-wise quantization experiments on the CIFAR-10 and CIFAR-100 [25] datasets, training for 400 epochs with a learning rate of 0.1. CIFAR-10 and CIFAR-100 each consist of 60,000 RGB images of size 32×32 pixels; CIFAR-10 is divided into 10 classes, while CIFAR-100 provides a more fine-grained classification across 100 distinct classes. This configuration was designed to evaluate how well a ResNet18 model performs under layer-wise quantization in terms of efficiency, time, loss, and accuracy. The network was optimized using the SGD optimizer with a momentum of 0.9. We also ran experiments with the full-precision model to analyze its energy consumption, enabling a comparative assessment of the efficiency gains achieved through quantization. Quantization proved to be a key factor in decreasing the energy cost of the model's execution. We used the CodeCarbon software package [26] to track the energy use and corresponding carbon emissions associated with the quantization technique. The experiments were performed on a Linux machine with an Intel(R) Xeon(R) W-2265 CPU at 3.50GHz and 76 cores; the platform supports Metal 3, a graphics application programming interface (API) designed to boost graphics rendering.

Figure 2: (a) Train Accuracy, (b) Train Loss, (c) Test Accuracy, (d) Test Loss.

Figure 3: (a) Train Accuracy, (b) Train Loss, (c) Test Accuracy, (d) Test Loss.
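As a rough illustration of how measured energy relates to emissions, tools such as CodeCarbon estimate emissions by multiplying the consumed energy by the carbon intensity of the local electricity grid. The sketch below reproduces only that arithmetic. The energy readings and the intensity value are hypothetical placeholders, not measurements from this study, and the functions are not part of the CodeCarbon API.

```python
def emissions_kg(energy_kwh, intensity_kg_per_kwh):
    """Estimated emissions (kg CO2-eq) = energy (kWh) x grid carbon intensity."""
    return energy_kwh * intensity_kg_per_kwh


def relative_saving(baseline, candidate):
    """Fraction saved by the candidate relative to the baseline."""
    return (baseline - candidate) / baseline


GRID_INTENSITY = 0.30        # kg CO2-eq per kWh, hypothetical value
energy_full_precision = 2.0  # kWh, hypothetical reading
energy_quantized = 1.5       # kWh, hypothetical reading

co2_full = emissions_kg(energy_full_precision, GRID_INTENSITY)
co2_quant = emissions_kg(energy_quantized, GRID_INTENSITY)

print(f"energy saving:   {relative_saving(energy_full_precision, energy_quantized):.0%}")
print(f"emission saving: {relative_saving(co2_full, co2_quant):.0%}")
```

With a constant grid intensity, the relative emission saving equals the relative energy saving, so the two metrics move together when all runs are performed on the same machine.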

4.1. Experimental Results

Table 3 presents a comparative analysis of different quantization strategies on CIFAR-10 and CIFAR-100 using ResNet18. Full precision achieves the highest accuracy but incurs the highest energy consumption and carbon emissions. Among the quantized approaches, the ascending bit-width strategy consistently outperforms the descending and fixed-bit configurations in both accuracy and loss, indicating that allocating higher precision to deeper layers improves performance. To ensure robustness, each experiment was run for 400 rounds. Performance is reported as mean and standard deviation; these values reflect the variability observed within a single training run, based on the measurements collected throughout the training process. Notably, quantized methods significantly reduce energy and emissions, highlighting their potential for sustainable deployment, while maintaining competitive accuracy, especially on the less complex CIFAR-10 dataset. Figures 2 and 3 illustrate the training and testing accuracy and loss trends over training rounds for CIFAR-10 and CIFAR-100, respectively. On CIFAR-10 (Figure 2), all quantization schemes converge stably, with full precision performing best and ascending quantization following closely. The fixed-bit and descending strategies show slightly degraded accuracy and slower convergence. On CIFAR-100 (Figure 3), the performance gap widens, indicating greater sensitivity to quantization due to increased task complexity. Notably, ascending quantization again outperforms the fixed and descending bit schemes, supporting the hypothesis that allocating higher precision to deeper layers preserves semantic fidelity and enhances learning stability.
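The within-run mean and standard deviation can be computed from per-epoch measurements, for instance with Python's standard statistics module. In the sketch below the accuracy values are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator was used.

```python
import statistics


def summarize(per_epoch_values):
    """Return (mean, sample standard deviation) of the values from one run."""
    mean = statistics.mean(per_epoch_values)
    std = statistics.stdev(per_epoch_values)  # sample (n - 1) estimator
    return mean, std


# Hypothetical per-epoch test accuracies (%) from a single training run.
test_accuracy = [88.0, 90.0, 91.0, 91.0, 90.0]

mean, std = summarize(test_accuracy)
print(f"test accuracy: {mean:.2f} +/- {std:.2f}")
```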

Overall, the results demonstrate that although quantization techniques such as ascending, descending, and fixed-bit quantization may slightly reduce model accuracy, they provide significant advantages in energy efficiency, reduced communication overhead, and faster convergence. These benefits make them especially useful for neural network applications in which computational and resource efficiency are essential. The findings expose the trade-off between accuracy and environmental impact and position quantized training as a more energy-efficient alternative to full-precision training. Although less accurate, these methods are far more energy efficient and are therefore reasonable techniques for training resource-efficient models with low performance degradation.

5. Limitation

Despite the benefits that layer-wise quantization offers, one of its major drawbacks is the loss of numeric accuracy, which can cause errors, especially at very low bit-widths [18]. The errors introduced in each layer accumulate, and by the end of the process the overall performance of the model is degraded [27]. This degradation is significant in deep networks. Moreover, because of the complexity of the network structure and the presence of different types of data patterns, layer-wise quantization techniques may not work effectively in some scenarios, resulting in non-optimal quantization settings [11]. Finding the optimal bit-width configuration for each layer of a model is often difficult and time-consuming, especially for large models, and a brute-force search is frequently impractical because the number of combinations is large [28]. For instance, it is infeasible to search for the best bit-width of each layer by brute force in large networks with many layers [11]. In addition, although post-training quantization (PTQ) is typically faster than quantization-aware training (QAT), some layer-wise variants can be costly to compute. Another problem in layer-wise quantization is over-fitting to the training dataset, which might hinder the ability of the trained model to perform well on new, unseen data. Furthermore, when different bit-widths per layer are considered (mixed precision), the optimization problem is further exacerbated by an even larger search space [29]. Even though layer-wise quantization is a helpful technique for model compression, its possible drawbacks should be considered, and a balance should be maintained among loss of accuracy, error propagation, and optimization effort.
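The growth of the search space can be illustrated with a short calculation: with b candidate bit-widths and n quantized layers there are b^n possible assignments. The sketch below enumerates the small three-layer case and only counts the larger one; the 18-layer figure is an illustrative example, not a configuration evaluated in this study.

```python
from itertools import product


def num_configurations(num_layers, bit_choices):
    """Number of distinct per-layer bit-width assignments."""
    return len(bit_choices) ** num_layers


def enumerate_configurations(num_layers, bit_choices):
    """Explicitly list every assignment (only feasible for small networks)."""
    return list(product(bit_choices, repeat=num_layers))


bit_choices = (2, 3, 4)

small = enumerate_configurations(3, bit_choices)
print(len(small))                           # 27 assignments for three layers
print(num_configurations(18, bit_choices))  # 387420489 assignments for 18 layers
```

Because each candidate assignment must be evaluated by at least a partial training or calibration run, even the modest 18-layer case is out of reach for exhaustive search.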

6. Discussion

In layer-wise quantization of ResNet18, different bit-widths are applied to different layers of the model to reduce communication overhead or memory usage, especially in on-device inference. When focusing on three specific layers, the choice of bit-width per layer can significantly affect model accuracy (through quantization error), communication cost, computational efficiency, and energy consumption. In our work, the quantized layers are layer1, layer2, and layer3, each corresponding to a group of residual blocks and to progressively deeper levels of feature extraction. Equal quantization with 2, 2, and 2 bits in the three layers yields uniform compression across all of them. An extremely low bit-width (2 bits) leads to high quantization error, especially in deeper or more critical layers that typically require more representational power. This may result in significant accuracy degradation, especially if the model is sensitive to weight precision. The advantages, however, are minimal communication and storage cost and simplicity of implementation.

According to Table 4, layer 1 captures edges and textures. Hence, layer 1 is less sensitive to quantization, even at low bit-widths, and carries a low risk of significant accuracy degradation [30]. Layer 2 captures combinations of low-level features. It is therefore moderately sensitive to quantization and important for preserving spatial coherence and abstraction [31]. Layer 3 captures abstract object parts that are critical for classification. These features are highly sensitive to quantization errors, and a low bit-width here can severely hurt model performance. In equal quantization, layer 3 suffers the most, with reduced capacity to capture high-level semantics [32]. Equal quantization therefore gives the best compression and minimal bandwidth and energy usage, but its expected accuracy is significantly lower. It can nevertheless be used in ultra-low-resource systems where accuracy is not the primary concern. Ascending quantization assigns fewer bits to layer 1, which is the least critical, and more bits to layer 3, which is the most critical, thereby balancing compression and accuracy. Compared to equal quantization, it slightly increases communication and computation cost. Its expected accuracy is the closest to the full-precision baseline among the three strategies. It is also well suited to federated learning (FL), where both efficiency and accuracy matter. Descending quantization prioritizes precision in layer 1 over layer 3. It preserves the fidelity of low-level features, but the quantization error in layer 3 degrades the quality of the final decision. Its accuracy is better than that of equal quantization but worse than that of ascending quantization. It is suitable for legacy systems using partial inference offloading, but is not ideal in most cases.
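The storage side of this comparison can be estimated with simple arithmetic. The sketch below counts only the 3×3 convolution weights of the three stages, assuming two convolutions per residual block and three blocks per stage. Both are assumptions, since the block counts are not fixed in the text, and shortcut, batch-normalization, and FC parameters are ignored, so the numbers are illustrative only.

```python
def conv_weights(in_ch, out_ch, kernel=3):
    """Number of weights in one convolution (bias ignored)."""
    return in_ch * out_ch * kernel * kernel


def stage_weights(in_ch, out_ch, num_blocks):
    """Weights of one stage, assuming two 3x3 convolutions per residual block."""
    first = conv_weights(in_ch, out_ch)  # the first convolution changes the channel count
    rest = (2 * num_blocks - 1) * conv_weights(out_ch, out_ch)
    return first + rest


def footprint_bytes(weights_per_stage, bits_per_stage):
    """Storage needed for the weights at the given per-stage bit-widths."""
    total_bits = sum(w * b for w, b in zip(weights_per_stage, bits_per_stage))
    return total_bits / 8


BLOCKS = 3  # hypothetical number of residual blocks per stage
stages = [
    stage_weights(16, 16, BLOCKS),  # layer1
    stage_weights(16, 32, BLOCKS),  # layer2
    stage_weights(32, 64, BLOCKS),  # layer3
]

plans = {
    "full precision": (32, 32, 32),
    "equal": (2, 2, 2),
    "ascending": (2, 3, 4),
    "descending": (4, 3, 2),
}

for name, bits in plans.items():
    print(f"{name:>14}: {footprint_bytes(stages, bits) / 1024:.1f} KiB")
```

Under these assumptions the deepest stage holds most of the weights, so the ascending plan stores more bits than the descending or equal plans, which is consistent with the slightly higher cost attributed to ascending quantization above.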

7. Conclusion and Future Work

In this study, we analyzed the impact of layer-wise quantization on the performance of ResNet18 by targeting three core layers, each representing progressively deeper levels of feature abstraction. By applying equal, ascending, and descending quantization strategies with bit-widths of 2, 3, and 4, we demonstrated that the sensitivity to quantization is not uniform across layers. Specifically, lower layers (e.g., layer 1), which extract low-level features such as edges and textures, exhibit higher redundancy and tolerate aggressive quantization (e.g., 2-bit) with minimal accuracy degradation. In contrast, deeper layers such as layer 3, which are responsible for high-level semantic representations, are significantly more sensitive to quantization errors. Our findings affirm that ascending quantization (e.g., 2, 3, 4 bits from shallow to deep layers) strikes the best trade-off between compression efficiency and model performance. The insights derived from this work highlight the importance of precision allocation in layer-wise quantization. By aligning quantization precision with the semantic importance of each layer, one can optimize both resource utilization and inference accuracy, which is especially beneficial in resource-constrained environments such as edge devices and federated learning setups.

Future work will extend this study in several directions: (i) dynamic quantization schemes, where bit-widths are adjusted in real time based on layer-wise gradient statistics or activation distributions; (ii) task-aware quantization, which tailors bit-widths to the sensitivity of layers for specific downstream tasks (e.g., detection vs. classification); (iii) energy-aware quantization frameworks that integrate energy profiling into the quantization decision process to further enhance the efficiency of on-device AI; (iv) cross-layer optimization techniques, such as joint pruning and quantization, to holistically reduce computational overhead while maintaining performance; and (v) adaptive training strategies that co-train bit-widths and weights to minimize quantization-induced degradation during training. By addressing these directions, the community can move closer to achieving scalable, high-performance neural networks suitable for deployment in low-power, latency-sensitive applications.

Declaration on Generative AI

The authors have not employed any Generative AI tools.

Acknowledgments

This work contributes to the basic research activities of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 9 - Green-aware AI, under the NRRP MUR program funded by the NextGenerationEU.

References

[1] Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: F. Pereira, Burges, Bottou, K. Weinberger (Eds.), Advances in Neural Information Processing Systems, volume 25, Curran Associates, Inc., 2012.
[2] Devlin, Chang, Lee, Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). arXiv:1810.04805.
[3] Bojarski, D. D. Testa, Dworakowski, Firner, Flepp, Goyal, L. D. Jackel, Monfort, Muller, Zhang, Zhao, Zieba, End to end learning for self-driving cars, CoRR abs/1604.07316 (2016). arXiv:1604.07316.
[4] Sze, Y.-H. Chen, T.-J. Yang, J. S. Emer, Efficient processing of deep neural networks: A tutorial and survey, Proceedings of the IEEE 105 (2017) 2295–2329.
[5] Xu, Yu, Liang, Towards efficient edge ai: A review on dnn quantization, IEEE Transactions on Neural Networks and Learning Systems 32 (2021) 4106–4120.
[6] Jobin, Ienca, E. Vayena, The global landscape of ai ethics guidelines, Nature Machine Intelligence 1 (2019) 389–399.
[7] Vinuesa, Azizpour, I. Leite, Balaam, Dignum, Domisch, Felländer, S. D. Langhans, Tegmark, Fuso Nerini, The role of artificial intelligence in achieving the sustainable development goals, Nature Communications 11 (2020) 1–10.
[8] Hubara, Courbariaux, Soudry, El-Yaniv, Bengio, Quantized neural networks: Training neural networks with low precision weights and activations, Journal of Machine Learning Research 18 (2017) 6869–6898.
[9] Qin, Li, Zhao, Yan, Du, Mixed precision quantization based on information entropy, Scientific Reports 15 (2025) 12974.
[10] Jung, Kim, Kim, Reinforcement learning-based layer-wise quantization for lightweight deep neural networks, in: 2020 IEEE International Conference on Image Processing (ICIP), IEEE, 2020, pp. 3070–3074.
[11] Y. Arai, Y. Ichikawa, Quantization error propagation: Revisiting layer-wise post-training quantization, arXiv preprint arXiv:2504.09629 (2025).
[12] R.-G. Dumitru, V. Yadav, R. Maheshwary, P. I. Clotan, S. T. Madhusudhan, M. Surdeanu, Layerwise quantization: A pragmatic and effective method for quantizing LLMs, 2024. URL: https://openreview.net/forum?id=eJVrwDE086.
[13] S. Gluska, M. Grobman, Exploring neural networks quantization via layer-wise quantization analysis, arXiv preprint arXiv:2012.08420 (2020).
[14] H. Huang, H. Yu, Ltnn: A layerwise tensorized compression of multilayer neural network, IEEE Transactions on Neural Networks and Learning Systems 30 (2018) 1497–1511.
[15] A. D. Nguyen, I. Markov, A. Ramezani-Kebrya, K. Antonakopoulos, D. Alistarh, V. Cevher, Layerwise quantization for distributed variational inequalities, in: Workshop on Machine Learning and Compression, NeurIPS 2024, 2024.
[16] T. Shinde, Adaptive quantization and pruning of deep neural networks via layer importance estimation, in: Workshop on Machine Learning and Compression, NeurIPS 2024, 2024. URL: https://openreview.net/forum?id=kf6x9RCvHf.
[17] T. Shinde, R. Jain, A. K. Sharma, Lightweight neural networks for speech emotion recognition using layer-wise adaptive quantization, J. Name (2025).
[18] F. Zhang, Y. Liu, W. Li, J. Lv, X. Wang, Q. Bai, Towards superior quantization accuracy: A layer-sensitive approach, arXiv preprint arXiv:2503.06518 (2025).
[19] X. Zhao, R. Xu, Y. Gao, V. Verma, M. R. Stan, X. Guo, Edge-mpq: Layer-wise mixed-precision quantization with tightly integrated versatile inference units for edge computing, IEEE Transactions on Computers (2024).
[20] X. Zhu, W. Zhou, H. Li, Adaptive layerwise quantization for deep neural network compression, in: 2018 IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2018, pp. 1–6.
[21] R.-G. Dumitru, V. Yadav, R. Maheshwary, P.-I. Clotan, S. T. Madhusudhan, M. Surdeanu, Layer-wise quantization: A pragmatic and effective method for quantizing llms beyond integer bit-levels, 2024. arXiv:2406.17415.
[22] M. Nagel, M. Fournarakis, R. A. Amjad, Y. Bondarenko, M. van Baalen, T. Blankevoort, A white paper on neural network quantization, CoRR abs/2106.08295 (2021). arXiv:2106.08295.
[23] C. Yu, S. Yang, F. Zhang, H. Ma, A. Wang, E.-P. Li, Improving quantization-aware training of low-precision network via block replacement on full-precision counterpart, 2024. arXiv:2412.15846.
[24] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
[25] A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Technical Report, University of Toronto, 2009.
[26] B. Courty, V. Schmidt, S. Luccioni, Goyal-Kamal, MarionCoutarel, B. Feld, J. Lecourt, LiamConnell, A. Saboni, Inimaz, et al., mlco2/codecarbon: v2.4.1, 2024.
[27] Z. Xu, S. Sharify, W. Yazar, T. Webb, X. Wang, Understanding the difficulty of low-precision post-training quantization of large language models, arXiv e-prints (2024) arXiv–2410.
[28] D. Bablani, J. L. Mckinstry, S. K. Esser, R. Appuswamy, D. S. Modha, Efficient and effective methods for mixed precision neural network quantization for faster, energy-efficient inference, arXiv preprint arXiv:2301.13330 (2023).
[29] L. Wei, Z. Ma, C. Yang, Q. Yao, Advances in the neural network quantization: A comprehensive review, Applied Sciences 14 (2024) 7445.
[30] A. Zhou, A. Yao, Y. Guo, L. Xu, Y. Chen, Incremental network quantization: Towards lossless CNNs with low-precision weights, in: International Conference on Learning Representations, 2017.
[31] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, D. S. Modha, Learned step size quantization, in: International Conference on Learning Representations, 2020.
[32] R. Banner, Y. Nahshan, D. Soudry, Post training 4-bit quantization of convolutional networks for rapid-deployment, Curran Associates Inc., Red Hook, NY, USA, 2019.