<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigation of Machine Learning Techniques for Restoring the Quality of Compressed Audio</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Artem Panchenko</string-name>
          <email>artem.panchenko@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denys Yarkin</string-name>
          <email>denys.yarkin@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vyacheslav Frolov</string-name>
          <email>v.v.frolov@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ievgen Meniailov</string-name>
          <email>ievgen.meniailov@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maryna Vladymyrova</string-name>
          <email>vladymyrova@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Victoriya Kuznietcova</string-name>
          <email>vkuznietcova@karazin.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>V.N. Karazin Kharkiv National University</institution>
          ,
          <addr-line>4 Svobody sq., Kharkiv, 61101</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>This paper investigates machine learning methods for restoring the quality of audio signals compressed with lossy algorithms. A dedicated data preparation methodology was developed, which involved converting WAV files into MP3 format (96 kbps) and subsequently decoding them back into WAV. Three neural network architectures were explored: a linear model, a convolutional neural network (CNN), and a U-Net-like model. The convolutional model demonstrated the best performance for single-sample restoration, but ofered the additional advantage of reconstructing entire audio segments rather than individual samples. The results confirm the potential of neural networks for enhancing the quality of compressed audio signals. Visual spectrogram analysis further revealed a noticeable reduction in compression-induced noise. The main limitation of the study was the available computational resources, which constrained the scale and complexity of experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;neural networks</kwd>
        <kwd>machine learning</kwd>
        <kwd>data samples restoring</kwd>
        <kwd>convolutional neural network</kwd>
        <kwd>UNet model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The restrictions introduced during the COVID-19 pandemic in 2020 forced businesses worldwide to
transition, wherever feasible, to remote work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, even after the pandemic subsided, a significant
proportion of companies retained remote or hybrid work arrangements. In such settings,
communication between employees is predominantly mediated through audio calls or video conferencing.
Nevertheless, despite the current era of pervasive digitalization, even densely populated regions still
contain areas with unstable connectivity and limited internet bandwidth, which severely complicates
or in some cases renders impossible—efective remote work for residents of these regions. Ensuring
robust and high-quality communication under conditions of unreliable connectivity thus represents an
urgent challenge for the development of advanced communication tools. The present study addresses
this issue by proposing a solution aimed at improving the quality of audio transmission.
      </p>
      <p>
        A wide range of formats exists for the storage of audio information, which can be broadly divided
into two categories: lossless formats, which preserve the entirety of the original data, and lossy formats,
which achieve compression by discarding certain portions of the information [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. One of the most
well-known, widespread, and technically straightforward lossless formats is WAV (Waveform Audio
File) [3]. In this representation, the audio signal is stored as a sequence of discrete samples of the
amplitude of the sound wave. To produce such a representation, the audio signal is measured at regular
intervals in time, with each recorded value rounded to the nearest available level from a discrete set.
The number of available levels is determined by the bit depth (e.g., 16-bit), while the frequency of
these measurements is referred to as the sampling rate. This form of representation is often termed a
waveform.
      </p>
      <p>In contrast, the most widely used lossy format for audio storage is MP3 (formally, MPEG-1 Audio
Layer III or MPEG-2 Audio Layer III), which gained global popularity due to its ability to drastically
reduce file size by leveraging psychoacoustic models [ 4]. The underlying algorithm analyzes the audio
signal and selectively discards components that are imperceptible to the human auditory system, while
encoding the remaining content with relatively minor perceptual loss.</p>
      <p>This principle can be applied to optimize the transmission of audio between devices: compressing the
signal on one device, transmitting the reduced data, and subsequently reconstructing it on the receiving
device. It should be noted, however, that the problem of signal recovery in the presence of packet loss
during transmission remains a critical aspect of ensuring communication reliability.</p>
      <p>The aim of this study is to develop a neural network to validate the feasibility of employing this
approach for restoring the quality of audio after compression. Furthermore, the research seeks to
compare the performance of several neural network architectures and to identify the one most suitable
for addressing the problem of post-decompression audio quality restoration.</p>
      <p>This paper should be regarded as both an introduction to the research area and it is structured as
follows:
• Section 3 presents the dataset employed in the study, along with the details of its preprocessing.
• Section 4 describes the configuration and application of Linear Neural Networks for addressing
the defined tasks.
• Section 5 outlines the setup and utilization of Convolutional Neural Networks (CNNs) for solving
the proposed problems.
• Section 6 details the configuration and use of the UNet model in the context of the study objectives.
• Section 7 provides a comparative analysis of the obtained results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Prior to the widespread adoption of neural networks for audio processing, a variety of methods and
conceptual approaches were proposed to address the problem of restoring degraded audio quality.</p>
      <p>In [5], the authors introduced the idea of estimating the noise spectrum during pauses in speech and
subsequently subtracting it from the active signal spectrum. This method demonstrated a reduction of
quantization noise and the characteristic “buzzing” artifacts in the reconstructed signal.</p>
      <p>In [6], a phase decoder was employed to mitigate phase distortions by reintroducing missing
harmonics and smoothing phase fluctuations.</p>
      <p>However, with the rapid development of machine learning, research eforts shifted towards neural
network–based methods.</p>
      <p>In [7], the authors presented a comprehensive review of the challenges and limitations of applying
machine learning (ML) to audio restoration. The study examined the role of dataset selection (paired
noisy/clean vs. unpaired real-world recordings), the issue of domain shift when synthesizing artifacts
(e.g., Gaussian and pink noise, codec distortions, clicks), and the comparative performance of key
neural architectures, including CNNs, RNNs, autoencoders/variational autoencoders (AE/VAE), GANs,
and transformers. Special attention was given to the choice of loss functions (MSE/MAE, spectral
convergence, log-STFT), as well as critical constraints such as overfitting, inference latency (&lt;10 ms for
real-time applications), and computational cost. The authors concluded that while ML-based approaches
exhibit substantial potential for audio restoration, their practical implementation is hindered by the
scarcity of representative real-world datasets and the mismatch between objective evaluation metrics
and subjective human perception of quality.</p>
      <p>In [8], a lightweight Wavenet-inspired architecture was introduced to enhance audio quality on
resource-constrained devices such as hearing aids, enabling real-time noise suppression under
challenging acoustic conditions.</p>
      <p>In [9], the signal was decomposed into harmonic components, and the resulting representation was
processed with architectures originally designed for image analysis, ofering an alternative paradigm
for audio enhancement.</p>
      <p>In [10], the authors proposed a hybrid architecture combining encoder–decoder structures with
recurrent LSTM layers. This model, comprising approximately 3.7 million parameters, achieved first
and second place in the “Interspeech 2020 Deep Noise Suppression (DNS) Challenge” in the categories
of real-time and ofline speech enhancement, respectively.</p>
      <p>Overall, the review of existing work demonstrates that, despite current challenges, neural networks
and related ML methods hold significant promise for audio restoration. To ensure robust model training,
however, careful dataset design and the selection of appropriate evaluation metrics are essential.</p>
      <p>A notable limitation of prior studies is their focus on restoring either entire audio segments or the
last sample within a sliding window. None of the works considered the possibility of reconstructing a
central sample within a window, which could leverage a richer contextual neighborhood and potentially
improve performance. Furthermore, existing studies generally do not compare multiple models and
approaches on a unified dataset, making it dificult to establish which architectures are more efective
under consistent experimental conditions. Similarly, little systematic attention has been devoted to
analyzing which specific neural network parameters most strongly influence performance.</p>
      <p>The present study seeks to address these gaps by systematically evaluating architectures, training
configurations, and contextual prediction strategies for audio restoration.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data preparation</title>
      <p>For the construction of the training dataset, the LJ Speech Dataset [11] was selected, which comprises
13,100 short audio recordings of a single speaker reading passages from an English-language book,
provided in WAV format.</p>
      <p>There exist several approaches to representing audio signals. In this study, we adopt the waveform
representation, as it is the most practical for real-time audio restoration tasks. This choice eliminates
the need for prior signal transformations before applying the model, thereby ensuring that the
experimental setup closely resembles real-world conditions. Under this representation, the responsibility for
identifying dependencies and patterns within the data is entirely delegated to the model developed in
the course of this work.</p>
      <p>During dataset preparation, each original WAV audio track was converted into the MP3 format at a
bitrate of 96 kbps using the open-source encoder LAME (Lame Ain’t an MP3 Encoder). Subsequently,
the MP3 files were reconverted back into WAV format with identical parameters (16-bit, 22050 Hz). This
process produced paired audio samples: a “high-quality” version encoded at the original bitrate and a
“degraded” version with lower quality, both represented in waveform form.</p>
      <p>To construct the training dataset, the degraded signal was segmented into overlapping windows of
neighboring samples. In this work, each window consisted of 201 samples, corresponding to
approximately 9 milliseconds of audio. These windows were used to predict the original sample located at
the center of the window. This design choice was motivated by the intent to provide the model with a
broader temporal context for the restoration of the target sample. As a result, each training instance
consisted of 201 input values and a single target value.</p>
      <p>It should be emphasized that, due to computational limitations and the requirements of training
relatively large models for this task, a reduced dataset of 5 GB was utilized. While this volume is
insuficient for building a state-of-the-art model, it nonetheless enables the investigation of how
diferent model parameters afect restoration quality, the identification of optimal configurations, and
the formulation of directions for future development of the proposed architectures.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Linear model</title>
      <p>The linear model is defined as the simplest architecture among those tested, consisting of an input
layer followed by several dense layers, each accompanied by an activation function. Additionally, a
BatchNorm1d layer was inserted between the dense layers to ensure more stable training.</p>
      <p>To determine the most suitable activation function for the linear model—specifically, the one that
allows the model to achieve the lowest average loss over an equal number of epochs—a configuration
with three layers and approximately 100,000 parameters was selected. The best results were obtained
with the tanh activation function, yielding a loss value of 1.69e-4.</p>
      <p>However, it should be noted that despite the inclusion of BatchNorm1d, the use of tanh in deeper
models with a larger number of layers may still lead to issues such as the exploding gradient and
vanishing gradient problems. For this reason, it is reasonable to further evaluate the three activation
functions that demonstrated the most promising performance—Tanh, ReLU, and SiLU—on a diferent
architecture consisting of seven layers with the same parameter count (100,000). In addition to the
achieved loss, special attention is given to the training time required for the same, relatively small
number of epochs.</p>
      <p>It is important to note that, for the deeper model, a smaller learning rate was employed. This
adjustment was necessary because the mathematical function represented by such a model becomes
significantly more complex, and a larger learning rate may cause the optimization process to “overshoot”
local minima. Under these conditions, the Tanh activation function continued to yield superior results
compared to ReLU and SiLU, while maintaining approximately the same training time. Consequently,
Tanh was selected as the activation function for all subsequent experiments with the linear model.</p>
      <p>To determine the optimal number of layers, models consisting of 1, 3, 5, 7, 11, and 19 layers were
evaluated. Since diferent configurations require distinct hyperparameter settings—specifically with
respect to the scheduler and learning rate—multiple training runs were conducted for each version, and
the best-performing result was selected for comparison.</p>
      <p>The 3-layer architecture with 50,000 parameters achieved the lowest loss value. Consequently,
this configuration will be considered the “baseline,” and subsequent models with a similar number of
parameters will be evaluated relative to it.</p>
      <p>Additionally, several types of one-dimensional pooling layers were tested:
• MaxPool1d, which selects the maximum value from each one-dimensional window of the input
signal
• AvgPool1d, which computes the average value for each one-dimensional window of the input
signal
• LPPool1d, which, for each input window, computes  norm. The latter generalizes the notion of
distance or magnitude by raising the absolute values of the elements within the window to the
power of  summing them, and subsequently taking the -th root of this sum</p>
      <p>As can be observed, in this case the use of pooling layers does not improve the model’s accuracy. It is
also worth noting that training the model with LPPool1d requires at least 2.5 times more time compared
to other pooling functions, which have only a minor impact on the overall training time.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Convolutional Neural Network</title>
      <p>The convolutional model is structurally similar to the linear model, but with a crucial distinction: several
of the initial dense layers are replaced with convolutional (Conv1D) layers, which, in our case, are
applied to the processing of one-dimensional audio signals.</p>
      <p>Initially, we test diferent values of kernel size and dilation. At this stage, our objective is not to
identify the most “optimal” parameters for solving the task, but rather to determine which parameter
values are meaningful in our case. For this purpose, we construct a simple model consisting of an input
layer, a convolutional layer, a flatten layer, a single fully connected layer, and an output layer, with a
total parameter count of approximately 50,000. The default value of dilation is set to 1, meaning that
the kernel elements are positioned adjacent to one another.</p>
      <p>As can be observed, the lowest loss value was achieved with a convolutional layer configured with
dilation = 2 and kernel size = 5.</p>
      <p>Next, this layer will be duplicated while maintaining the total number of model parameters at
approximately 50,000. This approach enables the determination of the optimal number of convolutional
layers. The architecture with three convolutional layers demonstrated the highest eficiency; therefore,
it will be employed in subsequent experiments.</p>
      <p>A common approach in the design of convolutional neural networks is to combine layers with
diferent kernel sizes, often arranging them in order from the layer with the broadest receptive field (i.e.,
the greatest distance between the first and last elements of the kernel) to the one with the narrowest.
This strategy allows the model to capture large-scale dependencies in earlier layers, while later layers
specialize in detecting small, localized patterns. Perhaps the most well-known example of such an
architecture is AlexNet, which was employed for object detection and recognition in images and had a
profound impact on the development of deep learning. This architectural principle can be adapted for
time series analysis, which in our case corresponds to audio signals.</p>
      <p>It can be observed that the lowest loss was achieved with the configuration 7ks-5ks-3ks and dilation=2;
that is, in a model where the first convolutional layer has a kernel size of 7, the second 5, and the third
3. This demonstrates that employing a combination of layers with varying kernel sizes is meaningful
and leads to improved performance.</p>
      <p>Pooling layers are frequently employed in convolutional neural networks. They allow an increase
in the number of filters in convolutional layers without proportionally increasing the total number of
model parameters. We tested this approach on the three-layer architecture obtained in the previous
stage.</p>
      <p>t can be observed that the use of AvgPool1d slightly improved the model’s accuracy, though the efect
is not substantial. Thus, while this approach can be efective in general, in our specific case it does not
produce a radical improvement.</p>
      <p>In convolutional models, it is common to add one or more dense layers after the convolutional layers
to perform a form of “analysis” on the extracted features. We will attempt to determine the optimal
number of these dense layers by incorporating them into the model developed in the previous stage.
It is important to note that we will maintain the total number of parameters close to 50,000, which
requires slightly reducing the size of the convolutional layers.</p>
      <p>It can be observed that adding fully connected layers, in this case, does not contribute to achieving
lower loss values.</p>
    </sec>
    <sec id="sec-6">
      <title>6. UNet model</title>
      <p>When training the UNet model, an additional crucial modification to the dataset is required, beyond
merely reshaping the target data. First, in the development of the linear and convolutional models, we
used an input window of 201 samples to provide the model with equal context on both sides of the
target sample. However, for the UNet model, this window length is unnecessary because the model
restores an entire segment at once. Second, each encoder level reduces the window length by half;
therefore, to ensure correct operation of the algorithm, it is preferable to select an input window length
that is divisible by two at each encoder stage. Consequently, we slightly reduce the window length
from 201 to 192, which is the product of 64 and 3, allowing the encoder to accommodate up to six levels
of depth.</p>
      <p>We will test the model based on the UNet architecture across diferent parameter sizes. Unlike the
analogous testing conducted for the linear model, this evaluation will be performed on significantly
larger parameter counts. For comparison, we will also test the model using the same number of
parameters employed in the convolutional model experiments—50,000. Additionally, in this architecture,
it is possible to vary the number of “depth levels” in the encoder and, correspondingly, in the decoder;
however, for the present tests, we will fix this number at three.</p>
      <p>It can be observed that the model with 400,000 parameters, given suficient training time, achieves
significantly better results than the model with 50,000 parameters and slightly outperforms the model
with 200,000 parameters, likely due to the need for a larger dataset and extended training time. A similar
situation occurs with models containing 1,200,000 and 5,000,000 parameters, which require substantially
greater computational resources for training. However, the achieved loss values for these larger models
still remain higher than those obtained during testing of the linear and convolutional models, due to the
same limitations—high resource demands for training. Consequently, further experiments will focus on
architectures with 400,000 parameters.</p>
      <p>As noted above, in this architecture we can experiment with the number of levels, that is, the number
of convolutional layers in the encoder (and, correspondingly, in the decoder, since they are symmetrical).</p>
      <p>It can be observed that the model with four depth levels demonstrates superior performance compared
to the other models, suggesting that this depth may be “optimal” for our task. Next, we will test the
UNet architecture with four depth levels, 400,000 parameters, and varying convolutional kernel sizes.</p>
      <p>It can be observed that the minimum achieved loss increases with larger kernel sizes, indicating that
the optimal choice is to retain the initial kernel size of 3. Although the loss values obtained with this
architecture are higher than those achieved with the linear and convolutional models, it is important to
note that from a practical application perspective, this architecture may be more efective in certain use
cases due to its ability to reconstruct a larger number of samples simultaneously.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Results analysis</title>
      <p>Two neural network architectures were developed and tested for reconstructing the central sample
within a sliding window over an audio signal:
• Linear architecture: A model where the primary computations are performed by fully connected
(dense) layers.
• Convolutional architecture: A model in which the main computations are carried out by
convolutional layers.</p>
      <p>The optimal configuration for the linear model was identified as follows:
• Activation function: Tanh
• Number of hidden layers: 3
• Total parameters: 50,000</p>
      <p>Configurations with a greater number of layers and parameters were also tested; however, they
produced inferior results. This is typically attributable to insuficient training data or an inadequate
number of training epochs. Given the limitations in computational resources, a significantly larger
dataset could not be employed. Additionally, training more complex models generally requires a lower
learning rate, while the increased number of parameters necessitates longer training times. Considering
these constraints, fully training large models to their optimal capacity would have demanded a duration
incompatible with rapid parameter adjustments and extensive testing of multiple configurations, or
would have been infeasible given the available resources.</p>
      <p>Within the scope of the experiment, the linear model was tested with pooling layers inserted between
the fully connected layers; however, this approach did not result in any improvement in the loss function.</p>
      <p>The convolutional model, in its optimal configuration, demonstrated significantly better performance
than the linear model, achieving approximately 15% lower loss in one of the tested configurations.
Specifically, the minimum loss value for the convolutional model was obtained with the following
parameters:
• Dilation = 2 across all layers
• Layer configuration:
– First layer: kernel_size = 7
– Second layer: kernel_size = 5
– Third layer: kernel_size = 3
• Application of AvgPool1d following each convolutional layer
• Three hidden layers</p>
      <p>During the development of the convolutional model, we encountered the same limitations regarding
model size as observed with the linear model. Therefore, it is reasonable to assume that with greater
computational resources, even better performance could be achieved.</p>
      <p>An architecture based on UNet was also developed and tested for reconstructing an entire segment
of the audio signal at once. This model exhibited inferior performance in terms of the minimum
achieved loss value; however, it is important to note that it represents a considerably more complex
architecture than the previously tested linear and convolutional models. Consequently, it likely requires
substantially more data and training time. Additionally, this model reconstructs an entire signal segment
simultaneously—192 samples instead of a single sample as in the linear and convolutional models—which
may be advantageous in certain applications.</p>
      <p>The best results in testing the UNet-based architecture were obtained under the following
configuration:
• 4 levels in the encoder (and, correspondingly, in the decoder)
• 400,000 parameters
• Convolutional kernel size of 3 at each encoder level
• Pooling layer type: MaxPool1d</p>
      <p>We now proceed to evaluate the best-performing model (in terms of the achieved loss) on a real audio
recording to assess its practical performance. For this purpose, we selected a fragment of an audio track
in WAV format and applied the same preprocessing operations used during dataset preparation, thereby
producing a degraded WAV file with reduced quality. The trained model was then applied to this audio
fragment, and we compared the spectrograms of the degraded and the reconstructed signals.</p>
      <p>A detailed examination of the presented spectrograms reveals that, although some additional noise
components appear in the reconstructed audio, a considerable proportion of the smaller noise artifacts
are efectively suppressed. This observation indicates the potential of the tested architecture.</p>
      <p>It should be noted that during the course of this study, no additional evaluation methods for assessing
model performance were implemented — in particular, perceptual metrics or listening tests were not
conducted. This limitation is primarily due to the need for professional audio equipment and controlled
acoustic environments to ensure reliable perceptual assessment. Moreover, such metrics inherently
involve a degree of subjectivity, as they rely on human perception and auditory judgment, which can
vary significantly across listeners.</p>
      <p>For these reasons, the study relied on the analysis of spectrograms as the primary means of
evaluating the model’s output quality. Spectrogram-based evaluation provides an objective, reproducible
representation of the audio signal’s frequency–time characteristics, allowing for a more systematic
comparison between distorted and reconstructed signals. While this approach does not fully capture
perceptual nuances, it ofers a consistent and technically sound basis for assessing improvements in
signal reconstruction within the scope of the current research.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusions</title>
      <p>As a result of the conducted study, three distinct neural network architectures were developed and
compared for the task of restoring the quality of audio signals degraded by compression into lossy
formats.</p>
      <p>A data preparation methodology was designed, involving the conversion of audio files into MP3
format with a low bitrate (96 kbps) followed by reconversion into WAV format, thereby generating
pairs of "degraded–original" signals for model training. A sliding window approach of fixed length was
proposed, enabling the restoration of the central sample in the case of the linear and convolutional
models, or an entire segment in the case of the UNet-based architecture.</p>
      <p>As a result of the comparative analysis of the three architectures, the following observations were
made:
• Convolutional model: This architecture demonstrated the best performance, achieving an MSE
of 1.37e-4, which is approximately 15% lower compared to the linear model (MSE 1.61e-4). The
most efective configuration of the convolutional network was identified as one comprising three
layers with varying kernel sizes (7–5–3) and a dilation factor of 2.
• Linear model: The linear architecture produced satisfactory results with an optimal
configuration consisting of three hidden layers, the Tanh activation function, and approximately 50,000
parameters.
• UNet-like model: Although this architecture exhibited a higher loss value (MSE 3.11e-4), it
presents a potential advantage in its ability to reconstruct entire audio segments (192 samples) at
once, which may be beneficial in specific practical applications.</p>
      <p>A visual analysis of the spectrograms of the restored audio signals confirmed the efectiveness of the
developed models, particularly through the observed reduction of “minor” noise in the reconstructed
signals.</p>
      <p>The primary limitation of this study was the available computational resources, which did not allow
for the full training of more complex models with a larger number of parameters. Nevertheless, the
experimental results obtained using smaller models and a limited dataset convincingly demonstrate
the potential of neural networks for restoring the quality of audio signals. This opens avenues for
future research, including the use of larger datasets, more sophisticated architectures, and parameter
optimization to achieve state-of-the-art performance.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly in order to Grammar and spelling
check. After using these tool, the authors reviewed and edited the content as needed and takes full
responsibility for the publication’s content.
[3] Q. Liu, A. H. Sung, M. Qiao, Spectrum steganalysis of wav audio streams, in: P. Perner (Ed.),
Machine Learning and Data Mining in Pattern Recognition, Springer Berlin Heidelberg, Berlin,
Heidelberg, 2009, pp. 582–593.
[4] brandenburg karlheinz, mp3 and aac explained, journal of the audio engineering society (1999).
[5] S. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on
acoustics, speech, and signal processing 27 (2003) 113–120.
[6] J. Laroche, M. Dolson, New phase-vocoder techniques for pitch-shifting, harmonizing and other
exotic efects, in: Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to
Audio and Acoustics. WASPAA’99 (Cat. No. 99TH8452), IEEE, 1999, pp. 91–94.
[7] O. Casey, R. Dave, N. Seliya, E. R. S. Boone, Machine learning: Challenges, limitations, and
compatibility for audio restoration processes, in: 2021 International Conference on Computing,
Computational Modelling and Applications (ICCMA), IEEE, 2021, pp. 27–32.
[8] K. Fisher, A. Scherlis, Wavemedic: convolutional neural networks for speech audio enhancement,
2016.
[9] S. R. Park, J. Lee, A fully convolutional neural network for speech enhancement, arXiv preprint
arXiv:1609.07132 (2016).
[10] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, L. Xie, Dccrn: Deep complex
convolution recurrent network for phase-aware speech enhancement, arXiv preprint arXiv:2008.00264
(2020).
[11] K. Park, T. Mulc, Css10: A collection of single speaker speech datasets for 10 languages, arXiv
preprint arXiv:1903.11269 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Herby</surname>
          </string-name>
          , Jonas and Jonung, Lars and Hanke, Steve H.,
          <article-title>Did lockdowns work? : The verdict on Covid restrictions, number 1 in IEA Perspectives</article-title>
          , IEA,
          <year>2023</year>
          . URL: https://iea.org.uk/publications/ did-lockdowns
          <article-title>-work-the-verdict-on-covid-restrictions/.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>A tutorial on mpeg/audio compression</article-title>
          ,
          <source>IEEE MultiMedia 2</source>
          (
          <year>1995</year>
          )
          <fpage>60</fpage>
          -
          <lpage>74</lpage>
          . doi:
          <volume>10</volume>
          .1109/ 93.388209.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>