<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Speech and Language Impairment Detection by Means of AI-Driven Audio-Based Techniques</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Corvitto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Faiella</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Napoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriano Puglisi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuele Russo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer, Control and Management Engineering, Sapienza University of Rome</institution>
          ,
          <addr-line>Via Ariosto 25, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Psychology, Sapienza University of Rome</institution>
          ,
          <addr-line>Via dei Marsi 78, Roma, 00185</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>19</fpage>
      <lpage>31</lpage>
      <abstract>
<p>Speech and Language Impairments (SLI) affect a large and heterogeneous group of people. With our work, we propose a novel, easy, and immediate detection tool to help diagnose people who suffer from SLI using speech audio signals, along with a new dataset containing English speakers affected by SLI. In this work, we experiment with feature extraction methods such as the Mel Spectrogram and wav2vec 2.0, as well as classification methods such as SVM, CNN, and linear neural networks. We also work on audio data augmentation, trying to overcome the very common limitations imposed by data scarcity in the medical field. The overall results indicate that the wav2vec 2.0 feature extractor, paired with a linear classifier, provides the best performance, with a reasonably high accuracy of over 96%.</p>
      </abstract>
      <kwd-group>
<kwd>SLI</kwd>
        <kwd>AI</kwd>
        <kwd>audio</kwd>
        <kwd>healthcare</kwd>
        <kwd>speech</kwd>
        <kwd>learning disease</kwd>
        <kwd>feature extraction</kwd>
        <kwd>data augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The rapid development of the use of Artificial Intelligence (AI) techniques in a broad range of scientific fields has helped solve real-life problems; in particular, the new advancements revolutionized a wide variety of areas such as Natural Language Processing (NLP) [1], computer vision [2, 3], robotics, and many more. Due to the huge volume of medical data being generated worldwide, there is a clear need for efficient use of this information to benefit health sectors around the world [4, 5]. The medical community has taken strong notice of the potential of these new technologies in AI. Machine Learning (ML) thrives in areas where there are lots of data, therefore ML is one of the essential and most effective tools in analyzing highly complex medical data [6]. For example, analyzing medical data originating from disease diagnosis with the aid and benefits given by these tools could be a lot more financially efficient. In healthcare, it is also vital that diseases are detected early on during diagnosis and prognosis. The success of these AI methods has also spread across other domains, including speech recognition and the music recommendation task [7]. Due to the relevance of such systems in our day-to-day lives, there is an increasing need for effective and efficient audio classification systems. Automatic classification technologies are widely applied in voice assistants [8], chatbots [9], smart safety devices [10, 11], and in different real-world environments [12, 13, 14].</p>
      <p>Our project aims to conciliate these two worlds and design a Deep Learning (DL) model that can detect, from a given audio input, if the speaker could be affected by a speech and language impairment. Individuals with a Speech and Language Impairment (SLI) generally represent, despite normal hearing, normal nonverbal intelligence, adequate social functioning, and no obvious signs of brain injury, a heterogeneous group of people with significant difficulty in learning languages [15]. One of the defining characteristics of SLI is speech disfluency, more specifically impaired acquisition of pattern-based components in language, such as morphology, syntax, and some aspects of phonology such as stuttering. This commonly used definition leads to early hypotheses regarding the etiology of SLI: that an impaired language-specific learning mechanism underlies language development and disorders [16, 17, 18]. This disorder is deemed “primary” or “specific” when there is no clear explanation for these lags in language skills; a defining characteristic of primary language disorder is that its cause is unknown [19]. Language disorders are also linked to a heightened risk for psychiatric concerns, attentional difficulties, social-behavioral problems, and learning disabilities [20, 21].</p>
      <p>Many current trends in audio signal processing rely on data-driven machine learning approaches to achieve state-of-the-art results [22, 23, 24]. However, the quantity and quality of available data influence heavily the achieved performance for a task. Depending on the specific task, as for our case study, such data can often be hard to obtain and costly to label, particularly in the audio domain. As a consequence, researchers often have to deal with datasets of insufficient size or quality. Usually, diagnosis of this type of problem is carried out by human experts, with special in-loco tests [25, 26] or with the aid of tools such as the electroencephalogram (EEG) [27]. We want to design an easy and accessible model that can detect if a person could be affected by an SLI without having to go through complex and time-consuming procedures. In this manner, such a model could also be implemented in robots from a human-robot interaction (HRI) perspective, allowing the machine to detect people with SLI and change its behavior and form of interaction accordingly.
      </p>
<p>This study proposes an analysis of a novel, yet simple, approach of using exclusively audio recordings for SLI detection. Specifically, in Section 2 we start by exploring the current literature; then we talk about the problems faced in collecting our data and how we handled them in Section 3. After that, in Section 4, we go through an analysis of the techniques and models used to perform the detection, present the trials and results we obtained from them in Section 5, and then discuss the limits of our approach in Section 6. We finally draw our conclusions in Section 7.</p>
      <p>2. Related Works</p>
      <p>In the ever-evolving landscape of computer science and artificial intelligence, the domains of audio data augmentation and feature extraction are undergoing very rapid research and advancements. In the following sections, we delve into the story and explore the state of the art in these fields.</p>
      <p>2.1. Audio Data Augmentation</p>
      <p>One of the most important challenges in developing an efficient and effective audio classification system is accessing a large and well-annotated dataset. One of the main obstacles in developing sound classifications is the lack of a sufficient quantity of labeled data. This is due to the following main reasons: class imbalance, data privacy issues, time constraints involved in data collection, high dependency on expertise for effective annotation, etc. [28, 29, 30]. Data Augmentation (DA) is defined as the creation of new data by adding deformations to increase the variety of the data, so that these deformations do not change their semantic value. It is well known that DA can improve the algorithm's performance, tackle the issue of overfitting [31, 32], and improve the generalization ability of Deep Neural Networks (DNN); this happens because DA averages over the orbits of the group that keeps the data distribution invariant, which leads to variance reduction [33]. DA is key when dealing with problems regarding audio signals because the Convolutional Neural Network (CNN) is the most widely used model in audio applications, and when faced with small datasets, CNN's capacity for information retention becomes a flaw: the models memorize the training data and lose performance on new data [34, 35]. In addition to increasing generalization capabilities, the augmentation of data also allows the designed system to improve data significance, regardless of the available data samples [36, 37]. These strategies include methods on raw audio signals, as well as applying other techniques on samples converted into spectrograms, or even more complex approaches such as interpolation and nonlinear mixing on the spectrum. We will now list and briefly explain the most used audio augmentation techniques.</p>
      <p>Pitch Shifting. The tone of each audio signal in the dataset is lowered or raised by a factor preserving its duration.</p>
      <p>Time Stretching. The audio sample is slowed down or sped up by a ratio without altering the pitch drastically.</p>
      <p>Time Shifting. Time is shifted to the left or to the right by a random factor or by a predetermined amount.</p>
      <p>Volume Adjustment. The volume of the audio file is altered: there is a change in loudness, or sometimes a dynamic range compression is applied.</p>
      <p>Noise Addition. Noise is introduced into the samples; other than simple random Gaussian noise there are many types of noise, such as white noise [38], babble noise, static noise [39], factory noise, etc.</p>
      <p>Speed Up. The signal is resampled at a different sampling rate and later returned to the original sampling rate, resulting in a speed change.</p>
      <p>Filtering. Several kinds of filters are applied to the input audio. Most of the common filters are band-pass, band-stop, high-pass, high-shelf, low-pass, low-shelf, and peaking filters.</p>
      <p>This topic is so important that researchers also developed and designed methods that generate entirely new samples, for example with the aid of a Generative Adversarial Network (GAN): in [40] people created new variants of the audio samples that already existed in their dataset and then utilized an evolutionary algorithm to search the input domain to select the best-generated samples; in this way they were able to generate audio in a controlled manner that contributed to an improvement in classification performance on the original task. One very recent DA method proposed by Google is SpecAugment [41]: in this method, the two-dimensional spectrum diagram is treated as an image with time on the horizontal axis and frequency on the vertical axis. Encoder-decoder networks are also becoming very popular in fields different from NLP, because they can convert a high-dimensional input into a lower-dimensional vector in latent space; researchers in [42] have experimented with a Long Short-Term Memory (LSTM) based auto-encoder to produce artificial data.</p>
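      <p>As a rough illustration of the SpecAugment idea, the sketch below applies SpecAugment-style frequency and time masking to a mel spectrogram with torchaudio; the masking parameters are illustrative choices and are not taken from [41].</p>
      <preformat>
# SpecAugment-style masking sketch (illustrative parameters).
import torch
import torchaudio.transforms as T

def spec_augment(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    # Treat the spectrogram as an image: frequency on one axis, time on the other.
    mel = T.MelSpectrogram(sample_rate=sample_rate, n_fft=1024,
                           hop_length=512, n_mels=64)(waveform)
    # Zero out a random band of mel-frequency bins (up to 15 bins wide).
    mel = T.FrequencyMasking(freq_mask_param=15)(mel)
    # Zero out a random span of time frames (up to 35 frames long).
    mel = T.TimeMasking(time_mask_param=35)(mel)
    return mel

# Example: augment one second of synthetic audio at 16 kHz.
augmented = spec_augment(torch.randn(1, 16000), 16000)
print(augmented.shape)  # (1, 64, n_frames)
      </preformat>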
      <p>2.2. Audio Feature Extraction and Models</p>
      <p>It should be noted that data augmentation is not the only way to reduce overfitting and improve the generalization ability of DL models. Model structure optimization, transfer learning, and one-shot and zero-shot learning are also known strategies that deal with overfitting from different aspects. We will now focus on the most common processing flow of audio classification: preprocessing the original audio data, feature extraction, and feeding the features into the DL model. Audio signals have very high dimensionality, so thousands of floating point values are required to represent even a short audio signal, raising the need for exploring dimensionality reduction and feature extraction methods.</p>
      <p>How well or poorly a model performs is also determined by the choice of features used: feature representation is crucial to improve the performance of learning algorithms in the sound classification task. One of the first features that comes to mind when thinking of an audio signal is the spectrogram; its characteristics have been widely used by previous researchers in different domains of sound classification, such as heartbeat sounds to detect heart diseases [43]. Another method used to extract features implements the Mel-Frequency Cepstrum (MFC), which is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency; the Mel-Frequency Cepstral Coefficients (MFCC) were successful in representing sounds for the detection of respiratory diseases [44]. Some methods that also use the MFC are the log-mel [45], mel filter bank energy [46], inverted MFCC [47], and many more. Although the mel spectrogram and MFCC are commonly used, people also implement bag of audio words [48], Discrete Gabor Transform (DGT) audio image representation [49], ZCR, entropy of energy, spectral centroid, spectral spread, spectral entropy [50], and so on.</p>
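      <p>For illustration, MFCC features of the kind discussed above can be extracted with librosa as in the following sketch (this is not the feature pipeline used in this work, which relies on wav2vec 2.0 and log mel spectrograms):</p>
      <preformat>
# Hedged sketch: MFCC extraction with librosa.
import librosa
import numpy as np

# librosa ships example audio; replace with any WAV file path.
y, sr = librosa.load(librosa.example("trumpet"))

# 13 cepstral coefficients per frame is a common baseline choice.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, n_frames)

# Frame-level features are often pooled into one fixed-size vector
# before being fed to a classical ML classifier such as an SVM.
feature_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
      </preformat>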
      <p>Classification is a common task in ML and pattern recognition. DL methods applied in these tasks, such as CNN models, often do not perform as well as more traditional ML methods, such as random forest, AdaBoost, etc., especially on small data [51]. On the other hand, typical ML algorithms, such as ensemble classifiers, have been shown to learn features better and adapt more, with improved generalization abilities, even in the case of small and imbalanced datasets. Over the past years, different ML algorithms have been used for detecting sound events and medical sounds, and the achieved results were of great significance. Classifiers such as the Support Vector Machine (SVM) have been shown to be very effective in sound classification tasks [52]; also, MultiLayer Perceptrons (MLP) were very useful in person identification using speech and breath sounds [53], as were Hidden Markov Models (HMM) [54], logistic regression and linear discriminant analysis [55], and others. Some studies exploited the effectiveness of multiple simpler methods with ensemble methods such as random forests [56, 51], XGBoost [57], and so on. Unfortunately, considering the complexity of sound and the need to sometimes train an extremely sensitive classifier that can identify different representations of sound features, traditional ML still suffers in these kinds of tasks from having less complex models. In this case, the choice of DL methods has been proven to be more efficient. DL methods differ from traditional ones because they can extract meaningful features from data through the application of a hierarchical structure [58]; CNNs were able to achieve significant and more accurate training results [59]. People have tried to combine the best of these two worlds by implementing hybrid methods: for example, researchers merged an SVM and a GRU-RNN in [60].</p>
      <p>3. Dataset</p>
      <p>In the medical field, particularly regarding specific problems such as the one presented in this paper, data is not always freely available, or available at all. This is mostly due to privacy concerns [61, 62]. Another important reason, which is also related in some ways to privacy [62], lies in the overall low level of digitization of healthcare information [63]; in fact, according to Gopal G. et al. [64], healthcare has the lowest level of digital innovation compared to other industries, such as media, finance, insurance, and retail, contributing to limited growth of labor productivity. In addition to this, it is also worth noting that not every dataset containing the desired medical information is also in the desired format, in which case the only remaining option is to create an entirely new dataset from scratch, which is what we did.</p>
      <p>3.1. Data Collection</p>
      <p>The process of collecting audio data is a pivotal phase in this research. For our dataset, we aimed to collect a sufficient amount of pure, non-multimodal audio data in a waveform representation. Audio data can be stored in various formats, each with its characteristics, trade-offs, and use cases. Common audio formats include the Waveform Audio File Format (WAV), MPEG-1 Audio Layer 3 (MP3), the Free Lossless Audio Codec (FLAC), and more. These formats differ in terms of compression, quality, and compatibility. For this study, we opt for the WAV format [65], an uncompressed audio file format, developed by IBM and Microsoft, that efficiently stores audio data in a waveform representation without any loss of information. Thanks to its characteristics, which guarantee the highest amount of information for an audio signal, WAV is the audio format used as input by wav2vec 2.0 [66], a state-of-the-art speech model developed by the Facebook AI Research (FAIR) group that is one of the models used in this work.</p>
      <p>The data collection process began with the identification of audio samples containing English speakers affected by Speech and Language Impairment (SLI) originating from different conditions. This diverse dataset was intentionally curated to optimize the performance of SLI detection. By including speakers with a range of impairments, the model is exposed to a broad spectrum of speech patterns and anomalies, thereby enhancing its ability to accurately detect SLI in real-world applications. To source such data, we turned to YouTube, a vast and user-friendly repository of video and audio content. The videos found were then converted into audio files in WAV format using an online converter.</p>
      <p>We finally paired the collected data with a subset of the LibriSpeech dataset [67], containing healthy English speakers only.</p>
      <p>3.2. Data Preprocessing</p>
      <p>To feed the waveform signals to the model, we needed to ensure that they were appropriately prepared and processed. Effective data preprocessing is fundamental to enhancing the model's performance, as it directly impacts the model's ability to extract meaningful patterns and insights from raw input data. This was performed in different steps. Firstly, we identified different time windows in each audio file to cut out unnecessary information, keeping just human speech sounds (with or without background noise). After that, we analyzed the time windows by dividing each one of them into smaller ones, each containing the speech of one single person. Even if different tools exist to detect human speech, considering the scarcity of data we suffer from, we decided to perform this step manually, to be sure that the quality of our dataset is not affected. Secondly, we split the time windows that we obtained into 3-second clips. We chose this length as a trade-off between a sufficient length, to capture fluency information, and a brief duration. Our decision was also based on the standard approach used in the state of the art working with wav2vec 2.0 in these kinds of tasks [68, 69]. Then these clips were saved in two different subsets, creating the Train and the Test set, ensuring that the same speakers do not overlap in both datasets.</p>
      <p>Finally, the acquired data was augmented to increase its dimension. We applied the following audio augmentation techniques: Time shifting, Time stretching, Pitch shifting, and Noise addition, using Gaussian noise. To do so, we used the Python library audiomentations [70]. For Time shifting we resampled the time windows, shifting the starting time further by 1.5 seconds; for Time stretching we slowed down the speed of the audios by a ratio of 0.8; for Pitch shifting we both lowered and raised the pitch tone by a value of 3, obtaining two additional clips for each one; finally, for Noise addition, we added Gaussian noise with a 0.01 amplitude. Audio waveforms before and after noise addition are shown in Fig. 1 and Fig. 2. All the augmentation techniques were applied on the original audio: Time shifting was directly applied on the time windows, while the other ones on the initial 3-second clips. The number of samples in the created dataset is shown in Table 1, while in Table 2 we collect the audio data augmentation techniques used and their respective parameters.</p>
      <p>Figure 2: Audio sample with noise from our dataset.</p>
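      <p>A minimal sketch of these augmentations, using audiomentations as in our pipeline (parameter names may differ slightly across library versions; the example audio stands in for our clips, and the 1.5 s time shift is done here with plain slicing):</p>
      <preformat>
# Augmentation sketch with the parameters described above.
import librosa
from audiomentations import AddGaussianNoise, PitchShift, TimeStretch

# Stand-in for a clip from our dataset.
y, sr = librosa.load(librosa.example("trumpet"), sr=16000)

# Time stretching: slow the audio down by a ratio of 0.8.
stretch = TimeStretch(min_rate=0.8, max_rate=0.8, p=1.0)

# Pitch shifting: raise and lower the pitch by 3 semitones.
pitch_up = PitchShift(min_semitones=3, max_semitones=3, p=1.0)
pitch_down = PitchShift(min_semitones=-3, max_semitones=-3, p=1.0)

# Noise addition: Gaussian noise with 0.01 amplitude.
noise = AddGaussianNoise(min_amplitude=0.01, max_amplitude=0.01, p=1.0)

augmented = [aug(samples=y, sample_rate=sr)
             for aug in (stretch, pitch_up, pitch_down, noise)]

# Time shifting: start the window 1.5 seconds later.
shifted = y[int(1.5 * sr):]
      </preformat>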
      <sec id="sec-1-1">
        <title>3.3. Data Management</title>
<p>The dataset contains audio files in the WAV format, so its data is affected not only by the format's advantages but also by its drawbacks. The complete dataset, which comprises both original and augmented data, was too large to be loaded in an online manner using the original files. To overcome this problem we loaded the data in batches and concatenated them into subsets that were saved in the .arrow format [71], a columnar memory format for flat and hierarchical data, organized for efficient analytic operations. In this way, large data can be saved, loaded, and processed avoiding memory usage problems.</p>
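        <p>A hedged sketch of this batch-wise, Arrow-backed workflow with the HuggingFace datasets library (the file name and the toy batches are hypothetical):</p>
        <preformat>
# Saving batch-wise data in the Arrow format with the datasets library.
from datasets import Dataset, concatenate_datasets

# Each batch is a dict of columns, e.g. raw waveforms plus labels.
batches = [
    {"audio": [[0.0, 0.1], [0.2, 0.3]], "label": [0, 1]},
    {"audio": [[0.4, 0.5], [0.6, 0.7]], "label": [1, 0]},
]

# Convert each batch and concatenate into a single Arrow-backed dataset.
subset = concatenate_datasets([Dataset.from_dict(b) for b in batches])

# save_to_disk writes .arrow files that can be memory-mapped later,
# so large datasets are loaded without exhausting RAM.
subset.save_to_disk("sli_dataset/train")
        </preformat>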
        <p>The best way to approach a problem is to know deeply every factor that influences it and how its key components work; after that, one can tackle it and try to capture its essence with the maximum capabilities. In the following subsections, we present a brief description of the techniques we used and the models we implemented.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Models and Techniques Used</title>
<p>The way humans hear frequencies in sound is known as pitch: it is a subjective impression of the frequency. Humans do not perceive frequencies linearly; on the contrary, they are more sensitive to differences between lower frequencies than higher ones. For example, the difference between audio at 100 Hz and at 200 Hz sounds way bigger than the one between 1000 Hz and 1100 Hz, even though the absolute difference is the same. Humans perceive sounds on a logarithmic scale rather than a linear scale. The Mel Scale [72] was developed to take this into account by conducting experiments with a large number of listeners. It is a scale of pitches such that each unit is judged by listeners to be equal in pitch distance from the next. The human perception of the amplitude of a sound is called loudness; similarly to frequency, loudness is also heard logarithmically rather than linearly. The Decibel scale is used to measure the loudness of a sound: for example, a sound of 20 dB is 10 times more intense than one of 10 dB. We can see that, to deal with sound realistically, we need to use logarithmic scales, via the Mel Scale and the Decibel scale, when dealing with frequencies and amplitudes in our data. Spectrograms are generated from sound signals using Fourier Transforms. A Fourier Transform (FT) [73] is a mathematical formula that allows us to decompose a signal into its constituent frequencies and displays the amplitude of each frequency present in the signal; in other words, an FT converts the signal from the time domain into the frequency domain, and the result is called a spectrum. A spectrogram is obtained by dividing the sound signal into smaller time segments, applying the FT to each segment, and finally combining these segments into a single plot. A Mel Spectrogram makes two important changes relative to a regular spectrogram that plots frequency vs. time: it uses the Mel scale instead of frequency on the y-axis and the Decibel scale instead of amplitude to indicate color. In Fig. 3 we can see a normalized version of the Mel spectrogram of one of the audios present in the dataset.</p>
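      <p>For illustration, the log mel spectrogram described here can be computed with librosa as in the following sketch, using the parameters later reported in Table 3 (the bundled example audio is a stand-in for our clips):</p>
      <preformat>
# Log mel spectrogram sketch with the Table 3 parameters.
import librosa
import numpy as np

y, sr = librosa.load(librosa.example("trumpet"), sr=22050)  # resample to 22050 Hz

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)

# Convert power to decibels, then normalize to [0, 1] for the CNN input.
log_mel = librosa.power_to_db(mel, ref=np.max)
log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min())
      </preformat>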
      <p>4.2. Wav2vec 2.0</p>
      <p>Wav2vec 2.0 [66] is an exceptional tool that learns powerful representations from speech by mimicking the human learning experience: people start comprehending language from the early stages of their lives without labeled data, i.e., kids learn from listening to the adults around them. It is also able to outperform state-of-the-art models while using 100 times less labeled data, thus demonstrating the feasibility of training without huge amounts of labeled data, which are very hard to obtain in a field dealing with a complex medium such as audio.</p>
      <p>Figure 4: Wav2vec 2.0 pipeline.</p>
      <p>The model can be visualized in Fig. 4; next, we describe its components.</p>
      <p>Multi-layer convolutional feature encoder. It consists of several blocks containing a temporal convolution followed by layer normalization and a GELU activation function.</p>
      <p>Context network. It follows the Transformer architecture; differently from a normal Transformer, which uses fixed positional embeddings, a convolutional layer is used instead, acting as a relative positional embedding. The output of the convolution followed by a GELU is added to the inputs, and then a layer normalization is applied.</p>
      <p>Quantization module. It discretizes the output of the feature encoder to a finite set of speech representations via product quantization. Product quantization amounts to choosing quantized representations from multiple codebooks and concatenating them. The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way.</p>
      <p>The feature encoder f : X → Z takes as input the raw waveform x and outputs the latent speech representations z1, ..., zT for T time steps; these are then fed to the transformer g : Z → C, which captures information from the entire sequence and outputs context representations c1, ..., cT. The output of the feature encoder is also discretized to qt with the quantization module, to represent the targets in the self-supervised objective. During the model's pre-training, a part of the latent speech representations generated by the feature encoder is masked, and the model learns the representations of speech audio by solving a contrastive task, which requires identifying the true quantized latent speech representation for a masked time step within a set of distractors. After pre-training on unlabeled speech, the model is fine-tuned on labeled data with a Connectionist Temporal Classification (CTC) loss.</p>
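      <p>As an illustration of this pipeline, the following sketch (using the HuggingFace transformers library and the wav2vec2-base checkpoint later employed in Section 5.1) extracts frame-level context representations from a stand-in waveform:</p>
      <preformat>
# Hedged sketch: extracting wav2vec 2.0 features with transformers.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000 * 3)  # stand-in for a 3-second, 16 kHz clip
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional context vector per ~20 ms frame.
print(outputs.last_hidden_state.shape)  # (1, n_frames, 768)
      </preformat>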
      <sec id="sec-2-1">
        <title>4.3. Classification Methods</title>
        <p>Classification is the part that stands out the most in an entire model, because it outputs the labels that are used to compute the evaluation metrics. Even though classifiers are the most noticeable part of a model, in our case they are just the final piece of the puzzle, since most of the work is done in the previous steps of the pipeline; still, we want to pay some attention to the types of classifiers we used in our work.</p>
        <p>The Support Vector Machine (SVM) [74] is one of the first algorithms learned by every ML expert: it is simple, yet it can achieve excellent results, especially with small amounts of data, where other ML algorithms tend to have some difficulties. The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N being the number of features) that distinctly classifies the data points. To separate the two classes of data points, many possible hyperplanes could be chosen. SVM finds a plane that has the maximum margin, i.e. the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence. The biggest difficulty encountered when testing the SVM is that, even with low amounts of data, the model had memory issues, since audio features are extremely large; moreover, SVM excels with data that has fewer classes, thus making it hard to fully exploit SVM's strengths.</p>
        <p>One of the best and most efficient methods to generate labels from an ML model is adding a linear layer at the end of the pipeline; that is what we did with our wav2vec 2.0 feature extractor: we included a linear classifier f(x; W, b) = W·x + b and trained its weights to output two types of labels, one for people affected by an SLI and one for the others.</p>
        <p>ResNet34 is a very famous residual neural network that was pre-trained on ImageNet-1k and released by Microsoft [75]; thanks to residual learning and skip connections, this type of model can be much deeper than normal convolutional neural networks. We decided to fine-tune this model with the features extracted with the log mel spectrogram from our dataset.</p>
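        <p>A minimal sketch of such a linear head in PyTorch, assuming mean pooling over the frame-level wav2vec 2.0 features (768-dimensional for the base model; the pooling choice is an illustrative assumption):</p>
        <preformat>
# Linear classification head f(x; W, b) = Wx + b over pooled features.
import torch
import torch.nn as nn

head = nn.Linear(768, 2)  # two labels: SLI-affected vs. healthy

features = torch.randn(8, 149, 768)  # batch of frame-level features
pooled = features.mean(dim=1)        # average over time frames
logits = head(pooled)                # (8, 2) class scores
predictions = logits.argmax(dim=-1)
        </preformat>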
      </sec>
      <sec id="sec-2-2">
        <title>5.1. Architectures</title>
        <sec id="sec-2-2-1">
          <title>Our first approach was to use the wav2vec 2.0 model,</title>
          <p>in particular the pre-trained wav2vec2-base model from
HuggingFace [76], to perform Feature Extraction on the
pre-processed non-augmented dataset and then use a Table 3
SVM, the Support Vector Classifier (SVC) model from Parameters used to compute the Spectrogram
scikit learn [77], to perform the classification process
taking the extracted features in input. As it was explained in Log Mel Spectrum Parameters
the previous section 4, wav2vec2.0 takes a raw waveform Sample rate 22050
signal as input, 3 seconds clips in WAV format in our Windows length 2048
case, then extracts audio features from them following Hop length 512
what it had learned in its previous training. The extracted N mels 128
features were then standardized using the StandardScaler
from scikit learn, removing the mean and scaling them to Table 4
unit variance. The standardization of a dataset is a com- Architectures Accuracy
mon requirement for many ML estimators: they might
behave badly if the individual features do not more or less Models Accuracy
look like standard normally distributed data (e.g. Gaus- LASSO (Full Model) [78] 0.84
sian with 0 mean and unit variance). Finally, we fitted 1NN CHI Strategy [79] 0.8832
the SVM using a linear kernel. LMT BL Strategy [79] 0.9269</p>
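        <p>A minimal sketch of this classifier stage, with a random stand-in for the extracted feature matrix:</p>
        <preformat>
# Standardization followed by a linear-kernel SVC, as described above.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_train = np.random.randn(100, 768)          # stand-in wav2vec 2.0 features
y_train = np.random.randint(0, 2, size=100)  # 0 = healthy, 1 = SLI

scaler = StandardScaler()                 # zero mean, unit variance
X_scaled = scaler.fit_transform(X_train)

clf = SVC(kernel="linear")
clf.fit(X_scaled, y_train)
        </preformat>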
        <p>Using the SVM model as a classifier was our first attempt to cope with the limited number of samples at our disposal. Once the dataset was augmented, we ceased to use the SVM due to its intrinsic limitations in working with large datasets, and we opted for a completely DL approach.</p>
        <p>For our second architecture, we substituted the classifier head with a simple Fully Connected (FC), or linear, layer, keeping the wav2vec 2.0 model to perform the feature extraction, this time on the augmented dataset. We trained the model for 5 epochs through the Trainer class by HuggingFace, on batches of 32 samples each, setting the learning rate to 2e-5 after a warm-up period at a ratio of 0.1 and decreasing its value linearly till the end of the training.</p>
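        <p>A hedged sketch of this fine-tuning setup with the HuggingFace Trainer is shown below; the output directory name is arbitrary, and the train_set/test_set objects are assumed to be already-preprocessed datasets:</p>
        <preformat>
# Fine-tuning wav2vec 2.0 with a classification head via Trainer.
from transformers import (AutoModelForAudioClassification,
                          Trainer, TrainingArguments)

model = AutoModelForAudioClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2)

args = TrainingArguments(
    output_dir="sli-wav2vec2",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",  # linear decay after warm-up
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_set,  # preprocessed datasets,
                  eval_dataset=test_set)    # assumed to exist
trainer.train()
        </preformat>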
        <p>The last architecture tested was a CNN, more precisely ResNet34, that received as input the log mel spectrogram of the audios and generated as output the labels of the given audio. All the procedures to extract the spectrogram were carried out with the librosa library: first the sample was resampled at a new rate of 22050 Hz, then the extracted mel spectrogram was normalized and finally scaled. Regarding the CNN, only the last layer was modified: it was replaced with a linear layer with two output channels, and the whole model was fine-tuned without freezing the previous layers. Training was carried out for 50 epochs; the learning rate started at 2e-4 and decayed by a factor of 10 every 10 epochs; the loss function used was the CrossEntropyLoss. All parameters used to compute the spectrum are shown in Table 3.</p>
        <p>Table 3: Parameters used to compute the Log Mel Spectrogram. Sample rate: 22050; Window length: 2048; Hop length: 512; N mels: 128.</p>
        <p>Table 4: Architectures Accuracy. LASSO (Full Model) [78]: 0.84; 1NN CHI Strategy [79]: 0.8832; LMT BL Strategy [79]: 0.9269; MLP BL Strategy [79]: 0.9013; NB BL Strategy [79]: 0.9269; CNN [80]: 0.8421. Our models: Wav2vec2.0 + SVM: 0.6627; Wav2vec2.0 + FC: 0.9661; Log Mel Spectrogram + CNN: 0.9362.</p>
        <sec id="sec-2-2-2">
          <title>5.2. Evaluations</title>
          <p>In Table 4 we show the accuracy of our architectures, compared with other architectures [78]. As we can see, the first model is the one with the lowest score. This means that, despite the ability of the SVM to avoid overfitting on the poor quantity of data provided, it cannot accurately detect the speakers affected by SLI; this is probably due to the magnitude of the feature space extracted by the wav2vec 2.0 model. Using, instead, an augmented dataset together with a DL approach, we manage to reach a very high value of accuracy, the highest of our models: the wav2vec 2.0 feature extractor, having enough data to work with, managed to extract the key features and information needed to correctly identify whether a voice belongs to a healthy speaker or an impaired one. The CNN model that was fine-tuned with log mel spectrum features achieved great accuracy in labeling samples; unfortunately, through a more accurate analysis of the confusion matrices shown in Figs. 5, 6, and 7, we discovered that the number of false negatives is extremely high compared to the false positives. In the medical field, especially for tools helping with diagnosis, it is crucial to have the smallest number of false negatives, since an undetected disease is much worse than a false positive: medical operators could be missing a lot of vital anomalies, and in time they would lose trust in the system. In our case, recall is way more important than the precision score; from Table 5 we can see that the CNN model reaches a recall score of only 0.85, while wav2vec 2.0 achieves a better recall and F1 score.</p>
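          <p>Since the recall/false-negative distinction drives this discussion, the short check below, on made-up labels rather than our actual predictions, shows how the confusion matrix, recall, and F1 score relate:</p>
          <preformat>
# Illustrative metrics check (synthetic labels, not the paper's outputs).
from sklearn.metrics import confusion_matrix, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = SLI, 0 = healthy
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]  # one false negative

print(confusion_matrix(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))  # fraction of SLI cases caught
print("f1:", f1_score(y_true, y_pred))
          </preformat>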
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>6.1. Limitations</title>
        <p>The lack of data quantity and quality is one of our major constraints. The problem of data scarcity has already been addressed in Section 3, so we will now talk about quality. In the realm of ML and DL, it has been well documented that the issue of low-quality data and disparities in data collection methodologies exacerbate the inherent biases within the data when utilized for training algorithms; a clear example is given by the societal or political biases reflected in word embeddings or large language models [81, 82]. This concern arises when the data collected for training purposes exhibits significant variations in quality and collection techniques, resulting in a heightened vulnerability to intrinsic biases within the data. Such biases can subsequently propagate through the training process, influencing the performance and fairness of ML and DL algorithms and leading to further disparities and discrimination in the real world, due to the accessibility of such tools [83, 84]. Particularly, in our work, the collection of English speakers affected by SLI presents the limitation of containing mostly speakers with American accents. In real-world applications this can have negative effects on the model performance: for example, the algorithm could achieve higher and better results with American people rather than with Mexican ones, or with other English-speaking minority ethnic groups whose accent differs from the standard American one [84].</p>
        <p>Another limitation of our dataset is that it does not contain child speakers. This is because finding such materials on the web is often difficult, and it is even more difficult to create them from scratch, due to the small number of certified children affected by SLI and, since they are minors, due to stricter privacy concerns. The most used dataset in this field [85] consists of one-second clips of Czech-speaking children, both healthy and affected by SLI. Although this dataset could be useful for the detection of SLI, it is limited to the Czech language and to child speakers. This kind of limitation is common in the healthcare field, especially in SLI detection.</p>
      </sec>
      <sec id="sec-2-4">
        <title>6.2. Future Works</title>
        <sec id="sec-2-4-1">
          <title>The lack of data quantity and quality is one of our major</title>
          <p>constraints. The problem of data scarcity has already Future works should focus on the creation of a new
been addressed in section 3 so we will now talk about dataset comprising people speaking diferent languages,
quality. since it is not yet known, to our knowledge, whether
flu</p>
          <p>In the realm of ML and DL, it has been well docu- ency problems can be generalized in all languages and a
mented that the issue of low-quality data and disparities wide age range, knowing that the features and the overall
in data collection methodologies exacerbate the inherent characteristics of the voice between children and adults
biases within the data when utilized for training algo- change in general, due to their anatomical diferences
rithms, a clear example is given by the societal or political [86].
biases reflected in word embeddings or large language Given the technological advancement in the field of
models [81, 82]. This concern arises when the data col- generative audio with astonishing tools such as the
aulected for training purposes exhibits significant varia- dio manipulation software produced by ElevenLabs [87],
tions in quality and collection techniques, resulting in which can clone voices, generate new ones, translate
a heightened vulnerability to intrinsic biases within the them into other languages, and make them read texts,
data. Such biases can subsequently propagate through new kinds of audio enhancement can be experimented
the training process, influencing the performance and with, and although they cannot be used now, because
fairness of ML and DL algorithms leading to further dis- they cannot replicate stuttering or other kinds of fluency
parities and discrimination in the real world, due to the features that characterize people afected with SLI yet,
accessibility to such tools [83, 84]. Particularly, in our they are promising tools to take into consideration for
work, the collection of English speakers afected by SLI the near future.
presents the limitation of containing mostly speakers
with American accents. In real-world applications this 7. Conclusions
can have negative efects on the model performance, for
example, the algorithm could achieve higher and better This work proposes a novel approach to Speech and
Lanresults with American people rather than with Mexican guage Impairment (SLI) detection, based solely on audio
ones, or other English-speaking minority ethnic groups and AI audio-based techniques, together within an
enof people whose accent diefrs from the standard Ameri- tirely new dataset composed of English speakers afected
can one [84]. by SLI. The results show that, even with some limitations</p>
          <p>Another limitation of our dataset is that it does not related to the scarcity of data available, Deep Learning
contain children speakers. This is because finding such methods can achieve accurate estimations on healthy
materials on the web is often dificult, and it is more or impaired speakers. In particular, wav2vec 2.0, with a
dificult to create them from scratch due to the small Fully Connected layer as the classification head, reaches
number of certified children afected by SLI and, since an accuracy of over 96% on our test set. Our findings also
they are minors, due to more strict privacy concerns. confirm that data audio augmentation techniques are
funThe most used dataset in this field [ 85] consists of one damental to training Deep Learning models adequately.
second clips of Czech speaking children, both healthy or
afected by SLI. Although this dataset could be useful for
the detection of SLI, it is limited to the Czech language
and children speakers. This kind of limitation is common
in the healthcare field, especially in SLI detection.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1111/j.1365-
          <fpage>2664</fpage>
          .
          <year>2011</year>
          .
          <year>01993</year>
          .x. doi:https://doi. Disabil.
          <volume>52</volume>
          (
          <year>2019</year>
          )
          <fpage>351</fpage>
          -
          <lpage>365</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>org/10</source>
          .1111/j.1365-
          <fpage>2664</fpage>
          .
          <year>2011</year>
          .
          <year>01993</year>
          .x. [26]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Tomblin</surname>
          </string-name>
          , Reinforcement learning in [15]
          <string-name>
            <given-names>D. V. M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <article-title>Uncommon understanding: Devel- young adults with developmental language impair-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>opment and disorders of language comprehension ment</article-title>
          ,
          <source>Brain Lang</source>
          .
          <volume>123</volume>
          (
          <year>2012</year>
          )
          <fpage>154</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          in children, Psychology Press/Erlbaum (UK) Taylor [27]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Ahire</surname>
          </string-name>
          , Nitin,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wagh</surname>
          </string-name>
          , Eeg based identifica-
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>and Francis</source>
          ,
          <year>1997</year>
          .
          <article-title>tion of learning disabilities using machine learning</article-title>
          [16]
          <string-name>
            <surname>H. CLAHSEN</surname>
          </string-name>
          ,
          <article-title>The grammatical characterization of algorithms</article-title>
          , J Neurol Disord (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>developmental dysphasia 27</source>
          (
          <year>1989</year>
          )
          <fpage>897</fpage>
          -
          <lpage>920</lpage>
          . URL: [28]
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Abayomi-Alli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Damaševičius</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Qazi,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          https://doi.org/10.1515/ling.
          <year>1989</year>
          .
          <volume>27</volume>
          .5.897. doi:doi: M.
          <string-name>
            <surname>Adedoyin-Olowe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Misra</surname>
          </string-name>
          , Data augmenta-
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          10.1515/ling.
          <year>1989</year>
          .
          <volume>27</volume>
          .5.897.
          <article-title>tion and deep learning methods in sound classi</article-title>
          [17]
          <string-name>
            <surname>M. L. Rice</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Wexler</surname>
            ,
            <given-names>P. L.</given-names>
          </string-name>
          <string-name>
            <surname>Cleave</surname>
          </string-name>
          ,
          <article-title>Specific language ifcation: A systematic review</article-title>
          ,
          <source>Electronics</source>
          <volume>11</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>impairment as a period of extended optional infini</article-title>
          - URL: https://www.mdpi.com/2079-9292/11/22/3795.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>tive</surname>
          </string-name>
          ,
          <source>Journal of Speech</source>
          , Language, and Hearing doi:
          <volume>10</volume>
          .3390/electronics11223795.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>Research</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>850</fpage>
          -
          <lpage>863</lpage>
          . URL: https://pubs.asha. [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          , E. Tramontana, Using
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          org/doi/abs/10.1044/jshr.3804.850. doi:
          <volume>10</volume>
          .1044/ modularity metrics to assist move method refactor-
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          jshr.
          <volume>3804</volume>
          .850.
          <article-title>ing of large systems</article-title>
          , in: Proceedings - 2013
          <year>7th</year>
          [18]
          <string-name>
            <surname>H. K. van der Lely</surname>
          </string-name>
          , Domain-specific cognitive International Conference on Complex, Intelligent,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>systems: insight from grammatical-sli, Trends and Software Intensive Systems</article-title>
          ,
          <string-name>
            <surname>CISIS</surname>
          </string-name>
          <year>2013</year>
          ,
          <year>2013</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <source>in Cognitive Sciences</source>
          <volume>9</volume>
          (
          <year>2005</year>
          )
          <fpage>53</fpage>
          -
          <lpage>59</lpage>
          . URL: https: p.
          <fpage>529</fpage>
          -
          <lpage>534</lpage>
          . doi:
          <volume>10</volume>
          .1109/CISIS.
          <year>2013</year>
          .
          <volume>96</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          //doi.org/10.1016/j.tics.
          <year>2004</year>
          .
          <volume>12</volume>
          .002. doi:
          <volume>10</volume>
          .1016/ [30]
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , E. Tramontana,
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          j.tics.
          <year>2004</year>
          .
          <volume>12</volume>
          .002.
          <article-title>Real-time cloud-based game management system</article-title>
          [19]
          <string-name>
            <given-names>D. V. M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <article-title>Ten questions about terminology via cuckoo search algorithm</article-title>
          ,
          <source>International Journal</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>for children with unexplained language problems</article-title>
          ,
          <source>of Electronics and Telecommunications</source>
          <volume>61</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Int. J.</given-names>
            <surname>Lang</surname>
          </string-name>
          . Commun. Disord.
          <volume>49</volume>
          (
          <year>2014</year>
          )
          <fpage>381</fpage>
          -
          <lpage>415</lpage>
          .
          <fpage>333</fpage>
          -
          <lpage>338</lpage>
          . doi:
          <volume>10</volume>
          .1515/eletel-2015-
          <volume>0043</volume>
          . [20]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Beitchman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. B.</given-names>
            <surname>Brownlie</surname>
          </string-name>
          , H. Wal- [31]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mushtaq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.-V.</given-names>
            <surname>Tran</surname>
          </string-name>
          , Spectral im-
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>and social outcomes</article-title>
          ,
          <source>J. Am. Acad. Child Adolesc. Applied Acoustics</source>
          <volume>172</volume>
          (
          <year>2021</year>
          )
          <article-title>107581</article-title>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>Psychiatry</source>
          <volume>35</volume>
          (
          <year>1996</year>
          )
          <fpage>815</fpage>
          -
          <lpage>825</lpage>
          . https://www.sciencedirect.com/science/article/pii/ [21]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Stanton-Chapman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Justice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Skibbe</surname>
          </string-name>
          , S0003682X2030685X. doi:https://doi.org/10.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Grant</surname>
          </string-name>
          , Social and behavioral charac- 1016/j.apacoust.
          <year>2020</year>
          .
          <volume>107581</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <article-title>teristics of preschoolers with specific language</article-title>
          [32]
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , E. Tramontana,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>cial Education</source>
          <volume>27</volume>
          (
          <year>2007</year>
          )
          <fpage>98</fpage>
          -
          <lpage>109</lpage>
          . URL: https://doi. cuckoo search algorithm,
          <source>Expert Systems with Ap-</source>
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <source>org/10</source>
          .1177/02711214070270020501. doi:
          <volume>10</volume>
          .1177/ plications 66 (
          <year>2016</year>
          )
          <fpage>20</fpage>
          -
          <lpage>31</lpage>
          . doi:
          <volume>10</volume>
          .1016/j.eswa.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          02711214070270020501.
          <year>2016</year>
          .
          <volume>08</volume>
          .068. [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Seiderer</surname>
          </string-name>
          , E. André, Deep [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dobriban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Invariance reduces
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>hand-crafted features still relevant?, in: Inter- learning and beyond</article-title>
          , ArXiv abs/
          <year>1907</year>
          .10905 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>speech</surname>
          </string-name>
          ,
          <year>2018</year>
          . URL: https://api.semanticscholar.org/ URL: https://api.semanticscholar.org/CorpusID:
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <source>CorpusID:52192644</source>
          .
          <fpage>198895147</fpage>
          . [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tokozume</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harada</surname>
          </string-name>
          , Learning environmental [34]
          <string-name>
            <given-names>C.</given-names>
            <surname>Shorten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Khoshgoftaar</surname>
          </string-name>
          , A survey
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          work, in: 2017
          <source>IEEE International Conference on Journal of Big Data</source>
          <volume>6</volume>
          (
          <year>2019</year>
          )
          <article-title>60</article-title>
          . URL: https://
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>Acoustics</surname>
          </string-name>
          ,
          <article-title>Speech and Signal Processing (ICASSP), doi</article-title>
          .org/10.1186/s40537-019-0197-0. doi:
          <volume>10</volume>
          .1186/
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <year>2017</year>
          , pp.
          <fpage>2721</fpage>
          -
          <lpage>2725</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP.
          <year>2017</year>
          . s40537-
          <fpage>019</fpage>
          -0197-0.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          7952651. [35]
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pappalardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Marsza</surname>
          </string-name>
          [24]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. L.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nam</surname>
          </string-name>
          , Samplecnn: End- lek, D. Polap,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wozniak</surname>
          </string-name>
          , Simplified firefly algo-
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <article-title>to-end deep convolutional neural networks using rithm for 2d image key-points search</article-title>
          , in: IEEE SSCI
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <article-title>very small filters for music classification</article-title>
          ,
          <source>Applied</source>
          <year>2014</year>
          - 2014 IEEE Symposium Series on Computa-
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <source>Sciences</source>
          <volume>8</volume>
          (
          <year>2018</year>
          ). URL: https://www.mdpi.com/ tional Intelligence - CIHLI
          <year>2014</year>
          : 2014 IEEE Sym-
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          2076-
          <fpage>3417</fpage>
          /8/1/150. doi:
          <volume>10</volume>
          .3390/app8010150. posium on Computational Intelligence for Human[25]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Justice</surname>
          </string-name>
          , W.-Y. Ahn,
          <string-name>
            <given-names>J. A. R.</given-names>
            <surname>Logan</surname>
          </string-name>
          , Identifying Like Intelligence, Proceedings,
          <year>2014</year>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <article-title>children with clinical language disorder: An appli-</article-title>
          <source>CIHLI</source>
          .
          <year>2014</year>
          .
          <volume>7013395</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <article-title>cation of machine-learning classification</article-title>
          , J. Learn. [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Petkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Saggese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vento</surname>
          </string-name>
          , Aren:
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <article-title>A deep learning approach for sound event recog-</article-title>
          [45]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Leng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <volume>15</volume>
          (
          <year>2020</year>
          )
          <fpage>3610</fpage>
          -
          <lpage>3624</lpage>
          . doi:
          <volume>10</volume>
          .1109/TIFS.
          <year>2020</year>
          . tems
          <volume>195</volume>
          (
          <year>2020</year>
          )
          <article-title>105600</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.knosys.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          2994740.
          <year>2020</year>
          .
          <volume>105600</volume>
          . [37]
          <string-name>
            <given-names>D.</given-names>
            <surname>Połap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Woźniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Napoli</surname>
          </string-name>
          , E. Tramontana, [46]
          <string-name>
            <given-names>V. M.</given-names>
            <surname>Praseetha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Joby</surname>
          </string-name>
          , Speech emotion recog-
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <article-title>ognize graphic objects?</article-title>
          ,
          <source>Communications in Com- Journal of Speech Technology</source>
          <volume>25</volume>
          (
          <year>2022</year>
          )
          <fpage>783</fpage>
          -
          <lpage>792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <source>puter and Information Science</source>
          <volume>538</volume>
          (
          <year>2015</year>
          )
          <fpage>376</fpage>
          -
          <lpage>387</lpage>
          . URL: https://doi.org/10.1007/s10772-021-09883-3.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <source>doi:10</source>
          .1007/978-3-
          <fpage>319</fpage>
          -24770-0_
          <fpage>33</fpage>
          . doi:
          <volume>10</volume>
          .1007/s10772-021-09883-3. [38]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mushtaq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Su</surname>
          </string-name>
          , Environmental sound [47]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lalitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zakariah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. A.</given-names>
            <surname>Alotaibi</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <source>Applied Acoustics</source>
          <volume>167</volume>
          (
          <year>2020</year>
          )
          <article-title>107389</article-title>
          .
          <string-name>
            <surname>URL</surname>
          </string-name>
          <article-title>: data augmentation</article-title>
          ,
          <source>Applied Acoustics 170</source>
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          https://www.sciencedirect.com/science/article/pii/ (
          <year>2020</year>
          )
          <article-title>107519</article-title>
          . URL: https://www.sciencedirect.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>S0003682X2030493X. doi:https://doi.org/10. com/science/article/pii/S0003682X2030623X.</mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          1016/j.apacoust.
          <year>2020</year>
          .
          <volume>107389</volume>
          . doi:https://doi.org/10.1016/j.apacoust. [39]
          <string-name>
            <given-names>O.</given-names>
            <surname>Novotny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Plchot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Glembek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Cer</surname>
          </string-name>
          -
          <year>2020</year>
          .107519.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          <string-name>
            <surname>nocky</surname>
          </string-name>
          , L. Burget,
          <article-title>Analysis of dnn speech sig-</article-title>
          [48]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Janott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Pandit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          , C. Heiser,
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          <source>Computer Speech and Language</source>
          <volume>58</volume>
          (
          <year>2019</year>
          )
          <article-title>403- approach for snore sounds' excitation localisation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          421. URL: https://www.sciencedirect.com/science/ in: Speech Communication;
          <fpage>12</fpage>
          . ITG Symposium,
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          article/pii/S0885230818303607. doi:https://doi. 2016, pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          <source>org/10</source>
          .1016/j.csl.
          <year>2019</year>
          .
          <volume>06</volume>
          .004. [49]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lachambre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ricaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stempfel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Torrésani</surname>
          </string-name>
          , [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mertes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wiesmeyr</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          Onchis-Moaca, Optimal window
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <article-title>proach for audio data augmentation, in: 2020 IEEE analysis</article-title>
          ,
          <source>in: 2015 17th International Symposium</source>
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          <source>22nd International Workshop on Multimedia Signal on Symbolic and Numeric Algorithms for Scientific</source>
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          <source>Processing (MMSP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:
          <volume>10</volume>
          .1109/
          <string-name>
            <surname>Computing</surname>
          </string-name>
          (SYNASC),
          <year>2015</year>
          , pp.
          <fpage>109</fpage>
          -
          <lpage>112</lpage>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          <string-name>
            <surname>MMSP48831.</surname>
          </string-name>
          <year>2020</year>
          .
          <volume>9287156</volume>
          . 1109/SYNASC.
          <year>2015</year>
          .
          <volume>25</volume>
          . [41]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C.-C. Chiu, [50]
          <string-name>
            <given-names>E.</given-names>
            <surname>Garcia-Ceja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Kvernberg</surname>
          </string-name>
          , J. Tor-
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          <source>speech</source>
          <year>2019</year>
          , ISCA,
          <year>2019</year>
          . URL: https://doi.org/10.
          <string-name>
            <surname>User-Adapted</surname>
            <given-names>Interaction</given-names>
          </string-name>
          30 (
          <year>2020</year>
          )
          <fpage>365</fpage>
          -
          <lpage>393</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          <volume>21437</volume>
          %
          <fpage>2Finterspeech</fpage>
          .
          <fpage>2019</fpage>
          -
          <lpage>2680</lpage>
          . doi:
          <volume>10</volume>
          .21437/ https://doi.org/10.1007/s11257-019-09248-1. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          <string-name>
            <surname>interspeech.</surname>
          </string-name>
          <year>2019</year>
          -
          <volume>2680</volume>
          . 1007/s11257-019-09248-1. [42]
          <string-name>
            <given-names>E. K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-M. Chen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Kumari</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          [51]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ykhlef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ykhlef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chiboub</surname>
          </string-name>
          , Experimental
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <article-title>ternet of things dialog system, Mobile Net- tems: Case studies</article-title>
          , in: 2019 6th International Con-
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          <source>works and Applications</source>
          <volume>27</volume>
          (
          <year>2022</year>
          )
          <fpage>158</fpage>
          -
          <lpage>171</lpage>
          .
          <article-title>URL: ference on Image and Signal Processing</article-title>
          and their
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          https://doi.org/10.1007/s11036-020-01638-9. doi:10.
          <string-name>
            <surname>Applications</surname>
          </string-name>
          (ISPA),
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          <source>1007/s11036-020-01638-9. ISPA48434</source>
          .
          <year>2019</year>
          .
          <volume>8966798</volume>
          . [43]
          <string-name>
            <given-names>T.</given-names>
            <surname>Koike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Schuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yamamoto</surname>
          </string-name>
          , [52]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lalitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zakariah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. A.</given-names>
            <surname>Alotaibi</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          ifcation, in: 2021 43rd
          <string-name>
            <given-names>Annual</given-names>
            <surname>International</surname>
          </string-name>
          Confer- data augmentation,
          <source>Applied Acoustics 170</source>
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>
          <article-title>ence of the IEEE Engineering in Medicine &amp; Biology (</article-title>
          <year>2020</year>
          )
          <article-title>107519</article-title>
          . URL: https://www.sciencedirect.
        </mixed-citation>
      </ref>
      <ref id="ref68">
        <mixed-citation>
          <string-name>
            <surname>Society</surname>
          </string-name>
          (EMBC), IEEE,
          <year>2021</year>
          . com/science/article/pii/S0003682X2030623X. [44]
          <string-name>
            <given-names>V.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rana</surname>
          </string-name>
          , Respiratory diseases recognition doi:https://doi.org/10.1016/j.apacoust.
        </mixed-citation>
      </ref>
      <ref id="ref69">
        <mixed-citation>
          <article-title>through respiratory sound with the help of deep</article-title>
          <year>2020</year>
          .
          <volume>107519</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref70">
        <mixed-citation>
          <article-title>neural network</article-title>
          , in: 2020 4th International Confer- [53] V.
          <string-name>
            <surname>-T. Tran</surname>
            ,
            <given-names>W.-H.</given-names>
          </string-name>
          <string-name>
            <surname>Tsai</surname>
          </string-name>
          , Stethoscope-sensed speech
        </mixed-citation>
      </ref>
      <ref id="ref71">
        <mixed-citation>
          <source>(CINE)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:
          <volume>10</volume>
          .1109/CINE48825.
          <article-title>sparse training data</article-title>
          ,
          <source>IEEE Sensors Journal</source>
          <volume>20</volume>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref72">
        <mixed-citation>
          <year>2020</year>
          .
          <volume>234388</volume>
          .
          <fpage>848</fpage>
          -
          <lpage>859</lpage>
          . doi:
          <volume>10</volume>
          .1109/JSEN.
          <year>2019</year>
          .
          <volume>2945364</volume>
          . [54]
          <string-name>
            <surname>T. A. M. Celin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Nagarajan</surname>
          </string-name>
          , P. Vijayalakshmi, [63]
          <string-name>
            <surname>T. M. Stoumpos</surname>
            <given-names>AI</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kitsios</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Digital</surname>
          </string-name>
          transforma-
        </mixed-citation>
      </ref>
      <ref id="ref73">
        <mixed-citation>
          <article-title>synthesis and multi-resolution feature extraction applications</article-title>
          ,
          <source>Int J Environ Res Public Health</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref74">
        <mixed-citation>
          <article-title>for isolated word dysarthric speech recognition</article-title>
          ,
          <source>doi:10</source>
          .3390/ijerph20043407.
        </mixed-citation>
      </ref>
      <ref id="ref75">
        <mixed-citation>
          <source>IEEE Journal of Selected</source>
          Topics in Signal Process- [64]
          <string-name>
            <given-names>G.</given-names>
            <surname>Gopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Suter-Crazzolara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Toldo</surname>
          </string-name>
          , W. Eber-
        </mixed-citation>
      </ref>
      <ref id="ref76">
        <mixed-citation>
          <source>ing 14</source>
          (
          <year>2020</year>
          )
          <fpage>346</fpage>
          -
          <lpage>354</lpage>
          . doi:
          <volume>10</volume>
          .1109/JSTSP.
          <year>2020</year>
          . hardt, Digital transformation in healthcare -
        </mixed-citation>
      </ref>
      <ref id="ref77">
        <mixed-citation>
          2972161. architectures of present and future information [55]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kobayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Murakawa</surname>
          </string-name>
          ,
          <article-title>Urban sound technologies, Clinical Chemistry</article-title>
          and Labora-
        </mixed-citation>
      </ref>
      <ref id="ref78">
        <mixed-citation>
          <article-title>event classification based on local and global tory Medicine (CCLM) 57 (</article-title>
          <year>2019</year>
          )
          <fpage>328</fpage>
          -
          <lpage>335</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref79">
        <mixed-citation>features aggregation, Applied Acoustics 117 https://doi.org/10.1515/cclm-2018-0658. doi:doi:</mixed-citation>
      </ref>
      <ref id="ref80">
        <mixed-citation>
          (
          <year>2017</year>
          )
          <fpage>246</fpage>
          -
          <lpage>256</lpage>
          . URL: https://www.sciencedirect.
          <volume>10</volume>
          .1515/cclm-2018-0658.
        </mixed-citation>
      </ref>
      <ref id="ref81">
        <mixed-citation>
          com/science/article/pii/S0003682X16302274. [65]
          <article-title>Wave file format specification</article-title>
          , ???? URL:
        </mixed-citation>
      </ref>
      <ref id="ref82">
        <mixed-citation>doi:https://doi.org/10.1016/j.apacoust. https://www.mmsp.ece.mcgill.ca/Documents/</mixed-citation>
      </ref>
      <ref id="ref83">
        <mixed-citation>
          <year>2016</year>
          .
          <volume>08</volume>
          .002, acoustics in Smart Cities. AudioFormats/WAVE/WAVE.html. [56]
          <string-name>
            <given-names>V.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vatanparvar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nemati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Nathan</surname>
          </string-name>
          , [66]
          <string-name>
            <given-names>A.</given-names>
            <surname>Baevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          , M. Auli,
        </mixed-citation>
      </ref>
      <ref id="ref84">
        <mixed-citation>
          <string-name>
            <surname>M. M. Rahman</surname>
          </string-name>
          , J. Kuang,
          <source>CoughGAN: Generating Wav2vec 2</source>
          .
          <article-title>0: A framework for self-supervised</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref85">
        <mixed-citation>
          <source>classification()</source>
          ,
          <source>Annu Int Conf IEEE Eng Med Biol ings of the 34th International Conference on Neural</source>
        </mixed-citation>
      </ref>
      <ref id="ref86">
        <mixed-citation>
          <source>Soc</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>5682</fpage>
          -
          <lpage>5688</lpage>
          .
          <source>Information Processing Systems</source>
          , NIPS'20,
          <string-name>
            <surname>Curran</surname>
            [57]
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Yella</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>Data augmentation using gan for Associates Inc</article-title>
          .,
          <string-name>
            <surname>Red</surname>
            <given-names>Hook</given-names>
          </string-name>
          ,
          <string-name>
            <surname>NY</surname>
          </string-name>
          , USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref87">
        <mixed-citation>
          <article-title>sound based covid 19 diagnosis</article-title>
          , in: 2021 11th IEEE [67]
          <string-name>
            <given-names>V.</given-names>
            <surname>Panayotov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Povey</surname>
          </string-name>
          , S. Khudanpur,
        </mixed-citation>
      </ref>
      <ref id="ref88">
        <mixed-citation>
          <article-title>sition and Advanced Computing Systems: Technol- main audio books</article-title>
          , in: 2015 IEEE International Con-
        </mixed-citation>
      </ref>
      <ref id="ref89">
        <mixed-citation>
          <source>ogy and Applications (IDAACS)</source>
          , volume
          <volume>2</volume>
          ,
          <year>2021</year>
          , ference on Acoustics, Speech and Signal Process-
        </mixed-citation>
      </ref>
      <ref id="ref90">
        <mixed-citation>
          pp.
          <fpage>606</fpage>
          -
          <lpage>609</lpage>
          . doi:
          <volume>10</volume>
          .1109/IDAACS53288.
          <year>2021</year>
          .
          <source>ing (ICASSP)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>5206</fpage>
          -
          <lpage>5210</lpage>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref91">
        <mixed-citation>
          9660990.
          <string-name>
            <surname>ICASSP</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <volume>7178964</volume>
          . [58]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          , Neural network prediction of sound [68]
          <string-name>
            <given-names>T.</given-names>
            <surname>Grósz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Porjazovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Getman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kadiri</surname>
          </string-name>
          , M. Ku-
        </mixed-citation>
      </ref>
      <ref id="ref92">
        <mixed-citation>
          <article-title>quality via domain knowledge-based data augmen- rimo, Wav2vec2-based paralinguistic systems to</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref93">
        <mixed-citation>
          <source>Mechanical Systems and Signal Processing 157 Proceedings of the 30th ACM International Con-</source>
        </mixed-citation>
      </ref>
      <ref id="ref94">
        <mixed-citation>
          (
          <year>2021</year>
          )
          <article-title>107713</article-title>
          . URL: https://www.sciencedirect.com/ ference on Multimedia, MM '22, Association for
        </mixed-citation>
      </ref>
      <ref id="ref95">
        <mixed-citation>
          science/article/pii/S0888327021001084. doi:https: Computing Machinery, New York, NY, USA,
          <year>2022</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref96">
        <mixed-citation>
          //doi.org/10.1016/j.ymssp.
          <year>2021</year>
          .
          <volume>107713</volume>
          . p.
          <fpage>7026</fpage>
          -
          <lpage>7029</lpage>
          . URL: https://doi.org/10.1145/3503161. [59]
          <string-name>
            <given-names>D.</given-names>
            <surname>Koszewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kostek</surname>
          </string-name>
          , Musical instrument tag-
          <volume>3551572</volume>
          . doi:
          <volume>10</volume>
          .1145/3503161.3551572.
        </mixed-citation>
      </ref>
      <ref id="ref97">
        <mixed-citation>
          <article-title>ging using data augmentation and efective noisy</article-title>
          [69]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wumaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          , Automatic
        </mixed-citation>
      </ref>
      <ref id="ref98">
        <mixed-citation>
          <article-title>data processing</article-title>
          ,
          <source>Journal of the Audio Engineer- speech disfluency detection using wav2vec2. 0 for</source>
        </mixed-citation>
      </ref>
      <ref id="ref99">
        <mixed-citation>
          <source>ing Society</source>
          <volume>68</volume>
          (
          <year>2020</year>
          )
          <fpage>57</fpage>
          -
          <lpage>65</lpage>
          . doi:
          <volume>10</volume>
          .17743/jaes. diferent
          <article-title>languages with variable lengths</article-title>
          , Applied
        </mixed-citation>
      </ref>
      <ref id="ref100">
        <mixed-citation>
          <source>2019.0050. Sciences</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>7579</fpage>
          . [60]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Han,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Janott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          , [70]
          <string-name>
            <given-names>I.</given-names>
            <surname>Jordal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamazian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Chourdakis</surname>
          </string-name>
          , An-
        </mixed-citation>
      </ref>
      <ref id="ref101">
        <mixed-citation>
          <source>IEEE J Biomed Health Inform</source>
          <volume>24</volume>
          (
          <year>2019</year>
          )
          <fpage>300</fpage>
          -
          <lpage>310</lpage>
          . K. Choi, MarvinLvn, SolomidHero, T. Alum, [61]
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Seastedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. O</given-names>
            <surname>'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wakida</surname>
          </string-name>
          , iver56/audiomentations: v0.
          <fpage>33</fpage>
          .0 (
          <issue>?</issue>
          ???). doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref102">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. G. F.</given-names>
            <surname>Marcelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Agha-Mir-Salim</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. B.</surname>
          </string-name>
          5281/zenodo.7010042.
        </mixed-citation>
      </ref>
      <ref id="ref103">
        <mixed-citation>
          <string-name>
            <surname>Frigola</surname>
            ,
            <given-names>E. B.</given-names>
          </string-name>
          <string-name>
            <surname>Ndulue</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Marcelo</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <string-name>
            <surname>Celi</surname>
            , Global [71]
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Richardson</surname>
            , I. Cook,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Crane</surname>
          </string-name>
          , D. Dun-
        </mixed-citation>
      </ref>
      <ref id="ref104">
        <mixed-citation>
          <article-title>not less, data</article-title>
          ,
          <source>PLOS Digital Health</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . Grünfeld,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ooms</surname>
          </string-name>
          , Apache Arrow,
          <fpage>ar</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref105">
        <mixed-citation>URL: https://doi.org/10.1371/journal.pdig.0000102. row: Integration to 'Apache' 'Arrow',</mixed-citation>
      </ref>
      <ref id="ref106">
        <mixed-citation>
          <source>doi:10.1371/journal.pdig.0000102</source>
          .
          <year>2023</year>
          . Https://github.com/apache/arrow/, [62]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paul</surname>
          </string-name>
          , L. Maglaras,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ferrag</surname>
          </string-name>
          , I. Almomani, https://arrow.apache.org/docs/r/.
        </mixed-citation>
      </ref>
      <ref id="ref107">
        <mixed-citation>
          <article-title>Digitization of healthcare sector: A study on</article-title>
          [72]
          <string-name>
            <given-names>B.</given-names>
            <surname>Truax</surname>
          </string-name>
          ,
          <article-title>Handbook for acoustic ecology</article-title>
          , Leonardo
        </mixed-citation>
      </ref>
      <ref id="ref108">
        <mixed-citation>
          <article-title>privacy and security concerns</article-title>
          ,
          <source>ICT Express 9</source>
          (
          <year>2023</year>
          )
          <volume>13</volume>
          (
          <year>1980</year>
          )
          <fpage>83</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref109">
        <mixed-citation>
          571-
          <fpage>588</fpage>
          . URL: https://www.sciencedirect.com/ [73]
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Bracewell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Bracewell</surname>
          </string-name>
          ,
          <source>The Fourier trans-</source>
        </mixed-citation>
      </ref>
      <ref id="ref110">
        <mixed-citation>
          science/article/pii/S2405959523000243. doi:https: form
          <article-title>and its applications</article-title>
          , volume
          <volume>31999</volume>
          ,
          <fpage>McGraw</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref111">
        <mixed-citation>
          //doi.org/10.1016/j.icte.
          <year>2023</year>
          .
          <volume>02</volume>
          .007. Hill New York,
          <year>1986</year>
          . [74]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cortes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Support-vector networks</article-title>
          , [87]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Piotr</surname>
          </string-name>
          <string-name>
            <surname>Dabkowski</surname>
          </string-name>
          , Eleven labs,
          <year>2023</year>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref112">
        <mixed-citation>
          <source>Machine learning 20</source>
          (
          <year>1995</year>
          )
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          . //elevenlabs.io/voice-lab. [75]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Deep resid-
        </mixed-citation>
      </ref>
      <ref id="ref113">
        <mixed-citation>
          <article-title>ual learning for image recognition</article-title>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref114">
        <mixed-citation>
          <source>arXiv:1512</source>
          .
          <fpage>03385</fpage>
          . [76]
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Clément</surname>
          </string-name>
          <string-name>
            <surname>Delangue</surname>
          </string-name>
          , Julien Chaumond,
        </mixed-citation>
      </ref>
      <ref id="ref115">
        <mixed-citation>
          <string-name>
            <surname>Wav2vec2</surname>
          </string-name>
          ,
          <year>2022</year>
          . URL: https://huggingface.co/docs/
        </mixed-citation>
      </ref>
      <ref id="ref116">
        <mixed-citation>
          transformers/model_doc/wav2vec2. [77]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref117">
        <mixed-citation>
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref118">
        <mixed-citation>
          2825-
          <fpage>2830</fpage>
          . [78]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Justice</surname>
          </string-name>
          , W.-Y. Ahn,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Logan</surname>
          </string-name>
          , Identifying
        </mixed-citation>
      </ref>
      <ref id="ref119">
        <mixed-citation>
          <source>of learning disabilities 52</source>
          (
          <year>2019</year>
          )
          <fpage>351</fpage>
          -
          <lpage>365</lpage>
          . [79]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gaspers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Thiele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Foltz</surname>
          </string-name>
          , P. Sten-
        </mixed-citation>
      </ref>
      <ref id="ref120">
        <mixed-citation>
          <article-title>learning techniques</article-title>
          ,
          <source>in: Proceedings of the 2nd</source>
        </mixed-citation>
      </ref>
      <ref id="ref121">
        <mixed-citation>
          <string-name>
            <surname>sium</surname>
          </string-name>
          ,
          <year>2012</year>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>218</lpage>
          . [80]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kanimozhiselvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santhiya</surname>
          </string-name>
          , Communication
        </mixed-citation>
      </ref>
      <ref id="ref122">
        <mixed-citation>2021 Third International Conference on Intelligent</mixed-citation>
      </ref>
      <ref id="ref123">
        <mixed-citation>
          <article-title>Networks (ICICV)</article-title>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>789</fpage>
          -
          <lpage>793</lpage>
          . [81]
          <string-name>
            <given-names>A.</given-names>
            <surname>Caliskan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Bryson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          , Semantics
        </mixed-citation>
      </ref>
      <ref id="ref124">
        <mixed-citation>
          <article-title>tain human-like biases</article-title>
          ,
          <source>Science</source>
          <volume>356</volume>
          (
          <year>2017</year>
          )
          <fpage>183</fpage>
          -
          <lpage>186</lpage>
          . [82]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rozado</surname>
          </string-name>
          ,
          <article-title>The political biases of ChatGPT, Soc</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref125">
        <mixed-citation>
          <string-name>
            <surname>Sci.</surname>
          </string-name>
          (Basel)
          <volume>12</volume>
          (
          <year>2023</year>
          )
          <fpage>148</fpage>
          . [83]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Gianfrancesco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tamang</surname>
          </string-name>
          , J. Yazdany,
        </mixed-citation>
      </ref>
      <ref id="ref126">
        <mixed-citation>
          <string-name>
            <given-names>JAMA</given-names>
            <surname>Intern</surname>
          </string-name>
          .
          <source>Med</source>
          .
          <volume>178</volume>
          (
          <year>2018</year>
          )
          <fpage>1544</fpage>
          -
          <lpage>1547</lpage>
          . [84]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ibrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zarifa</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. D.</surname>
          </string-name>
          Mor-
        </mixed-citation>
      </ref>
      <ref id="ref127">
        <mixed-citation>
          <string-name>
            <surname>care</surname>
          </string-name>
          ,
          <source>The Lancet Digital Health</source>
          <volume>3</volume>
          (
          <year>2021</year>
          )
          <fpage>e260</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref128">
        <mixed-citation>e265. URL: https://www.sciencedirect.com/science/</mixed-citation>
      </ref>
      <ref id="ref129">
        <mixed-citation>article/pii/S2589750020303174. doi:https://doi.</mixed-citation>
      </ref>
      <ref id="ref130">
        <mixed-citation>
          <source>org/10</source>
          .1016/S2589-
          <volume>7500</volume>
          (
          <issue>20</issue>
          )
          <fpage>30317</fpage>
          -
          <lpage>4</lpage>
          . [85]
          <string-name>
            <given-names>P.</given-names>
            <surname>Grill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tučková</surname>
          </string-name>
          , Speech databases of typical
        </mixed-citation>
      </ref>
      <ref id="ref131">
        <mixed-citation>
          <article-title>children and children with SLI</article-title>
          ,
          <source>PLoS One</source>
          <volume>11</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref132">
        <mixed-citation>
          e0150365. [86]
          <string-name>
            <given-names>A.</given-names>
            <surname>McAllister</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sjölander</surname>
          </string-name>
          ,
          <article-title>Children's voice and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref133">
        <mixed-citation>
          <string-name>
            <surname>guage</surname>
          </string-name>
          , volume
          <volume>34</volume>
          , Thieme Medical Publishers,
          <year>2013</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref134">
        <mixed-citation>
          pp.
          <fpage>071</fpage>
          -
          <lpage>079</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>