Speech and Language Impairment Detection by Means of AI-Driven Audio-Based Techniques
Luca Corvitto¹, Lorenzo Faiella¹, Christian Napoli¹, Adriano Puglisi¹ and Samuele Russo²
¹ Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, Roma, 00185, Italy
² Department of Psychology, Sapienza University of Rome, Via dei Marsi 78, Roma, 00185, Italy


Abstract
Speech and Language Impairments (SLI) affect a large and heterogeneous group of people. With our work, we propose a
novel, easy, and immediate detection tool to help diagnose people who suffer from SLI using speech audio signals, along with
a new dataset containing English speakers affected by SLI. In this work, we experiment with feature extraction methods
such as Mel Spectrogram and wav2vec 2.0, as well as classification methods such as SVM, CNN, and linear neural networks.
We also work on audio data augmentation to overcome the very common limitations imposed by data scarcity in the
medical field. The overall results indicate that the wav2vec 2.0 feature extractor, paired with a linear classifier, provides the
best performance, with an accuracy of over 96%.

Keywords
SLI, AI, audio, healthcare, speech, learning disease, feature extraction, data augmentation


ICYRIME 2024: 9th International Conference of Yearly Reports on Informatics, Mathematics, and Engineering. Catania, July 29-August 1, 2024
corvitto.1835668@studenti.uniroma1.it (L. Corvitto); faiella.1835950@studenti.uniroma1.it (L. Faiella); cnapoli@diag.uniroma1.it (C. Napoli); puglisi@diag.uniroma1.it (A. Puglisi); samuele.russo@uniroma1.it (S. Russo)
ORCID: 0000-0002-3336-5853 (C. Napoli); 0009-0007-6307-7194 (A. Puglisi); 0000-0002-1846-9996 (S. Russo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
Luca Corvitto et al. CEUR Workshop Proceedings 19–31

1. Introduction

The rapid development of Artificial Intelligence (AI) techniques across a broad range of scientific fields has
helped solve real-life problems. In particular, recent advancements have revolutionized a wide variety of areas
such as Natural Language Processing (NLP) [1], computer vision [2, 3], robotics, and many more. Due to the huge
volume of medical data being generated worldwide, there is a clear need for efficient use of this information
to benefit health sectors around the world [4, 5]. The medical community has taken strong notice of the
potential of these new AI technologies. Machine learning (ML) thrives in areas where large amounts of data are
available; therefore, ML is one of the essential and most effective tools for analyzing highly complex medical
data [6]. For example, analyzing medical data originating from disease diagnosis with the aid of these tools
could be considerably more cost-efficient. In healthcare, it is also vital that diseases are detected early on
during diagnosis and prognosis. The success of these AI methods has also spread across other domains, including
speech recognition and music recommendation [7]. Due to the relevance of such systems in our day-to-day lives,
there is an increasing need for effective and efficient audio classification systems. Automatic classification
technologies are widely applied in voice assistants [8], chatbots [9], smart safety devices [10, 11], and in
different real-world environments [12, 13, 14].
   Our project aims to reconcile these two worlds and design a Deep Learning (DL) model that can detect, from
a given audio input, whether the speaker could be affected by a speech and language impairment. Individuals with
a Speech and Language Impairment (SLI) generally represent a heterogeneous group of people who, despite normal
hearing, normal nonverbal intelligence, adequate social functioning, and no obvious signs of brain injury, have
significant difficulty in learning languages [15]. One of the defining characteristics of SLI is speech
disfluency, more specifically impaired acquisition of pattern-based components in language, such as morphology,
syntax, and some aspects of phonology such as stuttering. This commonly used definition led to early hypotheses
regarding the etiology of SLI, namely that an impaired language-specific learning mechanism underlies language
development and disorders [16, 17, 18]. This disorder is deemed “primary” or “specific” when there is no clear
explanation for these lags in language skills; a defining characteristic of primary language disorder is that
its cause is unknown [19]. Language disorders are also linked to a heightened risk for psychiatric concerns,
attentional difficulties, social-behavioral problems, and learning disabilities [20, 21]. Many current trends in
audio signal processing rely on data-driven machine learning approaches to achieve state-of-the-art results
[22, 23, 24]. However, the quantity and quality of available data heavily influence the performance achieved
on a task. Depending on the specific task, as in our case study, such data can often be hard to obtain and
costly to label, particularly in
the audio domain. As a consequence, researchers often have to deal with datasets of insufficient size or
quality. Usually, diagnosis of this type of problem is carried out by human experts, with special in-loco
tests [25, 26] or with the aid of tools such as the electroencephalogram (EEG) [27]. We want to design an easy
and accessible model that can detect whether a person could be affected by an SLI without having to go through
complex and time-consuming procedures. Such a model could also be implemented in robots from a human-robot
interaction (HRI) perspective, allowing the machine to detect people with SLI and change its behavior and form
of interaction accordingly.
   This study proposes an analysis of a novel, yet simple, approach that uses exclusively audio recordings for
SLI detection. Specifically, in Section 2 we explore the current literature, and in Section 3 we discuss the
problems faced in collecting our data and how we handled them. In Section 4 we go through an analysis of the
techniques and models used to perform the detection, in Section 5 we present the trials and the results
obtained from them, and in Section 6 we discuss the limits of our approach. We finally draw our conclusions in
Section 7.

2. Related Works

In the ever-evolving landscape of computer science and artificial intelligence, the domains of audio data
augmentation and feature extraction are undergoing very rapid changes thanks to groundbreaking research and
advancements. In the following sections, we delve into the history and explore the state of the art of these
fields.

2.1. Audio Data Augmentation

One of the most important challenges in developing an efficient and effective audio classification system is
accessing a large and well-annotated dataset. One of the main obstacles in developing sound classifiers is the
lack of a sufficient quantity of labeled data. This is due to the following main reasons: class imbalance, data
privacy issues, time constraints involved in data collection, high dependency on expertise for effective
annotation, etc. [28, 29, 30]. Data Augmentation (DA) is defined as the creation of new data by adding
deformations that increase the variety of the data without changing its semantic value. It is well known that
DA can improve the algorithm’s performance, tackle the issue of overfitting [31, 32], and improve the
generalization ability of Deep Neural Networks (DNN); this happens because DA averages over the orbits of the
group that keeps the data distribution invariant, which leads to variance reduction [33]. DA is key when dealing
with problems regarding audio signals because the Convolutional Neural Network (CNN) is the most widely used
model in audio applications and, when faced with small datasets, a CNN’s capacity for information retention
becomes a flaw: the models memorize the training data and lose performance on new data [34, 35]. In addition to
increasing generalization capabilities, the augmentation of data also allows the designed system to improve data
significance, regardless of the available data samples [36, 37]. These strategies include methods that operate
on raw audio signals, techniques applied to samples converted into spectrograms, and even more complex
approaches such as interpolation and nonlinear mixing on the spectrum. We will now list and briefly explain the
most used audio augmentation techniques.
Pitch Shifting. The tone of each audio signal in the dataset is lowered or raised by a factor while preserving
its duration.
Time Stretching. The audio sample is slowed down or sped up by a ratio without altering the pitch drastically.
Time Shifting. The signal is shifted in time to the left or to the right by a random factor or by a
predetermined amount.
Volume Adjustment. The volume of the audio file is altered: the loudness is changed, or sometimes a dynamic
range compression is applied.
Noise addition. Noise is introduced into the samples; besides simple random Gaussian noise, there are many
types of noise, such as white noise [38], babble noise, static noise [39], factory noise, etc.
SpeedUp. The signal is resampled at a preset sampling rate and later returned to the original sampling rate,
resulting in a speed change.
Filtering. Several kinds of filters are applied to the input audio. The most common filters are band-pass,
band-stop, high-pass, high-shelf, low-pass, low-shelf, and peaking filters.
   This topic is so important that researchers have also designed methods that generate entirely new samples.
For example, with the aid of a Generative Adversarial Network (GAN), the authors of [40] created new variants of
the audio samples that already existed in their dataset and then used an evolutionary algorithm to search the
input domain and select the best generated samples; in this way they were able to generate audio in a
controlled manner that contributed to an improvement in classification performance on the original task. One
very recent DA method proposed by Google is SpecAugment [41]; in this method, the two-dimensional spectrum
diagram is treated as an image with time on the horizontal axis and frequency on the vertical axis.
Encoder-decoder networks are becoming very popular in fields other than NLP because they can convert a
high-dimensional input into a lower-dimensional vector in latent space; researchers in [42] have experimented
with a Long Short
Term Memory (LSTM) based auto-encoder to produce artificial data.

2.2. Audio Feature Extraction and Models

It should be noted that data augmentation is not the only way to reduce overfitting and improve the
generalization ability of DL models. Model structure optimization, transfer learning, and One-shot and Zero-shot
learning are also known strategies that deal with overfitting from different aspects. We will now focus on the
most common processing flow of audio classification: preprocessing the original audio data, feature extraction,
and feeding the features into the DL model. Audio signals have very high dimensionality: thousands of floating
point values are required to represent a short audio signal, raising the need for dimensionality reduction and
feature extraction methods. How well or poorly a model performs is also determined by the choice of features
used; feature representation is crucial to improve the performance of learning algorithms in the sound
classification task. One of the first features that comes to mind when thinking of an audio signal is the
spectrogram; its characteristics have been widely used by previous researchers in different domains of sound
classification, such as heartbeat sounds to detect heart diseases [43]. Another method used to extract features
implements the Mel-Frequency Cepstrum (MFC), which is a representation of the short-term power spectrum of a
sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency; the
Mel-Frequency Cepstral Coefficients (MFCC) were successful in representing sounds for the detection of
respiratory diseases [44]. Some methods that also use the MFC are the log-mel [45], mel filter bank energy [46],
inverted MFCC [47], and many more. Although mel spectrogram and MFCC are commonly used, people also implement
bag of audio words [48], Discrete Gabor Transform (DGT) audio image representation [49], Zero-Crossing Rate
(ZCR), entropy of energy, spectral centroid, spectral spread, spectral entropy [50], and so on.
   Classification is a common task in ML and pattern recognition. DL methods applied in these tasks, such as CNN
models, often do not perform as well as more traditional ML methods such as random forest, AdaBoost, etc.,
especially on small datasets [51]. On the other hand, typical ML algorithms, such as ensemble classifiers, have
been shown to learn features better and adapt more, with improved generalization abilities even in the case of
small and imbalanced datasets. Over the past years, different ML algorithms have been used for detecting sound
events and medical sounds, and the achieved results were of great significance. Classifiers such as the Support
Vector Machine (SVM) have been shown to be very effective in sound classification tasks [52]; MultiLayer
Perceptrons (MLP) were also very useful in person identification using speech and breath sounds [53], as were
Hidden Markov Models (HMM) [54], logistic regression and linear discriminant analysis [55], and others. Some
studies exploited the effectiveness of multiple simpler methods with ensemble methods such as random forests
[56, 51], XGBoost [57], and so on. Unfortunately, considering the complexity of sound and the need to sometimes
train an extremely sensitive classifier that can identify different representations of sound features,
traditional ML still suffers in these kinds of tasks from having less complex models. In this case, the choice
of DL methods has been proven to be more efficient. DL methods differ from traditional ones because they can
extract meaningful features from data through the application of a hierarchical structure [58]. CNNs were able
to achieve significant and more accurate training results [59]. People tried to combine the best of these two
worlds by implementing hybrid methods; for example, researchers merged an SVM and a GRU-RNN in [60].

3. Dataset

In the medical field, in particular regarding specific problems such as the one presented in this paper, data
is not always freely available or available at all. This is mostly due to privacy concerns [61, 62]. Another
important reason, which is also related in some ways to privacy [62], lies in the overall low level of
digitization of healthcare information [63]; in fact, according to Gopal G. et al. [64], healthcare has the
lowest level of digital innovation compared to other industries, such as media, finance, insurance, and retail,
contributing to limited growth of labor productivity. In addition to this, it is also worth noting that not
every dataset containing the desired medical information is also in the desired format, in which case the only
remaining option is to create an entirely new dataset from scratch, which is what we did.

3.1. Data Collection

The process of collecting audio data is a pivotal phase in this research. For our dataset, we aimed to collect a
sufficient amount of pure, non-multimodal audio data in a waveform representation. Audio data can be stored in
various formats, each with its own characteristics, trade-offs, and use cases. Common audio formats include
Waveform Audio File Format (WAV), MPEG-1 Audio Layer 3 (MP3), Free Lossless Audio Codec (FLAC), and more. These
formats differ in terms of compression, quality, and compatibility. For this study, we opt for the WAV format
[65], which is an uncompressed audio file format, developed by IBM and Microsoft, that efficiently stores audio
data in a waveform representation without any loss of information. Thanks to its characteristics, which
guarantee the highest amount of information for an audio signal, WAV is the audio format used as input by
wav2vec 2.0 [66], a state-of-the-art speech model developed by the Facebook AI Research group (FAIR) that is one
of the models used in this work.
   The data collection process began with the identification of audio samples containing English speakers
affected by Speech and Language Impairment (SLI) originating from different conditions. This diverse dataset
was intentionally curated to optimize the performance of SLI detection. By including speakers with a range of
impairments, the model is exposed to a broad spectrum of speech patterns and anomalies, thereby enhancing its
ability to accurately detect SLI in real-world applications. To source such data, we turned to YouTube, a vast
and user-friendly repository of video and audio content. The videos found were then converted into audio files
in WAV format using an online converter.
   We finally paired the collected data with a subset of the LibriSpeech dataset [67] containing healthy English
speakers only.

3.2. Data Preprocessing

To feed the waveform signals to the model, we needed to ensure that they were appropriately prepared and
processed. Effective data preprocessing is fundamental to enhancing the model’s performance, as it directly
impacts the model’s ability to extract meaningful patterns and insights from raw input data. This was performed
in different steps. Firstly, we identified different time windows from each audio file to cut out unnecessary
information from them, keeping just human speech sounds (with or without background noise). After that, we
analyzed the time windows by dividing each one of them into smaller ones, each containing the speech of one
single person. Even though different tools are available to detect human speech, considering the scarcity of
data we face, we decided to perform this step manually to be sure that the quality of our dataset is not
affected.
   Secondly, we split the time windows that we obtained into 3-second clips. We chose this length as a trade-off
between a length sufficient to capture fluency information and a brief duration. Our decision was also based on
the standard approach used in state-of-the-art works with wav2vec 2.0 in these kinds of tasks [68, 69]. Then
these clips were saved in two different subsets, creating the Train and the Test set, ensuring that no speaker
appears in both sets.
   Finally, the acquired data was augmented to increase its size. We applied the following audio augmentation
techniques: Time shifting, Time stretching, Pitch shifting, and Noise addition, using Gaussian noise. To do so,
we used the Python library audiomentations [70]. For Time shifting, we resampled the time windows, shifting the
starting time further by 1.5 seconds; for Time stretching, we slowed down the speed of the audios by a ratio of
0.8; for Pitch shifting, we both lowered and raised the pitch tone by a value of 3, obtaining two additional
clips for each clip; finally, for Noise addition, we added Gaussian noise with an amplitude of 0.01 m. Audio
waveforms before and after noise addition are shown in Fig. 1 and Fig. 2. All the augmentation techniques were
applied on the original audio; Time shifting was directly applied on the time windows, while the other ones
were applied on the initial 3-second clips.

Figure 1: Audio sample from our dataset

Figure 2: Audio sample with noise from our dataset

   The number of samples in the created dataset is shown in Table 1, while in Table 2 we collect the audio data
augmentation techniques used and their respective parameters.

Table 1
Dataset samples

                         Train                Test
                     SLI    Healthy       SLI    Healthy
Non-augmented       1010     1010         124     125
Time-shifted         893     1010         104     125
Time-stretched      1010     1010         124     125
Pitch-shifted       2020     2020         248     250
Noise-addition      1010     1010         124     125
Total               5943     6060         724     750
Dataset                 12003                 1474

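
The clip cutting, time shifting, and noise addition described in Section 3.2 can be illustrated with a minimal sketch. This is not the authors' code (they report using the audiomentations library); it is a standard-library stand-in under stated assumptions: a 16 kHz sampling rate, waveforms stored as lists of floats in [-1, 1], the noise "amplitude" interpreted as a standard deviation, and the hypothetical helper names `split_into_clips` and `add_gaussian_noise`.

```python
import random

SAMPLE_RATE = 16_000  # assumption: the sampling rate is not stated in the text


def split_into_clips(samples, sample_rate=SAMPLE_RATE, clip_seconds=3.0, offset_seconds=0.0):
    """Cut a waveform into consecutive fixed-length clips.

    A trailing remainder shorter than one clip is dropped.
    """
    clip_len = int(round(clip_seconds * sample_rate))
    start = int(round(offset_seconds * sample_rate))
    clips = []
    while start + clip_len <= len(samples):
        clips.append(samples[start:start + clip_len])
        start += clip_len
    return clips


def add_gaussian_noise(clip, amplitude=0.01, seed=None):
    """Add zero-mean Gaussian noise; `amplitude` is used as the standard deviation (assumption)."""
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, amplitude) for x in clip]


# A silent 10-second time window stands in for real speech.
window = [0.0] * (10 * SAMPLE_RATE)
original_clips = split_into_clips(window)                      # 3-second clips
shifted_clips = split_into_clips(window, offset_seconds=1.5)   # time shifting: re-cut with a 1.5 s offset
noisy_clip = add_gaussian_noise(original_clips[0], amplitude=0.01, seed=0)
```

Under these assumptions, re-cutting a window with a 1.5-second offset can yield fewer full clips than cutting it from the start, which may be why the time-shifted SLI counts in Table 1 are lower than the non-augmented ones.
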
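
The requirement that no speaker appears in both the Train and the Test set can be met with a group-aware split. The sketch below is one possible way to do it, not the procedure the authors describe in detail: the function name `speaker_disjoint_split`, the `(speaker_id, clip_name)` pair layout, the `test_fraction` default, and the speaker identifiers are all hypothetical.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(clips, test_fraction=0.1, seed=0):
    """Split (speaker_id, clip_name) pairs so that no speaker is in both sets."""
    by_speaker = defaultdict(list)
    for speaker_id, clip_name in clips:
        by_speaker[speaker_id].append(clip_name)

    # Shuffle speakers, not clips, so that a speaker is never divided.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    target = test_fraction * len(clips)
    train, test = [], []
    for speaker in speakers:
        # Fill the test set speaker by speaker until it reaches the target size.
        bucket = test if len(test) < target else train
        bucket.extend((speaker, name) for name in by_speaker[speaker])
    return train, test


# Hypothetical example: 10 speakers with 10 clips each.
example_clips = [(f"spk{i % 10}", f"clip{i}") for i in range(100)]
train_set, test_set = speaker_disjoint_split(example_clips, test_fraction=0.2)
```

Because whole speakers are assigned to one side, the test set can overshoot the requested fraction when speakers contribute many clips; splitting before augmentation also keeps augmented copies of a clip on the same side as their original.
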


Table 2                                                         4.1. Log Mel Spectrogram
Pamateres used for augmentation methods
                                                             The way humans hear frequencies in sound is known as
        Augmentations            Parameters                  pitch, it is a subjective impression of the frequency. They
         Time-shifting        shift = +1.5 seconds           do not perceive frequencies linearly, on the contrary, hu-
         Time-stretching            ratio = 0.8              mans are more sensitive to differences between lower
         Pitch-shifting         shift = ± 3 tones            frequencies than higher ones. For example, the differ-
         Noise-addition     amplitude = 0.01 meters          ence between audios of frequency 100𝐻𝑧 and 200𝐻𝑧
                                                             is way bigger than 1000𝐻𝑧 and 1100𝐻𝑧, even though
                                                             the absolute difference is the same amount. Humans per-
                                                             ceive sounds on a logarithmic scale rather than a linear
                                                             scale. The Mel Scale [72] was developed to take this into
                                                             account by conducting experiments with a large number
of listeners. It is a scale of pitches in which each unit is judged by listeners to be equally distant in pitch from the next. The human perception of the amplitude of a sound is called loudness and, like frequency, loudness is heard logarithmically rather than linearly. The Decibel (dB) scale is used to measure sound level: for example, a sound at 20 dB carries 10 times the acoustic power of a sound at 10 dB. To deal with sound realistically, we therefore need logarithmic scales in our data: the Mel scale for frequencies and the Decibel scale for amplitudes.

Spectrograms are generated from sound signals using Fourier Transforms. A Fourier Transform (FT) [73] is a mathematical formula that decomposes a signal into its constituent frequencies and gives the amplitude of each frequency present in the signal. In other words, an FT converts the signal from the time domain into the frequency domain, and the result is called a spectrum. A spectrogram is obtained by dividing the sound signal into smaller time segments, applying the FT to each segment, and combining the resulting spectra in a single plot. A Mel Spectrogram makes two important changes relative to a regular spectrogram that plots frequency vs time: it uses the Mel scale instead of frequency on the y-axis, and it uses the Decibel scale instead of amplitude to indicate color. Fig. 3 shows a normalized version of the Mel spectrogram of one of the audio clips in the dataset.

Figure 3: Log Mel Spectrogram of a sample from our dataset

3.3. Data Management

The dataset contains audio files in the WAV format, so it inherits both the advantages and the drawbacks of that format. The complete dataset, which comprises both original and augmented data, was too large to be loaded on the fly from the original files. To overcome this problem, we loaded the data in batches and concatenated them into subsets that were saved in the .arrow format [71], a columnar memory format for flat and hierarchical data, organized for efficient analytic operations. In this way, large data can be saved, loaded, and processed without running into memory problems.
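The batch-wise loading described in Section 3.3 can be sketched as follows. This is only an illustration under stated assumptions: `load_clip` is a hypothetical placeholder for a real WAV decoder, the function and file names are invented for the example, and the standard-library `pickle` module stands in for the Arrow writer so that the sketch has no external dependencies. It is not the code used in this work.

```python
import pickle
from itertools import islice
from pathlib import Path


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def load_clip(path):
    """Hypothetical loader: a real one would decode the WAV file at `path`."""
    return {"path": path, "samples": [0.0] * 4}


def flush_subset(subset, out_dir, index):
    """Write one subset to disk (pickle is an assumed stand-in for .arrow)."""
    target = Path(out_dir) / f"subset_{index:03d}.pkl"
    with target.open("wb") as handle:
        pickle.dump(subset, handle)
    return target


def write_subsets(paths, out_dir, batch_size, batches_per_subset):
    """Load clips batch by batch and save each group of batches as one file."""
    subset, written = [], []
    for i, batch in enumerate(batched(paths, batch_size), start=1):
        # Only the clips of the current subset are held in memory.
        subset.extend(load_clip(p) for p in batch)
        if i % batches_per_subset == 0:
            written.append(flush_subset(subset, out_dir, len(written)))
            subset = []
    if subset:  # leftover clips that did not fill a whole subset
        written.append(flush_subset(subset, out_dir, len(written)))
    return written
```

With 10 clips, a batch size of 2, and 2 batches per subset, the sketch writes three files holding 4, 4, and 2 clips, so memory use is bounded by the size of one subset rather than by the whole dataset.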

4. Models and Techniques Used

The best way to approach a problem is to understand in depth every factor that influences it and how its key components work; only then can one tackle it and try to capture its essence. In the following subsections, we present a brief description of the techniques we used and the models we implemented.

4.2. Wav2vec 2.0

Wav2vec 2.0 [66] learns powerful representations from speech in a way that mimics the human learning experience: from the early stages of their lives, people start comprehending language without labeled data, i.e. kids learn from listening to the adults around them. It is also able to outperform state-of-the-art models while using 100 times less labeled data, thus demonstrating the feasibility of training without huge amounts of labeled data, which is very hard to achieve in a field dealing with a complex medium such as audio.




Figure 4: Wav2vec 2.0 pipeline

The model is shown in Fig. 4; its components are described next.

Multi-layer convolutional feature encoder. It consists of several blocks containing a temporal convolution followed by layer normalization and a GELU activation function.

Context network. It follows the Transformer architecture; however, unlike a standard Transformer, which uses fixed positional embeddings, it uses a convolutional layer that acts as a relative positional embedding. The output of the convolution, followed by a GELU, is added to the inputs, and then a layer normalization is applied.

Quantization module. It discretizes the output of the feature encoder to a finite set of speech representations via product quantization. Product quantization amounts to choosing quantized representations from multiple codebooks and concatenating them. The Gumbel softmax enables choosing discrete codebook entries in a fully differentiable way.

The feature encoder $f: X \to Z$ takes as input the raw waveform $X$ and outputs the latent speech representations $z_1, \ldots, z_T$ for $T$ time steps. These are then fed to the transformer $g: Z \to C$, which captures information from the entire sequence and outputs context representations. The output of the feature encoder is also discretized to $q_t$ with a quantization module to represent the targets in the self-supervised objective. During the model's pre-training, part of the latent speech representations generated by the feature encoder is masked, and the model then learns the representations of speech audio by solving a contrastive task, which requires identifying the true quantized latent speech representation for a masked time step within a set of distractors. After pre-training on unlabeled speech, the model is fine-tuned on labeled data with a Connectionist Temporal Classification (CTC) loss.

4.3. Classification Methods

Classification is the part that stands out the most in an entire model because it outputs the labels that are used to compute the evaluation metrics. Even though it is the most noticeable part of a model, in our case the classifiers are just the final piece of the puzzle, since most of the work is done in the previous steps of the pipeline; still, we want to pay some attention to the types of classifiers we used in our work.

Support Vector Machine (SVM) [74] is one of the first algorithms learned by every ML practitioner; it is simple, yet it can achieve excellent results, especially with small amounts of data, where other ML algorithms tend to have some difficulties. The objective of the support vector machine algorithm is to find a hyperplane in an $N$-dimensional space (where $N$ is the number of features) that distinctly classifies the data points. To separate the two classes of data points, many possible hyperplanes could be chosen. SVM finds the hyperplane that has the maximum margin, i.e. the maximum distance to the nearest data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence. The biggest difficulty encountered when testing the SVM was that, even with low amounts of data, the model had memory issues, since audio features are extremely large and come with multiple classes, while SVM excels with data that has fewer classes, thus making it hard to fully exploit SVM's strengths.

One of the best and most efficient methods to generate labels from an ML model is to add a linear layer at the end of the pipeline. That is what we did with our wav2vec 2.0 feature extractor: we included a linear classifier $f(x_i, W, b) = W \cdot x_i + b$ and trained its weights to output two types of labels, one for people affected by SLI and one for the others.

ResNet34 is a well-known residual neural network that was pre-trained on ImageNet-1k and released by Microsoft [75]. Thanks to residual learning and skip connections, this type of model can be much deeper than ordinary convolutional neural networks. We decided to fine-tune this model with the log mel spectrogram features extracted from our dataset.

5. Results

In this section, we describe in detail the different architectures that we tested and then comment on the obtained results.

5.1. Architectures

Our first approach was to use the wav2vec 2.0 model, in particular the pre-trained wav2vec2-base model from HuggingFace [76], to perform Feature Extraction on the pre-processed non-augmented dataset, and then to use an SVM, the Support Vector Classifier (SVC) model from scikit-learn [77], to perform the classification, taking the extracted features as input. As explained in Section 4, wav2vec 2.0 takes a raw waveform signal as input, 3-second clips in WAV format in our case, and extracts audio features from them according to what it learned during its previous training. The extracted features were then standardized using the StandardScaler from scikit-learn, removing the mean and scaling them to unit variance. The standardization of a dataset is a common requirement for many ML estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance). Finally, we fitted the SVM using a linear kernel.

Using the SVM model as a classifier was our first attempt to cope with the limited number of samples at our disposal. Once the dataset was augmented, we ceased to use the SVM due to its intrinsic limitations when working with large datasets, and we opted for a complete DL approach.

For our second architecture, we substituted the classifier head with a simple Fully Connected (FC), or linear, layer, keeping the wav2vec 2.0 model to perform the Feature Extraction, this time on the augmented dataset. We trained the model for 5 epochs through the Trainer class by HuggingFace with batches of 32 samples each, setting the learning rate to 2e-5 after a warm-up period at a ratio of 0.1 and decreasing its value linearly until the end of the training.

The last architecture tested was a CNN, more precisely ResNet34, which received as input the log mel spectrogram of each audio clip and generated as output the label of the given clip. All the procedures to extract the spectrogram were carried out with the librosa library: first the sample was resampled to a new rate of 22050 Hz, then the extracted mel spectrogram was normalized and finally scaled. Regarding the CNN, only the last layer was modified: it was replaced with a linear layer with two output channels, and the whole model was fine-tuned without freezing the previous layers. Training was carried out for 50 epochs; the learning rate started at 2e-4 and decayed by a factor of 10 every 10 epochs; the loss function used was the CrossEntropyLoss. All parameters used to compute the spectrum are shown in Table 3.

Table 3
Parameters used to compute the Spectrogram

  Log Mel Spectrum Parameters
  Sample rate            22050
  Window length           2048
  Hop length               512
  N mels                   128

5.2. Evaluations

In Table 4 we show the accuracy of our architectures, compared with other architectures [78]. As we can see, our first model is the one with the lowest score. This means that, despite the ability of the SVM to avoid overfitting on the small quantity of data provided, it cannot accurately detect the speakers affected by SLI. This is probably due to the magnitude of the feature space extracted by the wav2vec 2.0 model.

Table 4
Architectures Accuracy

  Models                         Accuracy
  LASSO (Full Model) [78]          0.84
  1NN CHI Strategy [79]            0.8832
  LMT BL Strategy [79]             0.9269
  MLP BL Strategy [79]             0.9013
  NB BL Strategy [79]              0.9269
  CNN [80]                         0.8421

  Our Models                     Accuracy
  Wav2vec2.0 + SVM                 0.6627
  Wav2vec2.0 + FC                  0.9661
  Log Mel Spectrogram + CNN        0.9362

Using instead an augmented dataset together with a DL approach, we reached a very high accuracy, the highest among our models. The wav2vec 2.0 feature extractor, having enough data to work with, managed to extract the key features and information needed to correctly identify whether a voice belongs to a healthy speaker or an impaired one.

The CNN model that was fine-tuned with Log Mel Spectrum features achieved great accuracy in labeling samples. Unfortunately, through a closer analysis of the confusion matrices shown in Figs. 5, 6, and 7, we discovered that its number of false negatives is extremely high compared to its false positives. In the medical field, especially for tools helping with diagnosis, it is crucial to have the smallest possible number of false negatives, since an undetected disease is much worse than a false positive: medical operators could be missing a lot of vital anomalies and, in time, they will lose trust in the system. In our case recall is far more important than precision. From Table 5 we can see that the CNN model reaches a recall of only 0.85, whereas wav2vec 2.0 with the FC layer achieves a better recall and F1 Score.

6. Limitations and Future Works

It is of critical importance to examine our achievements and acknowledge the constraints that affect our work.




While our research has given promising results, the following section delves into the limitations that shape our results and sets a base for future possible improvements.

Table 5
Architectures overall performances

  Model                   F1 Score    Precision    Recall
  Wav2vec2.0 + SVM         0.6316      0.6923      0.5806
  Wav2vec2.0 + FC          0.9655      0.9641      0.9668
  Spectrogram + CNN        0.9187      0.9983      0.8508

Figure 5: Wav2vec 2.0 + SVM confusion matrix (classes on both axes: Healthy, SLI)
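As a sanity check on Table 5, the short sketch below recomputes the F1 Score as the harmonic mean of precision and recall and derives the miss rate (1 − recall), i.e. the fraction of positive cases that go undetected. The function names are illustrative, and treating SLI as the positive class is an assumption consistent with the false-negative discussion in Section 5.2. The recomputed F1 values agree with the reported ones to within the rounding of the four-decimal inputs.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def miss_rate(recall):
    """Fraction of positive (assumed: SLI) samples that are not detected."""
    return 1.0 - recall


# (precision, recall, reported F1) copied from Table 5
TABLE_5 = {
    "Wav2vec2.0 + SVM": (0.6923, 0.5806, 0.6316),
    "Wav2vec2.0 + FC": (0.9641, 0.9668, 0.9655),
    "Spectrogram + CNN": (0.9983, 0.8508, 0.9187),
}

for model, (precision, recall, reported_f1) in TABLE_5.items():
    computed_f1 = f1_score(precision, recall)
    print(
        f"{model}: computed F1 = {computed_f1:.4f}, "
        f"reported F1 = {reported_f1:.4f}, "
        f"miss rate = {miss_rate(recall):.4f}"
    )
```

Read this way, the CNN's recall of 0.8508 corresponds to missing roughly 15% of the positive samples, against roughly 3% for wav2vec 2.0 with the FC layer, which is why its near-perfect precision does not compensate in a diagnostic setting.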

6.1. Limitations

The lack of data quantity and quality is one of our major constraints. The problem of data scarcity has already been addressed in Section 3, so we will now talk about quality.

In the realm of ML and DL, it has been well documented that low-quality data and disparities in data collection methodologies exacerbate the inherent biases within the data when it is used for training algorithms; a clear example is given by the societal or political biases reflected in word embeddings or large language models [81, 82]. This concern arises when the data collected for training purposes exhibits significant variations in quality and collection techniques, resulting in a heightened vulnerability to intrinsic biases within the data. Such biases can subsequently propagate through the training process, influencing the performance and fairness of ML and DL algorithms and, given how accessible such tools are, leading to further disparities and discrimination in the real world [83, 84]. In our work in particular, the collection of English speakers affected by SLI has the limitation of containing mostly speakers with American accents. In real-world applications this can have negative effects on the model performance: for example, the algorithm could achieve better results with American speakers than with Mexican ones, or with other English-speaking minority ethnic groups whose accent differs from the standard American one [84].

Another limitation of our dataset is that it does not contain child speakers. This is because finding such materials on the web is often difficult, and it is even more difficult to create them from scratch, due to the small number of children with a certified SLI and, since they are minors, due to stricter privacy concerns. The most used dataset in this field [85] consists of one-second clips of Czech-speaking children, both healthy and affected by SLI. Although this dataset could be useful for the detection of SLI, it is limited to the Czech language and to child speakers. This kind of limitation is common in the healthcare field, especially in SLI detection.

6.2. Future Works

Future works should focus on the creation of a new dataset comprising people speaking different languages, since it is not yet known, to our knowledge, whether fluency problems generalize across all languages, and covering a wide age range, given that the features and overall characteristics of the voice generally differ between children and adults due to their anatomical differences [86].

Given the technological advancement in the field of generative audio, with astonishing tools such as the audio manipulation software produced by ElevenLabs [87], which can clone voices, generate new ones, translate them into other languages, and make them read texts, new kinds of audio enhancement can be experimented with. Although such tools cannot be used now, because they cannot yet replicate stuttering or the other fluency features that characterize people affected by SLI, they are promising tools to take into consideration for the near future.

7. Conclusions

This work proposes a novel approach to Speech and Language Impairment (SLI) detection, based solely on audio and AI audio-based techniques, together with an entirely new dataset composed of English speakers affected by SLI. The results show that, even with some limitations related to the scarcity of data available, Deep Learning methods can accurately distinguish healthy from impaired speakers. In particular, wav2vec 2.0, with a Fully Connected layer as the classification head, reaches an accuracy of over 96% on our test set. Our findings also confirm that audio data augmentation techniques are fundamental to training Deep Learning models adequately.




Figure 6: Wav2vec 2.0 + FC confusion matrix (classes on both axes: Healthy, SLI)

Figure 7: Log Mel Spectrogram + CNN confusion matrix (classes on both axes: Healthy, SLI)

References

 [1] A. Le Glaz, Y. Haralambous, D.-H. Kim-Dufor, P. Lenca, R. Billot, T. C. Ryan, J. Marsh, J. DeVylder, M. Walter, S. Berrouiguet, C. Lemey, Machine learning and natural language processing in mental health: Systematic review, J Med Internet Res 23 (2021) e15708.
 [2] D. I. Patrício, R. Rieder, Computer vision and artificial intelligence in precision agriculture for grain crops: A systematic review, Computers and Electronics in Agriculture 153 (2018) 69–81. URL: https://www.sciencedirect.com/science/article/pii/S0168169918305829. doi:10.1016/j.compag.2018.08.001.
 [3] I. E. Tibermacine, A. Tibermacine, W. Guettala, C. Napoli, S. Russo, Enhancing sentiment analysis on seed-iv dataset with vision transformers: A comparative study, in: ACM International Conference Proceeding Series, 2023, pp. 238–246. doi:10.1145/3638985.3639024.
 [4] S. B. Scruggs, K. Watson, A. I. Su, H. Hermjakob, J. R. Yates, M. L. Lindsey, P. Ping, Harnessing the heart of big data, Circulation Research 116 (2015) 1115–1119. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-84930702552&doi=10.1161%2fCIRCRESAHA.115.306013&partnerID=40&md5=4244ed52d2f51fa5c08f02ca67e4103e. doi:10.1161/CIRCRESAHA.115.306013.
 [5] E. Iacobelli, V. Ponzi, S. Russo, C. Napoli, Eye-tracking system with low-end hardware: Development and evaluation, Information (Switzerland) 14 (2023). doi:10.3390/info14120644.
 [6] M. Woźniak, D. Połap, R. K. Nowicki, C. Napoli, G. Pappalardo, E. Tramontana, Novel approach toward medical signals classifier, in: Proceedings of the International Joint Conference on Neural Networks, volume September 2015, 2015. doi:10.1109/IJCNN.2015.7280556.
 [7] P. Zinemanas, M. Rocamora, M. Miron, F. Font, X. Serra, An interpretable deep learning model for automatic sound classification, Electronics 10 (2021). URL: https://www.mdpi.com/2079-9292/10/7/850. doi:10.3390/electronics10070850.
 [8] M. Azimi, U. Roedig, Room Identification with Personal Voice Assistants (Extended Abstract), 2022, pp. 317–327. doi:10.1007/978-3-030-95484-0_19.
 [9] J. Kapočiūtė-Dzikienė, A domain-specific generative chatbot trained from little data, Applied Sciences 10 (2020). URL: https://www.mdpi.com/2076-3417/10/7/2221. doi:10.3390/app10072221.
[10] S. Shah, Z. Tariq, Y. Lee, Audio iot analytics for home automation safety, 2018, pp. 5181–5186. doi:10.1109/BigData.2018.8622587.
[11] F. Fiani, S. Russo, C. Napoli, An advanced solution based on machine learning for remote emdr therapy, Technologies 11 (2023). doi:10.3390/technologies11060172.
[12] S. Gholizadeh, Z. Leman, B. T. Baharudin, A review of the application of acoustic emission technique in engineering, Structural Engineering and Mechanics 54 (2015) 1075–1095. doi:10.12989/sem.2015.54.6.1075.
[13] H. Lozano, I. Hernáez, A. Picón, J. Camarena, E. Navas, Audio classification techniques in home environments for elderly/dependant people, in: K. Miesenberger, J. Klaus, W. Zagler, A. Karshmer (Eds.), Computers Helping People with Special Needs, Springer Berlin Heidelberg, Berlin, Heidelberg, 2010, pp. 320–323.
[14] D. T. Blumstein, D. J. Mennill, P. Clemins, L. Girod, K. Yao, G. Patricelli, J. L. Deppe, A. H. Krakauer, C. Clark, K. A. Cortopassi, S. F. Hanser, B. McCowan, A. M. Ali, A. N. G. Kirschel, Acoustic monitoring in terrestrial environments using microphone arrays: applications, technological considerations and prospectus, Journal of Applied Ecology 48 (2011) 758–767. URL: https://besjournals.onlinelibrary.wiley.com/doi/abs/10.




     1111/j.1365-2664.2011.01993.x. doi:https://doi.                Disabil. 52 (2019) 351–365.
     org/10.1111/j.1365-2664.2011.01993.x.                     [26] J. C. Lee, J. B. Tomblin, Reinforcement learning in
[15] D. V. M. Bishop, Uncommon understanding: Devel-                young adults with developmental language impair-
     opment and disorders of language comprehension                 ment, Brain Lang. 123 (2012) 154–163.
     in children, Psychology Press/Erlbaum (UK) Taylor         [27] R. A. Ahire, Nitin, A. Wagh, Eeg based identifica-
     and Francis, 1997.                                             tion of learning disabilities using machine learning
[16] H. CLAHSEN, The grammatical characterization of                algorithms, J Neurol Disord (2022).
     developmental dysphasia 27 (1989) 897–920. URL:           [28] O. O. Abayomi-Alli, R. Damaševičius, A. Qazi,
     https://doi.org/10.1515/ling.1989.27.5.897. doi:doi:           M. Adedoyin-Olowe, S. Misra, Data augmenta-
     10.1515/ling.1989.27.5.897.                                    tion and deep learning methods in sound classi-
[17] M. L. Rice, K. Wexler, P. L. Cleave, Specific language         fication: A systematic review, Electronics 11 (2022).
     impairment as a period of extended optional infinitive, Journal of Speech, Language, and
     Hearing Research 38 (1995) 850–863. URL: https://pubs.asha.org/doi/abs/10.1044/jshr.3804.850.
     doi:10.1044/jshr.3804.850.
[18] H. K. van der Lely, Domain-specific cognitive systems: insight from grammatical-sli, Trends
     in Cognitive Sciences 9 (2005) 53–59. URL: https://doi.org/10.1016/j.tics.2004.12.002.
     doi:10.1016/j.tics.2004.12.002.
[19] D. V. M. Bishop, Ten questions about terminology for children with unexplained language
     problems, Int. J. Lang. Commun. Disord. 49 (2014) 381–415.
[20] J. H. Beitchman, B. Wilson, E. B. Brownlie, H. Walters, A. Inglis, W. Lancee, Long-term
     consistency in speech/language profiles: II. behavioral, emotional, and social outcomes,
     J. Am. Acad. Child Adolesc. Psychiatry 35 (1996) 815–825.
[21] T. L. Stanton-Chapman, L. M. Justice, L. E. Skibbe, S. L. Grant, Social and behavioral
     characteristics of preschoolers with specific language impairment, Topics in Early
     Childhood Special Education 27 (2007) 98–109. URL:
     https://doi.org/10.1177/02711214070270020501. doi:10.1177/02711214070270020501.
[22] J. Wagner, D. Schiller, A. Seiderer, E. André, Deep learning in paralinguistic recognition
     tasks: Are hand-crafted features still relevant?, in: Interspeech, 2018. URL:
     https://api.semanticscholar.org/CorpusID:52192644.
[23] Y. Tokozume, T. Harada, Learning environmental sounds with end-to-end convolutional neural
     network, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
     (ICASSP), 2017, pp. 2721–2725. doi:10.1109/ICASSP.2017.7952651.
[24] J. Lee, J. Park, K. L. Kim, J. Nam, Samplecnn: End-to-end deep convolutional neural networks
     using very small filters for music classification, Applied Sciences 8 (2018). URL:
     https://www.mdpi.com/2076-3417/8/1/150. doi:10.3390/app8010150.
[25] L. M. Justice, W.-Y. Ahn, J. A. R. Logan, Identifying children with clinical language
     disorder: An application of machine-learning classification, J. Learn.

     URL: https://www.mdpi.com/2079-9292/11/22/3795. doi:10.3390/electronics11223795.
[29] C. Napoli, G. Pappalardo, E. Tramontana, Using modularity metrics to assist move method
     refactoring of large systems, in: Proceedings - 2013 7th International Conference on
     Complex, Intelligent, and Software Intensive Systems, CISIS 2013, 2013, pp. 529–534.
     doi:10.1109/CISIS.2013.96.
[30] D. Połap, M. Woźniak, C. Napoli, E. Tramontana, Real-time cloud-based game management
     system via cuckoo search algorithm, International Journal of Electronics and
     Telecommunications 61 (2015) 333–338. doi:10.1515/eletel-2015-0043.
[31] Z. Mushtaq, S.-F. Su, Q.-V. Tran, Spectral images based environmental sound classification
     using cnn with meaningful data augmentation, Applied Acoustics 172 (2021) 107581. URL:
     https://www.sciencedirect.com/science/article/pii/S0003682X2030685X.
     doi:10.1016/j.apacoust.2020.107581.
[32] M. Woźniak, D. Połap, C. Napoli, E. Tramontana, Graphic object feature extraction system
     based on cuckoo search algorithm, Expert Systems with Applications 66 (2016) 20–31.
     doi:10.1016/j.eswa.2016.08.068.
[33] S. Chen, E. Dobriban, J. Lee, Invariance reduces variance: Understanding data augmentation
     in deep learning and beyond, ArXiv abs/1907.10905 (2019). URL:
     https://api.semanticscholar.org/CorpusID:198895147.
[34] C. Shorten, T. M. Khoshgoftaar, A survey on image data augmentation for deep learning,
     Journal of Big Data 6 (2019) 60. URL: https://doi.org/10.1186/s40537-019-0197-0.
     doi:10.1186/s40537-019-0197-0.
[35] C. Napoli, G. Pappalardo, E. Tramontana, Z. Marszalek, D. Polap, M. Wozniak, Simplified
     firefly algorithm for 2d image key-points search, in: IEEE SSCI 2014 - 2014 IEEE Symposium
     Series on Computational Intelligence - CIHLI 2014: 2014 IEEE Symposium on Computational
     Intelligence for Human-Like Intelligence, Proceedings, 2014. doi:10.1109/CIHLI.2014.7013395.
[36] A. Greco, N. Petkov, A. Saggese, M. Vento, Aren:
     A deep learning approach for sound event recognition using a brain inspired representation,
     IEEE Transactions on Information Forensics and Security 15 (2020) 3610–3624.
     doi:10.1109/TIFS.2020.2994740.
[37] D. Połap, M. Woźniak, C. Napoli, E. Tramontana, R. Damaševičius, Is the colony of ants able
     to recognize graphic objects?, Communications in Computer and Information Science 538
     (2015) 376–387. doi:10.1007/978-3-319-24770-0_33.
[38] Z. Mushtaq, S.-F. Su, Environmental sound classification using a regularized deep
     convolutional neural network with data augmentation, Applied Acoustics 167 (2020) 107389.
     URL: https://www.sciencedirect.com/science/article/pii/S0003682X2030493X.
     doi:10.1016/j.apacoust.2020.107389.
[39] O. Novotny, O. Plchot, O. Glembek, J. H. Cernocky, L. Burget, Analysis of dnn speech signal
     enhancement for robust speaker recognition, Computer Speech and Language 58 (2019) 403–421.
     URL: https://www.sciencedirect.com/science/article/pii/S0885230818303607.
     doi:10.1016/j.csl.2019.06.004.
[40] S. Mertes, A. Baird, D. Schiller, B. W. Schuller, E. André, An evolutionary-based generative
     approach for audio data augmentation, in: 2020 IEEE 22nd International Workshop on
     Multimedia Signal Processing (MMSP), 2020, pp. 1–6. doi:10.1109/MMSP48831.2020.9287156.
[41] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le, SpecAugment: A
     simple data augmentation method for automatic speech recognition, in: Interspeech 2019,
     ISCA, 2019. URL: https://doi.org/10.21437/interspeech.2019-2680.
     doi:10.21437/interspeech.2019-2680.
[42] E. K. Wang, J. Yu, C.-M. Chen, S. Kumari, J. J. P. C. Rodrigues, Data augmentation for
     internet of things dialog system, Mobile Networks and Applications 27 (2022) 158–171. URL:
     https://doi.org/10.1007/s11036-020-01638-9. doi:10.1007/s11036-020-01638-9.
[43] T. Koike, K. Qian, B. W. Schuller, Y. Yamamoto, Transferring cross-corpus knowledge: An
     investigation on data augmentation for heart sound classification, in: 2021 43rd Annual
     International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC),
     IEEE, 2021.
[44] V. Basu, S. Rana, Respiratory diseases recognition through respiratory sound with the help
     of deep neural network, in: 2020 4th International Conference on Computational Intelligence
     and Networks (CINE), 2020, pp. 1–6. doi:10.1109/CINE48825.2020.234388.
[45] Y. Leng, W. Zhao, C. Lin, C. Sun, R. Wang, Q. Yuan, D. Li, Lda-based data augmentation
     algorithm for acoustic scene classification, Knowledge-Based Systems 195 (2020) 105600.
     doi:10.1016/j.knosys.2020.105600.
[46] V. M. Praseetha, P. P. Joby, Speech emotion recognition using data augmentation,
     International Journal of Speech Technology 25 (2022) 783–792. URL:
     https://doi.org/10.1007/s10772-021-09883-3. doi:10.1007/s10772-021-09883-3.
[47] S. Lalitha, D. Gupta, M. Zakariah, Y. A. Alotaibi, Investigation of multilingual and
     mixed-lingual emotion recognition using enhanced cues with data augmentation, Applied
     Acoustics 170 (2020) 107519. URL:
     https://www.sciencedirect.com/science/article/pii/S0003682X2030623X.
     doi:10.1016/j.apacoust.2020.107519.
[48] M. Schmitt, C. Janott, V. Pandit, K. Qian, C. Heiser, W. Hemmert, B. Schuller, A
     bag-of-audio-words approach for snore sounds’ excitation localisation, in: Speech
     Communication; 12. ITG Symposium, 2016, pp. 1–5.
[49] H. Lachambre, B. Ricaud, G. Stempfel, B. Torrésani, C. Wiesmeyr, D. Onchis-Moaca, Optimal
     window and lattice in gabor transform. application to audio analysis, in: 2015 17th
     International Symposium on Symbolic and Numeric Algorithms for Scientific Computing
     (SYNASC), 2015, pp. 109–112. doi:10.1109/SYNASC.2015.25.
[50] E. Garcia-Ceja, M. Riegler, A. K. Kvernberg, J. Torresen, User-adaptive models for activity
     and emotion recognition using deep transfer learning and data augmentation, User Modeling
     and User-Adapted Interaction 30 (2020) 365–393. URL:
     https://doi.org/10.1007/s11257-019-09248-1. doi:10.1007/s11257-019-09248-1.
[51] H. Ykhlef, F. Ykhlef, S. Chiboub, Experimental design and analysis of sound event detection
     systems: Case studies, in: 2019 6th International Conference on Image and Signal Processing
     and their Applications (ISPA), 2019, pp. 1–6. doi:10.1109/ISPA48434.2019.8966798.
[52] S. Lalitha, D. Gupta, M. Zakariah, Y. A. Alotaibi, Investigation of multilingual and
     mixed-lingual emotion recognition using enhanced cues with data augmentation, Applied
     Acoustics 170 (2020) 107519. URL:
     https://www.sciencedirect.com/science/article/pii/S0003682X2030623X.
     doi:10.1016/j.apacoust.2020.107519.
[53] V.-T. Tran, W.-H. Tsai, Stethoscope-sensed speech and breath-sounds for person
     identification with sparse training data, IEEE Sensors Journal 20 (2020) 848–859.
     doi:10.1109/JSEN.2019.2945364.
[54] T. A. M. Celin, T. Nagarajan, P. Vijayalakshmi, Data augmentation using virtual microphone
     array synthesis and multi-resolution feature extraction for isolated word dysarthric speech
     recognition, IEEE Journal of Selected Topics in Signal Processing 14 (2020) 346–354.
     doi:10.1109/JSTSP.2020.2972161.
[55] J. Ye, T. Kobayashi, M. Murakawa, Urban sound event classification based on local and
     global features aggregation, Applied Acoustics 117 (2017) 246–256. URL:
     https://www.sciencedirect.com/science/article/pii/S0003682X16302274.
     doi:10.1016/j.apacoust.2016.08.002. Acoustics in Smart Cities.
[56] V. Ramesh, K. Vatanparvar, E. Nemati, V. Nathan, M. M. Rahman, J. Kuang, CoughGAN:
     Generating synthetic coughs that improve respiratory disease classification, Annu Int Conf
     IEEE Eng Med Biol Soc 2020 (2020) 5682–5688.
[57] N. Yella, B. Rajan, Data augmentation using gan for sound based covid 19 diagnosis, in:
     2021 11th IEEE International Conference on Intelligent Data Acquisition and Advanced
     Computing Systems: Technology and Applications (IDAACS), volume 2, 2021, pp. 606–609.
     doi:10.1109/IDAACS53288.2021.9660990.
[58] H. Lee, J. Lee, Neural network prediction of sound quality via domain knowledge-based data
     augmentation and bayesian approach with small data sets, Mechanical Systems and Signal
     Processing 157 (2021) 107713. URL:
     https://www.sciencedirect.com/science/article/pii/S0888327021001084.
     doi:10.1016/j.ymssp.2021.107713.
[59] D. Koszewski, B. Kostek, Musical instrument tagging using data augmentation and effective
     noisy data processing, Journal of the Audio Engineering Society 68 (2020) 57–65.
     doi:10.17743/jaes.2019.0050.
[60] Z. Zhang, J. Han, K. Qian, C. Janott, Y. Guo, B. Schuller, Snore-GANs: Improving automatic
     snore sound classification with synthesized data, IEEE J Biomed Health Inform 24 (2019)
     300–310.
[61] K. P. Seastedt, P. Schwab, Z. O’Brien, E. Wakida, K. Herrera, P. G. F. Marcelo,
     L. Agha-Mir-Salim, X. B. Frigola, E. B. Ndulue, A. Marcelo, L. A. Celi, Global healthcare
     fairness: We should be sharing more, not less, data, PLOS Digital Health 1 (2022) 1–13.
     URL: https://doi.org/10.1371/journal.pdig.0000102. doi:10.1371/journal.pdig.0000102.
[62] M. Paul, L. Maglaras, M. A. Ferrag, I. Almomani, Digitization of healthcare sector: A study
     on privacy and security concerns, ICT Express 9 (2023) 571–588. URL:
     https://www.sciencedirect.com/science/article/pii/S2405959523000243.
     doi:10.1016/j.icte.2023.02.007.
[63] A. I. Stoumpos, F. Kitsios, M. A. Talias, Digital transformation in healthcare: Technology
     acceptance and its applications, Int J Environ Res Public Health (2023).
     doi:10.3390/ijerph20043407.
[64] G. Gopal, C. Suter-Crazzolara, L. Toldo, W. Eberhardt, Digital transformation in healthcare
     – architectures of present and future information technologies, Clinical Chemistry and
     Laboratory Medicine (CCLM) 57 (2019) 328–335. URL: https://doi.org/10.1515/cclm-2018-0658.
     doi:10.1515/cclm-2018-0658.
[65] Wave file format specification. URL:
     https://www.mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html.
[66] A. Baevski, H. Zhou, A. Mohamed, M. Auli, Wav2vec 2.0: A framework for self-supervised
     learning of speech representations, in: Proceedings of the 34th International Conference on
     Neural Information Processing Systems, NIPS’20, Curran Associates Inc., Red Hook, NY, USA,
     2020.
[67] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: An asr corpus based on public
     domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal
     Processing (ICASSP), 2015, pp. 5206–5210. doi:10.1109/ICASSP.2015.7178964.
[68] T. Grósz, D. Porjazovski, Y. Getman, S. Kadiri, M. Kurimo, Wav2vec2-based paralinguistic
     systems to recognise vocalised emotions and stuttering, in: Proceedings of the 30th ACM
     International Conference on Multimedia, MM ’22, Association for Computing Machinery, New
     York, NY, USA, 2022, pp. 7026–7029. URL: https://doi.org/10.1145/3503161.3551572.
     doi:10.1145/3503161.3551572.
[69] J. Liu, A. Wumaier, D. Wei, S. Guo, Automatic speech disfluency detection using wav2vec2.0
     for different languages with variable lengths, Applied Sciences 13 (2023) 7579.
[70] I. Jordal, A. Tamazian, E. T. Chourdakis, Angonin, askskro, N. Karpov, T. Dhyani,
     O. Sarioglu, kvilouras, E. Berk, F. Mirus, J.-Y. Lee, K. Choi, MarvinLvn, SolomidHero,
     T. Alum, iver56/audiomentations: v0.33.0. doi:10.5281/zenodo.7010042.
[71] N. Richardson, I. Cook, N. Crane, D. Dunnington, R. François, J. Keane,
     D. Moldovan-Grünfeld, J. Ooms, Apache Arrow, arrow: Integration to ’Apache’ ’Arrow’, 2023.
     https://github.com/apache/arrow/, https://arrow.apache.org/docs/r/.
[72] B. Truax, Handbook for acoustic ecology, Leonardo 13 (1980) 83.
[73] R. N. Bracewell, The Fourier transform and its applications, volume 31999, McGraw-Hill New
     York, 1986.
[74] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (1995) 273–297.
[75] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, 2015.
     arXiv:1512.03385.
[76] T. W. Clément Delangue, Julien Chaumond, Wav2vec2, 2022. URL:
     https://huggingface.co/docs/transformers/model_doc/wav2vec2.
[77] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
     P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
     M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine
     Learning Research 12 (2011) 2825–2830.
[78] L. M. Justice, W.-Y. Ahn, J. A. Logan, Identifying children with clinical language
     disorder: an application of machine-learning classification, Journal of learning
     disabilities 52 (2019) 351–365.
[79] J. Gaspers, K. Thiele, P. Cimiano, A. Foltz, P. Stenneken, M. Tscherepanow, An evaluation
     of measures to dissociate language and communication disorders from healthy controls using
     machine learning techniques, in: Proceedings of the 2nd acm sighit international health
     informatics symposium, 2012, pp. 209–218.
[80] C. Kanimozhiselvi, S. Santhiya, Communication disorder identification from recorded speech
     using machine learning assisted mobile application, in: 2021 Third International Conference
     on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), IEEE, 2021,
     pp. 789–793.
[81] A. Caliskan, J. J. Bryson, A. Narayanan, Semantics derived automatically from language
     corpora contain human-like biases, Science 356 (2017) 183–186.
[82] D. Rozado, The political biases of ChatGPT, Soc. Sci. (Basel) 12 (2023) 148.
[83] M. A. Gianfrancesco, S. Tamang, J. Yazdany, G. Schmajuk, Potential biases in machine
     learning algorithms using electronic health record data, JAMA Intern. Med. 178 (2018)
     1544–1547.
[84] H. Ibrahim, X. Liu, N. Zariffa, A. D. Morris, A. K. Denniston, Health data poverty: an
     assailable barrier to equitable digital health care, The Lancet Digital Health 3 (2021)
     e260–e265. URL: https://www.sciencedirect.com/science/article/pii/S2589750020303174.
     doi:10.1016/S2589-7500(20)30317-4.
[85] P. Grill, J. Tučková, Speech databases of typical children and children with SLI, PLoS One
     11 (2016) e0150365.
[86] A. McAllister, P. Sjölander, Children’s voice and voice disorders, in: Seminars in speech
     and language, volume 34, Thieme Medical Publishers, 2013, pp. 071–079.
[87] M. S. Piotr Dabkowski, Eleven labs, 2023. URL: https://elevenlabs.io/voice-lab.