<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Cagliari, Italy. * Corresponding author. † These authors contributed equally. lschettino@unibz.it (L. Schettino); vincenzonorman.vitale@unina.it (V. N. Vitale); mmaffia@unior.it (M. Maffia)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using End-to-End Automatic Speech Recognisers' Internals to Model Disfluencies in Italian Patients with Early-stage Parkinson's Disease.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Loredana Schettino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Norman Vitale</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Maffia</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>Piazza Università, 1, 39100 Bolzano</addr-line>
          ,
          <country country="IT">Italia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Naples Federico II</institution>
          ,
          <addr-line>C.so Umberto I, 40, 80138 Napoli</addr-line>
          ,
          <country country="IT">Italia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Naples L'Orientale</institution>
          ,
          <addr-line>Via Chiatamone 61/62 - 80121 Napoli</addr-line>
          ,
          <country country="IT">Italia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Alterations in speakers' articulation and phonation are among the earliest symptoms of Parkinson's Disease (PD). However, clinical decision-making is currently based on holistic ratings of speech intelligibility, while studies on PD detection mostly involve highly complex and hardly interpretable models. This study builds upon previous work on Italian showing that the characteristics of disfluency phenomena may be considered an index of impairment at the very onset of the disease. It investigates whether even less complex (supervised) end-to-end speech recognition systems (E2E ASR) can model disfluency phenomena in Italian PD speech, and how this could support PD discrimination tasks. Exploiting the ability of E2E ASRs to progressively model features useful for discriminating between PD and non-PD speakers provides valuable insight into the ASRs' internal dynamics, as well as for the development of decision support systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Disfluencies</kwd>
        <kwd>Spontaneous Speech</kwd>
        <kwd>Parkinson's Disease</kwd>
        <kwd>Conformer</kwd>
        <kwd>Probing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Parkinson's Disease (PD) is a chronic neurodegenerative disorder steadily on the rise in terms of prevalence and incidence [1, 2]: more than 10 million individuals worldwide are affected by PD, mainly among the population aged 65 and over, and this number is expected to increase in demographically ageing societies. Caused by the deterioration or loss of dopaminergic neurons in the substantia nigra of the basal ganglia, PD is generally diagnosed based on clinical criteria, such as the medical history and physical/neurological examinations of the patient. Although several experimental studies have shown that speech and voice alterations are among the earliest symptoms of PD [3, 4, 5], this precious information is poorly used in clinical decision-making. In the Unified Parkinson's Disease Rating Scale (UPDRS), the rating tool used to assess the severity and monitor the progression of the disease [6], only one item (3.1) concerns the patient's speech, and it suggests an assessment based on the clinician's perception, considering above all intelligibility. The application of advanced and sustainable methods of acoustic data analysis could therefore be beneficial, especially in the diagnostic phase: while a cure for PD has yet to be found, early diagnosis is crucial for access to pharmacological and non-pharmacological interventions.</p>
      <p>However, developing machine learning tools for critical areas like early Parkinson's disease detection is significantly hampered by data scarcity. Acquiring the necessary data is both costly, requiring specialized linguistic and medical experts, and complex, given the inherently (and fortunately) small patient sample size. To overcome this limitation, our study explores the use of latent features encoded within pre-trained Automatic Speech Recognition (ASR) models. This approach explores the possibility of efficiently utilizing the limited available data by leveraging knowledge distilled from vast quantities of data not originally intended for this purpose. Additionally, to further enhance the procedure, it focuses on specific speech features that were observed to play a significant role in discriminating PD speech, even at the early stages of the disease, namely speech disfluency patterns [<xref ref-type="bibr" rid="ref14">7</xref>]. We believe that such a method optimises the use of the available data by integrating domain-specific and computational knowledge, and can thus support the development of decision support systems in data-scarce critical contexts, as exemplified by early Parkinson's detection.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-0">
        <title>2.1. Parkinson's Disease Speech and Disfluency Patterns</title>
        <p>The loss of dopamine in the central nervous system causes motor impairments and also affects laryngeal, respiratory, and articulatory functions, with about 90% of individuals with PD suffering from voice and speech disorders [8]. PD-related hypokinetic dysarthria includes a range of alterations: hypophonia (reduced voice volume), dysphonia (changes in voice quality), dysrhythmia, reduced speech rate, monopitch, and imprecise articulation [9, 10, 11, 12]. Parkinsonian speech is also commonly referred to as 'disfluent', although a detailed and comprehensive description of the specific characteristics of disrupted PD speech has not yet been provided [13]. Studies have mostly focused on specific types of disfluencies: in [14, 15], for example, stuttering-like disruptions (one-syllable word repetitions, sound and syllable repetitions, sound prolongations, and blocks) were observed in PD patients and healthy speakers, and greater disfluency percentages were found in pathological speech, supporting the relationship between stuttering and the functions of the basal ganglia. In a work on repetitive speech phenomena (both hyperfluent and dysfluent) [16], a positive correlation between the frequency of disfluencies and the duration of PD was found. However, studies have not always considered the functions of disfluency phenomena in PD speech, and they have mostly involved mild-to-severe, strongly disfluent patients in experimental protocols.</p>
        <p>A recent study conducted on Italian early-stage PD subjects and on spontaneous monological speech [17] showed that, even at the beginning of the pathology (when patients' speech is completely intelligible), the observation of disfluency phenomena can reveal some alteration in linguistic planning and processing: the speech of PD patients was found to differ from that of sex- and age-matched healthy speakers in terms of the higher frequency of repairs, the specific functions of hesitations (mostly used by PD patients for lexical retrieval), the location of disfluency phenomena (more often within words in PD speech than in the control group), and the duration of silent pauses (longer in PD than in healthy subjects).</p>
        <p>Furthermore, a more recent study showed that models based on interpretable features such as prosodic, linguistic, and cognitive descriptors can support the evaluation of speech deterioration in PD patients, whereas models based on non-interpretable features achieve higher detection accuracy [19]. These findings are often based on highly functional vocal paradigms, such as the sustained phonation of isolated segments, and involve rather complex models relying on non-interpretable features, or on features commonly observed as useful for PD speech discrimination because they become evident in the mid to advanced stages of the disease [18]. Nonetheless, a recent study showed that PD detection trials relying on a restricted number of meaningful features extracted from connected speech, rather than from isolated speech units, achieve accurate, as well as economical and interpretable, discrimination [<xref ref-type="bibr" rid="ref14">7</xref>]. Also, studies on the interpretability of DNN-based models, using probing techniques, provided evidence that even smaller and less complex models, such as Conformer-based ones [20], can model speech features, and that different features are encoded in DNN layers at different depths [21]. In particular, it was found that higher layers capture phone-identity and word-identity information, and that the last layer before the objective function even captures discriminating features of disfluency phenomena, more specifically filled pauses and prolongations [22].</p>
        <p>In substance, this study builds on the following findings from previous work on Italian PD speech:</p>
        <list list-type="bullet">
          <list-item><p>relying on natural speech material that results from the usual working dynamics of the vocal apparatus during phonation proves useful for discrimination [<xref ref-type="bibr" rid="ref14">7</xref>];</p></list-item>
          <list-item><p>peculiar uses of natural speech phenomena, such as disfluencies, may be considered an index of impairment at the very onset of the disease [17];</p></list-item>
          <list-item><p>less complex, supervised end-to-end speech recognition systems (E2E ASR) can model disfluency-related features useful for discrimination [22].</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Parkinson's Disease Automatic Detection</title>
        <p>Various studies have been devoted to developing automatic and objective tools to support PD diagnosis and the assessment of its severity [18]. Remarkable PD detection accuracy was achieved by leveraging non-interpretable embeddings obtained with Deep Neural Network (DNN)-based self-supervised models, e.g., x-vectors, Wav2Vec 2.0, HuBERT, and TRILLsson representations. On this basis, we investigate whether less complex (supervised) E2E ASR systems can model disfluency features in Italian PD speech and how well this could support PD discrimination tasks.</p>
        <sec id="sec-2-2-2">
          <title>3. Method</title>
          <p>3.1. Data and Annotation</p>
          <p>The study is based on the data described in [12]. It consists of approximately 40 minutes of monologic speech produced by 36 Italian native speakers from the Campania region: 18 participants with idiopathic non-demented PD (10 males, 8 females; 51–81 years of age, M = 65) and 18 age-matched Healthy Controls (HC; 10 males, 8 females; 54–77 years of age, M = 64). The patients were recruited from the Movement Disorders Unit of the First Division of Neurology at the University of Campania "Luigi Vanvitelli". PD participants had no prior history of language or speech disorders, had been diagnosed with Parkinson's disease within the past four years (since 2021), and showed no significant cognitive impairment, major or minor depression, or dysthymic disorder. All participants were asked to discuss the positive and negative aspects of the place they were living in during data collection. They were encouraged to speak in their usual, conversational tone and at a comfortable volume. Sociolinguistic information for each speaker was gathered via a questionnaire, and all participants provided written consent for the data collection process.</p>
          <p>The analysis focused on a series of so-called "disfluency phenomena", defined as speech management phenomena: speech material, e.g., repetitions, segmental prolongations, pauses, and fillers, that speakers can use to monitor and effectively manage the online processes of speech planning, coding, articulation, and reception [23]. The phenomena were identified and annotated based on their context of occurrence, following [17], and included the following phenomena specifically involved in the speech planning process (Cohen's k = 0.82, good agreement [24]):</p>
          <list list-type="bullet">
            <list-item><p>Prolongations (PRLs): marked prolongation of segmental material, e.g., laaa casa (theee house);</p></list-item>
            <list-item><p>Filled Pauses (FPs): non-lexical fillers, vocalizations and/or nasalizations, e.g., eeh, ehm, mhh;</p></list-item>
            <list-item><p>Silent Pauses (SPs): marked silences perceived as a hesitant pause in their context of occurrence;</p></list-item>
            <list-item><p>Lexicalized Filled Pauses (LFPs): lexical fillers that work as discourse markers involved in the coverage of planning time, e.g., diciamo, insomma, appunto... (well, let's say, so, ...);</p></list-item>
            <list-item><p>Repetitions (REPs): repetition of already uttered words or word fragments, e.g., di di (of of) or d- di (o- of).</p></list-item>
          </list>
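          <p>The annotation scheme above can be mirrored in a compact record per annotated interval; the following is a minimal sketch under stated assumptions (the class and field names are ours, not the authors'; we only assume time-stamped, speaker- and type-labelled intervals as in the corpus description):</p>
```python
from dataclasses import dataclass

# The five disfluency types annotated in the corpus.
DISFLUENCY_TYPES = ("PRL", "FP", "SP", "LFP", "REP")

@dataclass
class DisfluencySegment:
    speaker: str   # speaker identifier (hypothetical format)
    group: str     # "PD" or "HC"
    dtype: str     # one of DISFLUENCY_TYPES
    start: float   # onset time in seconds
    end: float     # offset time in seconds

    def duration(self):
        return self.end - self.start

# Example: a filled pause produced by a PD speaker.
seg = DisfluencySegment(speaker="S01", group="PD", dtype="FP",
                        start=12.30, end=12.85)
assert seg.dtype in DISFLUENCY_TYPES
print(round(seg.duration(), 2))
```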
          <list list-type="bullet">
            <list-item><p>selecting pre-trained models: two publicly available Conformer-based [20] models with different decoding components were selected, one with a Connectionist Temporal Classification (CTC) [27] decoder, a non-auto-regressive technique, and one with a Recurrent Neural Network Transducer (RNN-T), commonly known as Transducer, an auto-regressive speech transcription technique;</p></list-item>
            <list-item><p>building Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) classifiers whose inputs are the intermediate emissions of the considered model's encoder layers, combined with the appropriate sequence of labels based on the dataset annotation;</p></list-item>
            <list-item><p>evaluating the classifications relying on metrics oriented to results safety rather than performance.</p></list-item>
          </list>
        </sec>
        <sec id="sec-2-2-3">
          <title>3.2. Probing Approach</title>
          <p>Based on previous studies investigating E2E-ASR models' internal behaviour [21, 25, 22, 26], we employ a probing approach to investigate the pre-trained models' ability to capture speaker- and speech-related markers, i.e., characteristics associated with disfluent speech segments combined with PD biomarkers, and whether these features facilitate PD speech identification. The employed technique thus involves three steps: pre-trained model selection, probing-classifier construction, and safety-oriented evaluation.</p>
          <p>3.2.1. Data Preparation</p>
          <p>The considered dataset has been prepared based on a set of Praat TextGrid annotation files indicating, over the speech signal, the speaker and the type of disfluency. More specifically, PRLs, FPs, SPs, LFPs, and REPs were considered, resulting in a dataset of 850 segments. For each segment, the contextual information preceding and following the disfluency phenomenon has also been considered, giving each segment a length of 4 seconds.</p>
        </sec>
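          <p>The layer-wise probing described above relies on capturing each encoder layer's intermediate emissions. A generic sketch using forward hooks on a toy torch module follows; the real models are NeMo Conformer encoders, so everything here (the stand-in encoder, dimensions, and hook bookkeeping) is illustrative rather than the authors' implementation:</p>
```python
import torch
import torch.nn as nn

# Toy stand-in for a Conformer encoder: a stack of identical blocks.
# Forward hooks let us store each layer's emission for a segment.
encoder = nn.Sequential(*[nn.Linear(40, 40) for _ in range(4)])

captured = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Keep this layer's emission (detached from the graph).
        captured[layer_idx] = output.detach()
    return hook

for idx, layer in enumerate(encoder):
    layer.register_forward_hook(make_hook(idx))

# One 4 s segment at 40 ms per frame gives 100 frames.
frames = torch.randn(1, 100, 40)
encoder(frames)
print(sorted(captured), captured[0].shape)
```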
        <sec id="sec-2-2-4">
          <title>3.2.2. Pre-trained Models</title>
          <p>Pre-trained checkpoints (v1.6.0): https://huggingface.co/nvidia/stt_en_conformer_ctc_large and https://huggingface.co/nvidia/stt_en_conformer_transducer_large.</p>
          <p>Then, for each encoding layer of a considered pre-trained model, we extract a representation of the segments in the corresponding latent space, following the procedure described in Figure 1. In particular, for each segment, we obtain:</p>
          <list list-type="bullet">
            <list-item><p>a sequence of intermediate emissions, namely fragment representations in the corresponding layer's latent space; each fragment corresponds to a portion of the input signal whose duration depends on the considered model's characteristics;</p></list-item>
            <list-item><p>a sequence of labels associated with each fragment, indicating whether that fragment belongs to a disfluency or not and, if so, whether the speaker is PD or HC.</p></list-item>
          </list>
          <p>We selected two publicly available Conformer-based [20] pre-trained models built with the NVIDIA NeMo toolkit, both with a fragment duration of 40 ms and only differing in the decoding component. On the one hand, we considered a CTC decoder, one of the most popular decoding techniques: a non-auto-regressive speech transcription technique that collapses consecutive, identical transcription labels (characters, word pieces, etc.) into one label, unless a special label separates them. The result is a sequence of labels shorter than or equal to the input vector sequence. Being non-auto-regressive, it is also considered computationally effective, requiring less time and fewer resources for training and inference. On the other hand, we considered a Transducer, an auto-regressive speech transcription technique that overcomes CTC's limitations of being non-auto-regressive and subject to a limited label sequence length. The Transducer decoding technique can produce label-transcription sequences longer than the input vector sequence, and it models inter-dependencies among long-term transcription elements.</p>
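          <p>The CTC collapse rule described above can be illustrated in a few lines. This is a toy sketch with "-" standing in for the special separator label; the real models operate on character or word-piece labels:</p>
```python
def ctc_collapse(labels, blank="-"):
    """Collapse runs of identical labels, then drop the blank label."""
    out = []
    prev = None
    for lab in labels:
        if lab != prev:     # a repeated label is kept only once...
            out.append(lab)
        prev = lab
    # ...unless a blank separated the repetition: the blank breaks the
    # run during collapsing and is only removed afterwards.
    return [lab for lab in out if lab != blank]

# "t-t" survives as a genuine double "tt"; plain runs are merged.
print(ctc_collapse(list("cc-aaat-tto")))
```
          <p>Note that the output can never be longer than the input, which is exactly the CTC length limitation the Transducer lifts.</p>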
          <p>The resulting dataset consists of pairs of sequences of emissions (i.e., distilled features) and corresponding labels, identified by the model and the layer from which they were extracted. All experiments used NeMo version 1.21.0.</p>
          <p>A Transducer typically comprises two sub-decoding modules: one that forecasts the next transcription label based on the previous transcriptions (the prediction network), and one that combines the encoder and prediction-network outputs to produce a new transcription label (the joiner network). These features improve transcription speed and performance compared to CTC, while requiring more training and computational resources [28]. Also, the two techniques should provide different representations and contributions during the training phase (backpropagation) due to their different dynamics in forward propagation.</p>
          <p>Each classifier was trained for a maximum of 100 epochs using an Adam optimizer with an initial learning rate of 0.00001. To reduce the risk of overfitting, we introduced a dropout strategy for the LSTM/BiLSTM gates, which statistically excludes (with a 0.1 probability) one neuron and its weights during each training iteration [29]. Finally, an early-stopping mechanism was used to avoid wasting computational resources: the training phase ends if the validation loss does not decrease by a minimum of 0.001 during the last 20 epochs (the patience threshold).</p>
          <p>3.2.4. Evaluation</p>
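          <p>The early-stopping criterion stated above (patience of 20 epochs, minimum improvement of 0.001 on the validation loss) can be sketched as follows; the helper name and the example loss curve are ours:</p>
```python
def should_stop(val_losses, patience=20, min_delta=0.001):
    # Not enough epochs observed yet to apply the criterion.
    if patience >= len(val_losses):
        return False
    recent_best = min(val_losses[-patience:])
    earlier_best = min(val_losses[:-patience])
    # Stop if the last `patience` epochs did not improve the best
    # validation loss by at least `min_delta`.
    return not (earlier_best - recent_best >= min_delta)

# A loss curve that decreases for 10 epochs, then plateaus: training
# stops once no epoch in the last 20 improves the earlier best by 0.001.
curve = [1.0 - 0.01 * i for i in range(10)] + [0.90] * 21
print(should_stop(curve))
```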
          <p>Note that both considered pre-trained models rely on the same encoder architecture, but the Conformer-CTC model has 18 encoding layers, while the Conformer-Transducer encoder has 17 layers. This resulted in 35 different latent-space representations of the considered dataset.</p>
        </sec>
        <sec id="sec-2-2-5">
          <title>3.2.3. Classifiers</title>
          <p>The classifiers internally consist of an LSTM or a BiLSTM module, followed by a Feed-Forward Neural Network (FFNN). The choice of LSTM and BiLSTM modules is driven by their capacity to capture the temporal dependencies in the input, which fits well with our objective of modeling temporal dependencies in the latent-space representation of the speech signal.</p>
          <p>Since our aim is to investigate whether pre-trained E2E ASR models encode features useful for the identification of disfluency phenomena in PD speech, and whether and how they enable the discrimination between PD and HC speakers, we decided to rely on metrics oriented to results safety rather than performance. Note that a sample is classified as PD or HC only if the portion related to the disfluency is (1) detected and (2) at least 60% of its frames are correctly labeled as either PD or HC. The reliability of the approach is assessed by inspecting the confusion matrices of the best LSTM or BiLSTM, CTC-based and RNN-T-based classifiers, which provide a breakdown of correct and incorrect predictions for each class. The quantitative analysis was further supported by a qualitative exploration of the acoustic features emerging as relevant for discrimination, with reference to previous literature [18].</p>
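          <p>The sample-level decision rule stated above (a sample counts only if the disfluency is detected and at least 60% of its frames carry the correct speaker label) can be sketched as follows; the function name, label strings, and rejection categories are our illustrative choices:</p>
```python
def classify_sample(frame_preds, true_group):
    """Apply the safety-oriented decision rule to per-frame labels."""
    # Frames predicted as part of a disfluency (anything not "none").
    disfluent = [p for p in frame_preds if p != "none"]
    if not disfluent:
        return "undetected"       # rule (1): disfluency not detected
    correct = sum(1 for p in disfluent if p == true_group)
    if correct / len(disfluent) >= 0.6:
        return true_group         # rule (2): 60% of frames correct
    return "rejected"

# Per-40 ms-frame labels for one segment of a PD speaker.
preds = ["none", "PD", "PD", "PD", "HC", "none"]
print(classify_sample(preds, "PD"))
```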
            <p>To this aim, the eGeMAPSv02 [30] feature set from the OpenSmile toolkit [31] was selected as the basic feature set and inspected using the Orange software [32].</p>
            <p>Since the LSTM/BiLSTM hidden-layer size is a crucial parameter, we investigate the impact of three different hidden-layer sizes, namely 160, 320, and 640. An LSTM-based classifier processes a sequence of emission vectors and produces a new sequence of vectors whose size equals the hidden-layer size; the two sequences are aligned over time. At each time step, based on the LSTM/BiLSTM hidden-layer output, the FFNN produces a label indicating whether the considered input represents a disfluency segment, pronounced by a PD or HC speaker, or not. In summary, we train and evaluate LSTM- and BiLSTM-based classifiers/detectors for all combinations of hidden size, pre-trained model, and encoder layer, searching for evidence of disfluency-related pathological biomarker properties in the models' decisions; this resulted in about 200 models, implemented and trained with torch==2.2.1 and pytorch-lightning==2.0.7. Note that the minimum difference between consecutive time steps in the input is 40 ms, because the considered ASR models produce emissions at that rate; this defines the temporal sensitivity of our classifier/detector.</p>
            <p>4. Results and Discussion</p>
            <p>In this study, we considered two distinct ASR architectures, namely Conformer-CTC and Conformer-Transducer, selected for the differing capabilities of their decoding components. The two decoding techniques, i.e., CTC and Transducer, being respectively non-auto-regressive and auto-regressive by nature, are inherently capable of capturing diverse aspects of the speech signal, therefore influencing the encoding component in different ways. Figure 2 reports the precision of each trained classifier: it is interesting to observe how the layers closer to the input provide the higher precision, while the overall tendency, getting closer to the decoding component, is a constant reduction, which is likely related to the specific objective of the pre-trained models, namely, to provide a clean transcription.</p>
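          <p>The per-time-step classifier described above (an LSTM or BiLSTM whose hidden states feed an FFNN that labels every frame) can be sketched in torch; the class name, label inventory, and emission dimension are illustrative assumptions, while the hidden sizes match those explored here (160, 320, 640):</p>
```python
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    """LSTM/BiLSTM followed by a feed-forward layer emitting one label
    per time step (e.g. none / PD / HC). Names are ours."""
    def __init__(self, emission_dim, hidden_size=320,
                 bidirectional=False, n_labels=3):
        super().__init__()
        self.rnn = nn.LSTM(emission_dim, hidden_size,
                           batch_first=True, bidirectional=bidirectional)
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.ffnn = nn.Linear(out_dim, n_labels)

    def forward(self, emissions):        # (batch, time, emission_dim)
        hidden, _ = self.rnn(emissions)  # aligned over time with input
        return self.ffnn(hidden)         # (batch, time, n_labels)

clf = ProbingClassifier(emission_dim=512, hidden_size=160,
                        bidirectional=True)
logits = clf(torch.randn(2, 100, 512))   # two 4 s segments, 40 ms frames
print(logits.shape)
```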
            <p>During the training phase, the considered corpus was split into train, validation, and test sets using 60%, 20%, and 20% proportions, while ensuring that these sets did not share the same speakers.</p>
            <p>However, the model that seems to provide the most informative and stable latent representation is the Conformer-CTC (orange line). Panel captions: (a) LSTM with hidden size 320 trained on the dataset represented in the latent space of Conformer-CTC's encoding layer #2; (b) BiLSTM with hidden size 160 trained on the dataset represented in the latent space of Conformer-CTC's encoding layer #2.</p>
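            <p>A speaker-disjoint 60/20/20 split, as described above, can be obtained by partitioning the speakers rather than the segments; the helper below is a sketch with hypothetical names, not the authors' code:</p>
```python
import random

def speaker_disjoint_split(samples, train=0.6, val=0.2, seed=0):
    """Split (speaker_id, segment) pairs 60/20/20 by speaker, so the
    three sets never share a speaker."""
    speakers = sorted({spk for spk, _ in samples})
    random.Random(seed).shuffle(speakers)
    n_train = int(len(speakers) * train)
    n_val = int(len(speakers) * val)
    groups = {
        "train": set(speakers[:n_train]),
        "val": set(speakers[n_train:n_train + n_val]),
        "test": set(speakers[n_train + n_val:]),
    }
    return {name: [s for s in samples if s[0] in spks]
            for name, spks in groups.items()}

# 10 speakers with 3 segments each: 6/2/2 speakers, 18/6/6 segments.
data = [(f"S{i:02d}", f"seg{j}") for i in range(10) for j in range(3)]
splits = speaker_disjoint_split(data)
print({k: len(v) for k, v in splits.items()})
```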
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>The main findings highlight that focusing on speech
correlates such as disfluency phenomena provides a
convenient choice to enhance the development of decision
support systems. The latent representation from the
intermediate encoding layer was shown to be highly
informative and quite reliable for the classification of PD
and HC speakers. In addition, the CTC decoder seemed
to provide slightly more stable performance in this task,
probably due to its non-autoregressive nature. Future
steps will involve a comparison of performance with
different pre-trained models and classification architectures.</p>
      <p>Indeed, since disfluency phenomena encompass different types of phenomena (i.e., textual phenomena, like lexical fillers and repetitions, and phonetic phenomena, like non-lexical fillers and prolongations), different approaches may perform better on specific types.</p>
      <p>Also, the analysis led to observing alterations of acoustic features (not yet perceptually noticeable) revealing the onset of PD-related motor impairment. It is worth noticing that some of the considered disfluency phenomena, namely prolongations and filled pauses, consist of prolonged vocalisations, which are similar to the sustained vowels traditionally used in highly controlled studies on PD
speech. Thus, they provide a valuable integration of data efficacy and ecology.</p>
      <p>References</p>
      <p>[1] W. A. Rocca, The burden of Parkinson's disease: a worldwide perspective, The Lancet Neurology 17 (2018) 928–929.</p>
      <p>[2] J. D. Steinmetz, K. M. Seeher, N. Schiess, E. Nichols, B. Cao, C. Servili, V. Cavallera, E. Cousin, H. Hagins, M. E. Moberg, Global, regional, and national burden of disorders affecting the nervous system, 1990–2021: a systematic analysis for the Global Burden of Disease Study 2021, The Lancet Neurology 23 (2024) 344–381.</p>
      <p>[3] S. Skodda, Analysis of voice and speech performance in Parkinson's disease: a promising tool for the monitoring of disease progression and differential diagnosis, Neurodegenerative Disease Management 2 (2012) 535–545.</p>
      <p>[4] J. Rusz, R. Cmejla, T. Tykalova, H. Ruzickova, J. Klempir, V. Majerova, J. Picmausova, J. Roth, E. Ruzicka, Imprecise vowel articulation as a potential early marker of Parkinson's disease: Effect of speaking task, The Journal of the Acoustical Society of America 134 (2013) 2171–2181.</p>
      <p>[5] A. Favaro, L. Moro-Velázquez, A. Butala, C. Motley, T. Cao, R. D. Stevens, J. Villalba, N. Dehak, Multilingual evaluation of interpretable biomarkers to represent language and speech patterns in Parkinson's disease, Frontiers in Neurology 14 (2023) 1142642.</p>
      <p>[6] C. G. Goetz, B. C. Tilley, S. R. Shaftman, G. T. Stebbins, S. Fahn, P. Martinez-Martin, W. Poewe, C. Sampaio, M. B. Stern, R. Dodel, Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): scale presentation and clinimetric testing results, Movement Disorders: official journal of the Movement Disorder Society 23 (2008) 2129–2170.</p>
      <p>[<xref ref-type="bibr" rid="ref14">7</xref>] M. Maffia, L. Schettino, V. N. Vitale, Automatic detection of Parkinson's disease with connected speech acoustic features: towards a linguistically interpretable approach, in: Proceedings of the 9th Italian Conference on Computational Linguistics, CEUR Workshop Proceedings, 2023.</p>
      <p>[8] L. O. Ramig, C. Fox, S. Sapir, Speech treatment for Parkinson's disease, Expert Review of Neurotherapeutics 8 (2008) 297–309.</p>
      <p>[9] F. L. Darley, A. E. Aronson, J. R. Brown, Clusters of deviant speech dimensions in the dysarthrias, Journal of Speech and Hearing Research 12 (1969) 462–496.</p>
      <p>[10] J. R. Orozco-Arroyave, F. Hönig, J. D. Arias-Londoño, J. F. Vargas-Bonilla, S. Skodda, J. Rusz, E. Nöth, Automatic detection of Parkinson's disease from words uttered in three different languages, in: Fifteenth Annual Conference of the International Speech Communication Association, 2014.</p>
      <p>[11] H. Ackermann, W. Ziegler, Articulatory deficits in parkinsonian dysarthria: an acoustic analysis, Journal of Neurology, Neurosurgery &amp; Psychiatry 54 (1991) 1093–1098.</p>
      <p>[12] M. Maffia, R. De Micco, M. Pettorino, M. Siciliano, A. Tessitore, A. De Meo, Speech rhythm variation in early-stage Parkinson's disease: a study on different speaking tasks, Frontiers in Psychology 12 (2021) 668291.</p>
      <p>[13] A. M. Goberman, M. Blomgren, Parkinsonian speech disfluencies: effects of L-dopa-related fluctuations, Journal of Fluency Disorders 28 (2003) 55–70.</p>
      <p>[14] A. M. Goberman, M. Blomgren, E. Metzger, Characteristics of speech disfluency in Parkinson disease, Journal of Neurolinguistics 23 (2010) 470–478.</p>
      <p>[15] F. S. Juste, F. C. Sassi, J. B. Costa, C. R. F. de Andrade, Frequency of speech disruptions in Parkinson's disease and developmental stuttering: a comparison among speech tasks, PLoS ONE 13 (2018) e0199054.</p>
      <p>[16] T. Benke, C. Hohenstein, W. Poewe, B. Butterworth, Repetitive speech phenomena in Parkinson's disease, Journal of Neurology, Neurosurgery &amp; Psychiatry 69 (2000) 319–324.</p>
      <p>[17] L. Schettino, M. Maffia, R. De Micco, A. Tessitore, Disfluency and speech management in Italian patients with early-stage Parkinson's disease, in: Proceedings of the Disfluency in Spontaneous Speech (DiSS) Workshop 2023, 2023.</p>
      <p>[18] L. Moro-Velazquez, J. A. Gomez-Garcia, J. D. Arias-Londoño, N. Dehak, J. I. Godino-Llorente, Advances in Parkinson's disease detection and assessment using voice and speech: A review of the articulatory and phonatory aspects, Biomedical Signal Processing and Control 66 (2021) 102418.</p>
      <p>[19] A. Favaro, Y.-T. Tsai, A. Butala, T. Thebaud, J. Villalba, N. Dehak, L. Moro-Velázquez, Interpretable speech features vs. DNN embeddings: What to use in the automatic assessment of Parkinson's disease in multi-lingual scenarios, Computers in Biology and Medicine 166 (2023) 107559.</p>
      <p>[20] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, Conformer: Convolution-augmented transformer for speech recognition, Interspeech 2020 (2020).</p>
      <p>[21] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921.</p>
      <p>[22] V. N. Vitale, L. Schettino, F. Cutugno, Rich speech signal: exploring and exploiting end-to-end auto-</p>
      <p>Declaration on Generative AI: During the preparation of this work, the author(s) used Grammarly for grammar and spelling checks. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[22] [&#8230;]tion phenomena, in: <source>25th Annual Conference of the International Speech Communication Association (INTERSPEECH 2024)</source>, ISCA, <year>2024</year>, pp. <fpage>222</fpage>-<lpage>226</lpage>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[23] <string-name><given-names>W. J.</given-names> <surname>Levelt</surname></string-name>, <source>Speaking: From intention to articulation</source>, volume <volume>1</volume>, Cambridge/London: MIT Press, <year>1993</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[24] <string-name><given-names>J. R.</given-names> <surname>Landis</surname></string-name>, <string-name><given-names>G. G.</given-names> <surname>Koch</surname></string-name>, <article-title>The measurement of observer agreement for categorical data</article-title>, <source>Biometrics 33</source> (<year>1977</year>) <fpage>159</fpage>-<lpage>174</lpage>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[25] <string-name><given-names>A.</given-names> <surname>Prasad</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Jyothi</surname></string-name>, <article-title>How accents confound: Probing for accent information in end-to-end speech recognition systems</article-title>, <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>, <year>2020</year>, pp. <fpage>3739</fpage>-<lpage>3753</lpage>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[26] <string-name><given-names>V. N.</given-names> <surname>Vitale</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Cutugno</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Origlia</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Coro</surname></string-name>, Ex[&#8230;] (<year>2024</year>) <fpage>1</fpage>-<lpage>27</lpage>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[27] <string-name><given-names>A.</given-names> <surname>Graves</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Fern&#225;ndez</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Gomez</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name>, <article-title>Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks</article-title>, <source>in: Proceedings of the 23rd International Conference on Machine Learning</source>, <year>2006</year>, pp. <fpage>369</fpage>-<lpage>376</lpage>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[28] <string-name><given-names>A.</given-names> <surname>Graves</surname></string-name>, <article-title>Sequence transduction with recurrent neural networks</article-title>, <source>arXiv preprint arXiv:1211.3711</source> (<year>2012</year>).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[29] <string-name><given-names>N.</given-names> <surname>Srivastava</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Hinton</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Salakhutdinov</surname></string-name>, <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>, <source>Journal of Machine Learning Research 15</source> (<year>2014</year>) <fpage>1929</fpage>-<lpage>1958</lpage>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[30] <string-name><given-names>F.</given-names> <surname>Eyben</surname></string-name>, <string-name><given-names>K. R.</given-names> <surname>Scherer</surname></string-name>, <string-name><given-names>B. W.</given-names> <surname>Schuller</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Sundberg</surname></string-name>, et al., <article-title>The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing</article-title>, <source>IEEE Transactions on Affective Computing 7</source> (<year>2015</year>) <fpage>190</fpage>-<lpage>202</lpage>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[31] <string-name><given-names>F.</given-names> <surname>Eyben</surname></string-name>, <string-name><given-names>M.</given-names> <surname>W&#246;llmer</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Schuller</surname></string-name>, <article-title>openSMILE: the Munich versatile and fast open-source audio feature extractor</article-title>, <source>in: Proceedings of the 18th ACM International Conference on Multimedia</source>, <year>2010</year>, pp. <fpage>1459</fpage>-<lpage>1462</lpage>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[32] <string-name><given-names>J.</given-names> <surname>Dem&#353;ar</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Curk</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Erjavec</surname></string-name>, <string-name><given-names>&#268;.</given-names> <surname>Gorup</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Ho&#269;evar</surname></string-name>, et al., <article-title>Orange: Data mining toolbox in Python</article-title>, <source>Journal of Machine Learning Research</source> <volume>14</volume> (<year>2013</year>) <fpage>2349</fpage>-<lpage>2353</lpage>. URL: http://jmlr.org/papers/v14/demsar13a.html.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>