<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Cagliari, Italy. * Corresponding author. † These authors contributed equally. lschettino@unibz.it (L. Schettino); vincenzonorman.vitale@unina.it (V. N. Vitale); mmaffia@unior.it (M. Maffia)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using End-to-End Automatic Speech Recognisers' Internals to Model Disfluencies in Italian Patients with Early-stage Parkinson's Disease.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Loredana Schettino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Norman Vitale</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marta Maffia</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>Piazza Università, 1, 39100 Bolzano</addr-line>
          ,
          <country country="IT">Italia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Naples Federico II</institution>
          ,
          <addr-line>C.so Umberto I, 40, 80138 Napoli</addr-line>
          ,
          <country country="IT">Italia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Naples L'Orientale</institution>
          ,
          <addr-line>Via Chiatamone 61/62 - 80121 Napoli</addr-line>
          ,
          <country country="IT">Italia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Alterations in speakers' articulation and phonation are among the earliest symptoms of Parkinson's Disease (PD). However, clinical decision-making is currently based on holistic ratings of speech intelligibility, while studies on PD detection mostly involve highly complex and hardly interpretable models. This study builds upon previous work on Italian showing that the characteristics of disfluency phenomena may be considered an index of impairment at the very onset of the disease. It investigates whether even less complex (supervised) end-to-end speech recognition systems (E2E ASR) can model disfluency phenomena in Italian PD speech, and how this could support PD discrimination tasks. Exploiting the ability of E2E ASRs to progressively model features useful for discriminating between PD and non-PD speakers provides valuable insight into the ASRs' internal dynamics, as well as for the development of decision support systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Disfluencies</kwd>
        <kwd>Spontaneous Speech</kwd>
        <kwd>Parkinson's Disease</kwd>
        <kwd>Conformer</kwd>
        <kwd>Probing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Parkinson's Disease (PD) is a chronic neurodegenerative disorder steadily on the rise in terms of prevalence and incidence [1, 2]: more than 10 million individuals worldwide are affected by PD, mainly among the population aged 65 and over, and this number is expected to increase in demographically ageing societies. Caused by the deterioration or loss of dopaminergic neurons in the substantia nigra of the basal ganglia, PD is generally diagnosed based on clinical criteria, such as the medical history and physical/neurological examinations of the patient. Although several experimental studies have shown that speech and voice alterations are among the earliest symptoms of PD [3, 4, 5], this precious information is poorly used in clinical decision-making. In the Unified Parkinson's Disease Rating Scale (UPDRS), the rating tool used to assess the severity and monitor the progression of the disease [6], only one item (3.1) concerns the patient's speech, and it suggests an assessment based on the clinician's perception, considering above all intelligibility. The application of advanced and sustainable methods of acoustic data analysis could therefore be beneficial, especially in the diagnostic phase: while a cure for PD has yet to be found, early diagnosis is crucial for access to pharmacological and non-pharmacological interventions.</p>
      <p>However, developing machine learning tools for critical areas like early Parkinson's disease detection is significantly hampered by data scarcity. Acquiring the necessary data is both costly, requiring specialized linguistic and medical experts, and complex, given the inherently (and fortunately) small patient sample size. To overcome this limitation, our study explores the use of latent features encoded within pre-trained Automatic Speech Recognition (ASR) models. This approach explores the possibility of efficiently utilizing the limited available data by leveraging knowledge distilled from vast quantities of data not originally intended for this purpose. Additionally, to further enhance the procedure, it focuses on specific speech features that were observed to play a significant role in discriminating PD speech, even at the early stages of the disease, namely speech disfluency patterns [<xref ref-type="bibr" rid="ref14">7</xref>]. We believe that such a method optimises the use of the available data by integrating domain-specific and computational knowledge, and can thus support the development of decision support systems in data-scarce critical contexts, as exemplified by early Parkinson's detection.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-0">
        <title>2.1. Parkinson's Disease Speech and Disfluency Patterns</title>
        <p>The loss of dopamine in the central nervous system causes motor impairments and also affects laryngeal, respiratory, and articulatory functions, with about 90% of individuals with PD suffering from voice and speech disorders [8]. PD-related hypokinetic dysarthria includes a range of alterations: hypophonia (reduced voice volume), dysphonia (changes in voice quality), dysrhythmia, reduced speech rate, monopitch, and imprecise articulation [9, 10, 11, 12]. Parkinsonian speech is also commonly referred to as 'disfluent', although a detailed and comprehensive description of the specific characteristics of disrupted PD speech has not yet been provided [13]. Studies have mostly focused on specific types of disfluencies: in [14, 15], for example, stuttering-like disruptions (one-syllable word repetitions, sound and syllable repetitions, sound prolongations, and blocks) were observed in PD patients and healthy speakers, and greater disfluency percentages were found in pathological speech, supporting the relationship between stuttering and the functions of the basal ganglia. In a work on repetitive speech phenomena (both hyperfluent and dysfluent) [16], a positive correlation between the frequency of disfluencies and the duration of PD was found. However, studies have not always considered the functions of disfluency phenomena in PD speech, and they have mostly involved mild-to-severe, strongly disfluent patients in experimental protocols.</p>
        <p>A recent study conducted on Italian early-stage PD subjects and on spontaneous monological speech [17] showed that, even at the beginning of the pathology (when patients' speech is completely intelligible), the observation of disfluency phenomena can reveal some alteration in linguistic planning and processing: the speech of PD patients was found to differ from that of sex- and age-matched healthy speakers in terms of the higher frequency of repairs, the specific functions of hesitations (mostly used by PD patients for lexical retrieval), the location of disfluency phenomena (more often within words in PD speech than in the control group), and the duration of silent pauses (longer in PD than in healthy subjects).</p>
        <p>Furthermore, a more recent study showed that models based on interpretable features such as prosodic, linguistic, and cognitive descriptors can support the evaluation of speech deterioration in PD patients, whereas models based on non-interpretable features achieve higher detection accuracy [19]. These findings are often based on highly functional vocal paradigms, such as the sustained phonation of isolated segments, and involve rather complex models relying on non-interpretable features, or on features commonly observed as useful for PD speech discrimination because they become evident in the mid to advanced stages of the disease [18]. Nonetheless, a recent study showed that PD detection trials relying on a restricted number of meaningful features extracted from connected speech, rather than from isolated speech units, achieve accurate, as well as economical and interpretable, discrimination [<xref ref-type="bibr" rid="ref14">7</xref>]. Also, studies on the interpretability of DNN-based models, using probing techniques, provided evidence that even smaller and less complex models, such as Conformer-based ones [20], can model speech features, and that different features are encoded in DNN layers at different depths [21]. In particular, it was found that higher layers capture phone-identity and word-identity information, and that the last layer before the objective function even captures discriminating features of disfluency phenomena, more specifically filled pauses and prolongations [22].</p>
        <p>In substance, this study builds on the following findings from previous work on Italian PD speech:</p>
        <list list-type="bullet">
          <list-item><p>relying on natural speech material that results from the usual working dynamics of the vocal apparatus during phonation proves useful for discrimination [<xref ref-type="bibr" rid="ref14">7</xref>];</p></list-item>
          <list-item><p>peculiar uses of natural speech phenomena, such as disfluencies, may be considered an index of impairment at the very onset of the disease [17];</p></list-item>
          <list-item><p>less complex, supervised end-to-end speech recognition systems (E2E ASR) can model disfluency-related features useful for discrimination [22].</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Parkinson's Disease Automatic Detection</title>
        <p>Various studies have been devoted to developing automatic and objective tools to support PD diagnosis and the assessment of its severity [18]. Remarkable PD detection accuracy was achieved by leveraging non-interpretable embeddings obtained with Deep Neural Network (DNN)-based self-supervised models, e.g., x-vectors, Wav2Vec 2.0, HuBERT, and TRILLsson representations. On this basis, we investigate whether less complex (supervised) E2E ASR systems can model disfluency features in Italian PD speech and how well this could support PD discrimination tasks.</p>
        <sec id="sec-2-2-2">
          <title>3. Method</title>
          <p>3.1. Data and Annotation</p>
          <p>The study is based on the data described in [12]. It consists of approximately 40 minutes of monologic speech produced by 36 Italian native speakers from the Campania region: 18 participants with idiopathic non-demented PD (10 males, 8 females; 51–81 years of age, M = 65) and 18 age-matched Healthy Controls (HC; 10 males, 8 females; 54–77 years of age, M = 64). The patients were recruited from the Movement Disorders Unit of the First Division of Neurology at the University of Campania "Luigi Vanvitelli". PD participants had no prior history of language or speech disorders, had been diagnosed with Parkinson's disease within the past four years (since 2021), and showed no significant cognitive impairment, major or minor depression, or dysthymic disorder. All participants were asked to discuss the positive and negative aspects of the place they were living in during data collection. They were encouraged to speak in their usual, conversational tone and at a comfortable volume. Sociolinguistic information for each speaker was gathered via a questionnaire, and all participants provided written consent for the data collection process.</p>
          <p>The analysis focused on a series of so-called "disfluency phenomena", defined as speech management phenomena: speech material, e.g., repetitions, segmental prolongations, pauses, and fillers, that speakers can use to monitor and effectively manage the online processes of speech planning, coding, articulation, and reception [23]. The phenomena were identified and annotated based on their context of occurrence, following [17], and included the following phenomena specifically involved in the speech planning process (Cohen's k = 0.82, good agreement [24]):</p>
          <list list-type="bullet">
            <list-item><p>Prolongations (PRLs): marked prolongation of segmental material, e.g., laaa casa (theee house);</p></list-item>
            <list-item><p>Filled Pauses (FPs): non-lexical fillers, vocalizations and/or nasalizations, e.g., eeh, ehm, mhh;</p></list-item>
            <list-item><p>Silent Pauses (SPs): marked silences perceived as a hesitant pause in their context of occurrence;</p></list-item>
            <list-item><p>Lexicalized Filled Pauses (LFPs): lexical fillers that work as discourse markers involved in the coverage of planning time, e.g., diciamo, insomma, appunto... (well, let's say, so, ...);</p></list-item>
            <list-item><p>Repetitions (REPs): repetition of already uttered words or word fragments, e.g., di di (of of) or d- di (o- of).</p></list-item>
          </list>
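          <p>The annotation scheme above can be mirrored in a compact record per annotated interval; the following is a minimal sketch under stated assumptions (the class and field names are ours, not the authors'; we only assume time-stamped, speaker- and type-labelled intervals as in the corpus description):</p>
```python
from dataclasses import dataclass

# The five disfluency types annotated in the corpus.
DISFLUENCY_TYPES = ("PRL", "FP", "SP", "LFP", "REP")

@dataclass
class DisfluencySegment:
    speaker: str   # speaker identifier (hypothetical format)
    group: str     # "PD" or "HC"
    dtype: str     # one of DISFLUENCY_TYPES
    start: float   # onset time in seconds
    end: float     # offset time in seconds

    def duration(self):
        return self.end - self.start

# Example: a filled pause produced by a PD speaker.
seg = DisfluencySegment(speaker="S01", group="PD", dtype="FP",
                        start=12.30, end=12.85)
assert seg.dtype in DISFLUENCY_TYPES
print(round(seg.duration(), 2))
```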
          <list list-type="bullet">
            <list-item><p>selecting pre-trained models: two publicly available Conformer-based [20] models with different decoding components were selected, one with a Connectionist Temporal Classification (CTC) [27] decoder, a non-auto-regressive technique, and one with a Recurrent Neural Network Transducer (RNN-T), commonly known as Transducer, an auto-regressive speech transcription technique;</p></list-item>
            <list-item><p>building Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) classifiers whose inputs are the intermediate emissions of the considered model's encoder layers, combined with the appropriate sequence of labels based on the dataset annotation;</p></list-item>
            <list-item><p>evaluating the classifications relying on metrics oriented to results safety rather than performance.</p></list-item>
          </list>
        </sec>
        <sec id="sec-2-2-3">
          <title>3.2. Probing Approach</title>
          <p>Based on previous studies investigating E2E-ASR models' internal behaviour [21, 25, 22, 26], we employ a probing approach to investigate the pre-trained models' ability to capture speaker- and speech-related markers, i.e., characteristics associated with disfluent speech segments combined with PD biomarkers, and whether these features facilitate PD speech identification. The employed technique thus involves three steps: pre-trained model selection, probing-classifier construction, and safety-oriented evaluation.</p>
          <p>3.2.1. Data Preparation</p>
          <p>The considered dataset has been prepared based on a set of Praat TextGrid annotation files indicating, over the speech signal, the speaker and the type of disfluency. More specifically, PRLs, FPs, SPs, LFPs, and REPs were considered, resulting in a dataset of 850 segments. For each segment, the contextual information preceding and following the disfluency phenomenon has also been considered, giving each segment a length of 4 seconds.</p>
        </sec>
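          <p>The layer-wise probing described above relies on capturing each encoder layer's intermediate emissions. A generic sketch using forward hooks on a toy torch module follows; the real models are NeMo Conformer encoders, so everything here (the stand-in encoder, dimensions, and hook bookkeeping) is illustrative rather than the authors' implementation:</p>
```python
import torch
import torch.nn as nn

# Toy stand-in for a Conformer encoder: a stack of identical blocks.
# Forward hooks let us store each layer's emission for a segment.
encoder = nn.Sequential(*[nn.Linear(40, 40) for _ in range(4)])

captured = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Keep this layer's emission (detached from the graph).
        captured[layer_idx] = output.detach()
    return hook

for idx, layer in enumerate(encoder):
    layer.register_forward_hook(make_hook(idx))

# One 4 s segment at 40 ms per frame gives 100 frames.
frames = torch.randn(1, 100, 40)
encoder(frames)
print(sorted(captured), captured[0].shape)
```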
        <sec id="sec-2-2-4">
          <title>3.2.2. Pre-trained Models</title>
          <p>Pre-trained checkpoints (v1.6.0): https://huggingface.co/nvidia/stt_en_conformer_ctc_large and https://huggingface.co/nvidia/stt_en_conformer_transducer_large.</p>
          <p>Then, for each encoding layer of a considered pre-trained model, we extract a representation of the segments in the corresponding latent space, following the procedure described in Figure 1. In particular, for each segment, we obtain:</p>
          <list list-type="bullet">
            <list-item><p>a sequence of intermediate emissions, namely fragment representations in the corresponding layer's latent space; each fragment corresponds to a portion of the input signal whose duration depends on the considered model's characteristics;</p></list-item>
            <list-item><p>a sequence of labels associated with each fragment, indicating whether that fragment belongs to a disfluency or not and, if so, whether the speaker is PD or HC.</p></list-item>
          </list>
          <p>We selected two publicly available Conformer-based [20] pre-trained models built with the NVIDIA NeMo toolkit, both with a fragment duration of 40 ms and only differing in the decoding component. On the one hand, we considered a CTC decoder, one of the most popular decoding techniques: a non-auto-regressive speech transcription technique that collapses consecutive, identical transcription labels (characters, word pieces, etc.) into one label, unless a special label separates them. The result is a sequence of labels shorter than or equal to the input vector sequence. Being non-auto-regressive, it is also considered computationally effective, requiring less time and fewer resources for training and inference. On the other hand, we considered a Transducer, an auto-regressive speech transcription technique that overcomes CTC's limitations of being non-auto-regressive and subject to a limited label sequence length. The Transducer decoding technique can produce label-transcription sequences longer than the input vector sequence, and it models inter-dependencies among long-term transcription elements.</p>
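          <p>The CTC collapse rule described above can be illustrated in a few lines. This is a toy sketch with "-" standing in for the special separator label; the real models operate on character or word-piece labels:</p>
```python
def ctc_collapse(labels, blank="-"):
    """Collapse runs of identical labels, then drop the blank label."""
    out = []
    prev = None
    for lab in labels:
        if lab != prev:     # a repeated label is kept only once...
            out.append(lab)
        prev = lab
    # ...unless a blank separated the repetition: the blank breaks the
    # run during collapsing and is only removed afterwards.
    return [lab for lab in out if lab != blank]

# "t-t" survives as a genuine double "tt"; plain runs are merged.
print(ctc_collapse(list("cc-aaat-tto")))
```
          <p>Note that the output can never be longer than the input, which is exactly the CTC length limitation the Transducer lifts.</p>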
          <p>The resulting dataset consists of pairs of sequences of emissions (i.e., distilled features) and corresponding labels, identified by the model and the layer from which they were extracted. All experiments used NeMo version 1.21.0.</p>
          <p>A Transducer typically comprises two sub-decoding modules: one that forecasts the next transcription label based on the previous transcriptions (the prediction network), and one that combines the encoder and prediction-network outputs to produce a new transcription label (the joiner network). These features improve transcription speed and performance compared to CTC, while requiring more training and computational resources [28]. Also, the two techniques should provide different representations and contributions during the training phase (backpropagation) due to their different dynamics in forward propagation.</p>
          <p>Each classifier was trained for a maximum of 100 epochs using an Adam optimizer with an initial learning rate of 0.00001. To reduce the risk of overfitting, we introduced a dropout strategy for the LSTM/BiLSTM gates, which statistically excludes (with a 0.1 probability) one neuron and its weights during each training iteration [29]. Finally, an early-stopping mechanism was used to avoid wasting computational resources: the training phase ends if the validation loss does not decrease by a minimum of 0.001 during the last 20 epochs (the patience threshold).</p>
          <p>3.2.4. Evaluation</p>
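          <p>The early-stopping criterion stated above (patience of 20 epochs, minimum improvement of 0.001 on the validation loss) can be sketched as follows; the helper name and the example loss curve are ours:</p>
```python
def should_stop(val_losses, patience=20, min_delta=0.001):
    # Not enough epochs observed yet to apply the criterion.
    if patience >= len(val_losses):
        return False
    recent_best = min(val_losses[-patience:])
    earlier_best = min(val_losses[:-patience])
    # Stop if the last `patience` epochs did not improve the best
    # validation loss by at least `min_delta`.
    return not (earlier_best - recent_best >= min_delta)

# A loss curve that decreases for 10 epochs, then plateaus: training
# stops once no epoch in the last 20 improves the earlier best by 0.001.
curve = [1.0 - 0.01 * i for i in range(10)] + [0.90] * 21
print(should_stop(curve))
```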
          <p>Note that both considered pre-trained models rely on the same encoder architecture, but the Conformer-CTC model has 18 encoding layers, while the Conformer-Transducer encoder has 17 layers. This resulted in 35 different latent-space representations of the considered dataset.</p>
        </sec>
        <sec id="sec-2-2-5">
          <title>3.2.3. Classifiers</title>
          <p>The classifiers internally consist of an LSTM or a BiLSTM module, followed by a Feed-Forward Neural Network (FFNN). The choice of LSTM and BiLSTM modules is driven by their capacity to capture the temporal dependencies in the input, which fits well with our objective of modeling temporal dependencies in the latent-space representation of the speech signal.</p>
          <p>Since our aim is to investigate whether pre-trained E2E ASR models encode features useful for the identification of disfluency phenomena in PD speech, and whether and how they enable the discrimination between PD and HC speakers, we decided to rely on metrics oriented to results safety rather than performance. Note that a sample is classified as PD or HC only if the portion related to the disfluency is (1) detected and (2) at least 60% of its frames are correctly labeled as either PD or HC. The reliability of the approach is assessed by inspecting the confusion matrices of the best LSTM or BiLSTM, CTC-based and RNN-T-based classifiers, which provide a breakdown of correct and incorrect predictions for each class. The quantitative analysis was further supported by a qualitative exploration of the acoustic features emerging as relevant for discrimination, with reference to previous literature [18].</p>
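          <p>The sample-level decision rule stated above (a sample counts only if the disfluency is detected and at least 60% of its frames carry the correct speaker label) can be sketched as follows; the function name, label strings, and rejection categories are our illustrative choices:</p>
```python
def classify_sample(frame_preds, true_group):
    """Apply the safety-oriented decision rule to per-frame labels."""
    # Frames predicted as part of a disfluency (anything not "none").
    disfluent = [p for p in frame_preds if p != "none"]
    if not disfluent:
        return "undetected"       # rule (1): disfluency not detected
    correct = sum(1 for p in disfluent if p == true_group)
    if correct / len(disfluent) >= 0.6:
        return true_group         # rule (2): 60% of frames correct
    return "rejected"

# Per-40 ms-frame labels for one segment of a PD speaker.
preds = ["none", "PD", "PD", "PD", "HC", "none"]
print(classify_sample(preds, "PD"))
```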
            <p>To this aim, the eGeMAPSv02 [30] feature set from the OpenSmile toolkit [31] was selected as the basic feature set and inspected using the Orange software [32].</p>
            <p>Since the LSTM/BiLSTM hidden-layer size is a crucial parameter, we investigate the impact of three different hidden-layer sizes, namely 160, 320, and 640. An LSTM-based classifier processes a sequence of emission vectors and produces a new sequence of vectors whose size equals the hidden-layer size; the two sequences are aligned over time. At each time step, based on the LSTM/BiLSTM hidden-layer output, the FFNN produces a label indicating whether the considered input represents a disfluency segment, pronounced by a PD or HC speaker, or not. In summary, we train and evaluate LSTM- and BiLSTM-based classifiers/detectors for all combinations of hidden size, pre-trained model, and encoder layer, searching for evidence of disfluency-related pathological biomarker properties in the models' decisions; this resulted in about 200 models, implemented and trained with torch==2.2.1 and pytorch-lightning==2.0.7. Note that the minimum difference between consecutive time steps in the input is 40 ms, because the considered ASR models produce emissions at that rate; this defines the temporal sensitivity of our classifier/detector.</p>
            <p>4. Results and Discussion</p>
            <p>In this study, we considered two distinct ASR architectures, namely Conformer-CTC and Conformer-Transducer, selected for the differing capabilities of their decoding components. The two decoding techniques, i.e., CTC and Transducer, being respectively non-auto-regressive and auto-regressive by nature, are inherently capable of capturing diverse aspects of the speech signal, therefore influencing the encoding component in different ways. Figure 2 reports the precision of each trained classifier: it is interesting to observe how the layers closer to the input provide the higher precision, while the overall tendency, getting closer to the decoding component, is a constant reduction, which is likely related to the specific objective of the pre-trained models, namely, to provide a clean transcription.</p>
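          <p>The per-time-step classifier described above (an LSTM or BiLSTM whose hidden states feed an FFNN that labels every frame) can be sketched in torch; the class name, label inventory, and emission dimension are illustrative assumptions, while the hidden sizes match those explored here (160, 320, 640):</p>
```python
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    """LSTM/BiLSTM followed by a feed-forward layer emitting one label
    per time step (e.g. none / PD / HC). Names are ours."""
    def __init__(self, emission_dim, hidden_size=320,
                 bidirectional=False, n_labels=3):
        super().__init__()
        self.rnn = nn.LSTM(emission_dim, hidden_size,
                           batch_first=True, bidirectional=bidirectional)
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.ffnn = nn.Linear(out_dim, n_labels)

    def forward(self, emissions):        # (batch, time, emission_dim)
        hidden, _ = self.rnn(emissions)  # aligned over time with input
        return self.ffnn(hidden)         # (batch, time, n_labels)

clf = ProbingClassifier(emission_dim=512, hidden_size=160,
                        bidirectional=True)
logits = clf(torch.randn(2, 100, 512))   # two 4 s segments, 40 ms frames
print(logits.shape)
```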
            <p>During the training phase, the considered corpus was split into train, validation, and test sets using 60%, 20%, and 20% proportions, while ensuring that these sets did not share the same speakers.</p>
            <p>However, the model that seems to provide the most informative and stable latent representation is the Conformer-CTC (orange line). Panel captions: (a) LSTM with hidden size 320 trained on the dataset represented in the latent space of Conformer-CTC's encoding layer #2; (b) BiLSTM with hidden size 160 trained on the dataset represented in the latent space of Conformer-CTC's encoding layer #2.</p>
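            <p>A speaker-disjoint 60/20/20 split, as described above, can be obtained by partitioning the speakers rather than the segments; the helper below is a sketch with hypothetical names, not the authors' code:</p>
```python
import random

def speaker_disjoint_split(samples, train=0.6, val=0.2, seed=0):
    """Split (speaker_id, segment) pairs 60/20/20 by speaker, so the
    three sets never share a speaker."""
    speakers = sorted({spk for spk, _ in samples})
    random.Random(seed).shuffle(speakers)
    n_train = int(len(speakers) * train)
    n_val = int(len(speakers) * val)
    groups = {
        "train": set(speakers[:n_train]),
        "val": set(speakers[n_train:n_train + n_val]),
        "test": set(speakers[n_train + n_val:]),
    }
    return {name: [s for s in samples if s[0] in spks]
            for name, spks in groups.items()}

# 10 speakers with 3 segments each: 6/2/2 speakers, 18/6/6 segments.
data = [(f"S{i:02d}", f"seg{j}") for i in range(10) for j in range(3)]
splits = speaker_disjoint_split(data)
print({k: len(v) for k, v in splits.items()})
```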
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion</title>
      <p>The main findings highlight that focusing on speech
correlates such as disfluency phenomena provides a
convenient choice to enhance the development of decision
support systems. The latent representation from the
intermediate encoding layer was shown to be highly
informative and quite reliable for the classification of PD
and HC speakers. In addition, the CTC decoder seemed
to provide slightly more stable performance in this task,
probably due to its non-autoregressive nature. Future
steps will involve a comparison of performance with
different pre-trained models and classification architectures.</p>
      <p>Indeed, since disfluency phenomena encompass different types of phenomena (i.e., textual phenomena, like lexical fillers and repetitions, and phonetic phenomena, like non-lexical fillers and prolongations), different approaches may perform better on specific types.</p>
      <p>Also, the analysis led to observing alterations of acoustic features (not yet perceptually noticeable) revealing the onset of PD-related motor impairment. It is worth noticing that some of the considered disfluency phenomena, namely prolongations and filled pauses, consist of prolonged vocalisations, which are similar to the sustained vowels traditionally used in highly controlled studies on PD
speech. Thus, they provide a valuable integration of data efficacy and ecology.</p>
      <p>References</p>
      <p>[1] W. A. Rocca, The burden of Parkinson's disease: a worldwide perspective, The Lancet Neurology 17 (2018) 928–929.</p>
      <p>[2] J. D. Steinmetz, K. M. Seeher, N. Schiess, E. Nichols, B. Cao, C. Servili, V. Cavallera, E. Cousin, H. Hagins, M. E. Moberg, Global, regional, and national burden of disorders affecting the nervous system, 1990–2021: a systematic analysis for the Global Burden of Disease Study 2021, The Lancet Neurology 23 (2024) 344–381.</p>
      <p>[3] S. Skodda, Analysis of voice and speech performance in Parkinson's disease: a promising tool for the monitoring of disease progression and differential diagnosis, Neurodegenerative Disease Management 2 (2012) 535–545.</p>
      <p>[4] J. Rusz, R. Cmejla, T. Tykalova, H. Ruzickova, J. Klempir, V. Majerova, J. Picmausova, J. Roth, E. Ruzicka, Imprecise vowel articulation as a potential early marker of Parkinson's disease: Effect of speaking task, The Journal of the Acoustical Society of America 134 (2013) 2171–2181.</p>
      <p>[5] A. Favaro, L. Moro-Velázquez, A. Butala, C. Motley, T. Cao, R. D. Stevens, J. Villalba, N. Dehak, Multilingual evaluation of interpretable biomarkers to represent language and speech patterns in Parkinson's disease, Frontiers in Neurology 14 (2023) 1142642.</p>
      <p>[6] C. G. Goetz, B. C. Tilley, S. R. Shaftman, G. T. Stebbins, S. Fahn, P. Martinez-Martin, W. Poewe, C. Sampaio, M. B. Stern, R. Dodel, Movement Disorder Society-sponsored revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS): scale presentation and clinimetric testing results, Movement Disorders: official journal of the Movement Disorder Society 23 (2008) 2129–2170.</p>
      <p>[<xref ref-type="bibr" rid="ref14">7</xref>] M. Maffia, L. Schettino, V. N. Vitale, Automatic detection of Parkinson's disease with connected speech acoustic features: towards a linguistically interpretable approach, in: Proceedings of the 9th Italian Conference on Computational Linguistics, CEUR Workshop Proceedings, 2023.</p>
      <p>[8] L. O. Ramig, C. Fox, S. Sapir, Speech treatment for Parkinson's disease, Expert Review of Neurotherapeutics 8 (2008) 297–309.</p>
      <p>[9] F. L. Darley, A. E. Aronson, J. R. Brown, Clusters of deviant speech dimensions in the dysarthrias, Journal of Speech and Hearing Research 12 (1969) 462–496.</p>
      <p>[10] J. R. Orozco-Arroyave, F. Hönig, J. D. Arias-Londoño, J. F. Vargas-Bonilla, S. Skodda, J. Rusz, E. Nöth, Automatic detection of Parkinson's disease from words uttered in three different languages, in: Fifteenth Annual Conference of the International Speech Communication Association, 2014.</p>
      <p>[11] H. Ackermann, W. Ziegler, Articulatory deficits in parkinsonian dysarthria: an acoustic analysis, Journal of Neurology, Neurosurgery &amp; Psychiatry 54 (1991) 1093–1098.</p>
      <p>[12] M. Maffia, R. De Micco, M. Pettorino, M. Siciliano, A. Tessitore, A. De Meo, Speech rhythm variation in early-stage Parkinson's disease: a study on different speaking tasks, Frontiers in Psychology 12 (2021) 668291.</p>
      <p>[13] A. M. Goberman, M. Blomgren, Parkinsonian speech disfluencies: effects of L-dopa-related fluctuations, Journal of Fluency Disorders 28 (2003) 55–70.</p>
      <p>[14] A. M. Goberman, M. Blomgren, E. Metzger, Characteristics of speech disfluency in Parkinson disease, Journal of Neurolinguistics 23 (2010) 470–478.</p>
      <p>[15] F. S. Juste, F. C. Sassi, J. B. Costa, C. R. F. de Andrade, Frequency of speech disruptions in Parkinson's disease and developmental stuttering: a comparison among speech tasks, PLoS ONE 13 (2018) e0199054.</p>
      <p>[16] T. Benke, C. Hohenstein, W. Poewe, B. Butterworth, Repetitive speech phenomena in Parkinson's disease, Journal of Neurology, Neurosurgery &amp; Psychiatry 69 (2000) 319–324.</p>
      <p>[17] L. Schettino, M. Maffia, R. De Micco, A. Tessitore, Disfluency and speech management in Italian patients with early-stage Parkinson's disease, in: Proceedings of the Disfluency in Spontaneous Speech (DiSS) Workshop 2023, 2023.</p>
      <p>[18] L. Moro-Velazquez, J. A. Gomez-Garcia, J. D. Arias-Londoño, N. Dehak, J. I. Godino-Llorente, Advances in Parkinson's disease detection and assessment using voice and speech: A review of the articulatory and phonatory aspects, Biomedical Signal Processing and Control 66 (2021) 102418.</p>
      <p>[19] A. Favaro, Y.-T. Tsai, A. Butala, T. Thebaud, J. Villalba, N. Dehak, L. Moro-Velázquez, Interpretable speech features vs. DNN embeddings: What to use in the automatic assessment of Parkinson's disease in multi-lingual scenarios, Computers in Biology and Medicine 166 (2023) 107559.</p>
      <p>[20] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, Conformer: Convolution-augmented transformer for speech recognition, Interspeech 2020 (2020).</p>
      <p>[21] A. Pasad, J.-C. Chou, K. Livescu, Layer-wise analysis of a self-supervised speech representation model, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), IEEE, 2021, pp. 914–921.</p>
      <p>[22] V. N. Vitale, L. Schettino, F. Cutugno, Rich speech signal: exploring and exploiting end-to-end auto-</p>
      <p>Declaration on Generative AI: During the preparation of this work, the author(s) used Grammarly for grammar and spelling checks. After using this tool, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[22] [&#8230;]tion phenomena, in: <source>25th Annual Conference of the International Speech Communication Association (INTERSPEECH 2024)</source>, ISCA, <year>2024</year>, pp. <fpage>222</fpage>-<lpage>226</lpage>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[23] <string-name><given-names>W. J.</given-names> <surname>Levelt</surname></string-name>, <source>Speaking: From intention to articulation</source>, volume <volume>1</volume>, Cambridge/London: MIT Press, <year>1993</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[24] <string-name><given-names>J. R.</given-names> <surname>Landis</surname></string-name>, <string-name><given-names>G. G.</given-names> <surname>Koch</surname></string-name>, <article-title>The measurement of observer agreement for categorical data</article-title>, <source>Biometrics 33</source> (<year>1977</year>) <fpage>159</fpage>-<lpage>174</lpage>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[25] <string-name><given-names>A.</given-names> <surname>Prasad</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Jyothi</surname></string-name>, <article-title>How accents confound: Probing for accent information in end-to-end speech recognition systems</article-title>, <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>, <year>2020</year>, pp. <fpage>3739</fpage>-<lpage>3753</lpage>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[26] <string-name><given-names>V. N.</given-names> <surname>Vitale</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Cutugno</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Origlia</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Coro</surname></string-name>, Ex[&#8230;] (<year>2024</year>) <fpage>1</fpage>-<lpage>27</lpage>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[27] <string-name><given-names>A.</given-names> <surname>Graves</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Fern&#225;ndez</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Gomez</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Schmidhuber</surname></string-name>, <article-title>Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks</article-title>, <source>in: Proceedings of the 23rd International Conference on Machine Learning</source>, <year>2006</year>, pp. <fpage>369</fpage>-<lpage>376</lpage>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[28] <string-name><given-names>A.</given-names> <surname>Graves</surname></string-name>, <article-title>Sequence transduction with recurrent neural networks</article-title>, <source>arXiv preprint arXiv:1211.3711</source> (<year>2012</year>).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[29] <string-name><given-names>N.</given-names> <surname>Srivastava</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Hinton</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Krizhevsky</surname></string-name>, <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Salakhutdinov</surname></string-name>, <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>, <source>Journal of Machine Learning Research 15</source> (<year>2014</year>) <fpage>1929</fpage>-<lpage>1958</lpage>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[30] <string-name><given-names>F.</given-names> <surname>Eyben</surname></string-name>, <string-name><given-names>K. R.</given-names> <surname>Scherer</surname></string-name>, <string-name><given-names>B. W.</given-names> <surname>Schuller</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Sundberg</surname></string-name>, et al., <article-title>The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing</article-title>, <source>IEEE Transactions on Affective Computing 7</source> (<year>2015</year>) <fpage>190</fpage>-<lpage>202</lpage>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[31] <string-name><given-names>F.</given-names> <surname>Eyben</surname></string-name>, <string-name><given-names>M.</given-names> <surname>W&#246;llmer</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Schuller</surname></string-name>, <article-title>openSMILE: the Munich versatile and fast open-source audio feature extractor</article-title>, <source>in: Proceedings of the 18th ACM International Conference on Multimedia</source>, <year>2010</year>, pp. <fpage>1459</fpage>-<lpage>1462</lpage>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[32] <string-name><given-names>J.</given-names> <surname>Dem&#353;ar</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Curk</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Erjavec</surname></string-name>, <string-name><given-names>&#268;.</given-names> <surname>Gorup</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Ho&#269;evar</surname></string-name>, et al., <article-title>Orange: Data mining toolbox in Python</article-title>, <source>Journal of Machine Learning Research</source> <volume>14</volume> (<year>2013</year>) <fpage>2349</fpage>-<lpage>2353</lpage>. URL: http://jmlr.org/papers/v14/demsar13a.html.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>