<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sensitivity of Syllable-Based ASR Predictions to Token Frequency and Lexical Stress</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Vietti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico De Cristofaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Picciau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano, Libera Università di Bolzano</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Automatic Speech Recognition (ASR) systems based on neural networks achieve strong results, but it remains unclear which linguistic features and representations the models leverage to perform recognition. In our study, we used phonological syllables as tokens to fine-tune an end-to-end ASR model, owing to their relevance as linguistic units. Furthermore, this strategy allowed us to keep track of different types of linguistic features characterizing the tokens. The analysis of the transcriptions generated by the model reveals that factors such as token frequency and lexical stress have a variable impact on the prediction strategies adopted by the ASR system.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Speech Recognition</kwd>
        <kwd>Syllable</kwd>
        <kwd>Phonology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The syllable is crucial in the process of spoken word recognition. It serves as an integral component within the prosodic system because it encompasses both the traditional segmental and the suprasegmental levels, facilitating the extraction of lexical and syntactic structures from acoustic information [1, 2]. Specifically, the syllable is the linguistic unit in which crucial information for speech segmentation, rhythmic patterns, and lexical access is encoded [3]. In the field of Automatic Speech Recognition (ASR), the graphemic segment has traditionally been the primary unit of processing. However, recent studies endorse the use of syllables, or phonetic units of similar duration, as an alternative strategy [4, 5, 6]. In the latest ASR research employing Transformer-based neural models, the role of syllables is investigated both as tokens for word recognition and as components influencing internal speech representations within neural networks [7, 8, 9].</p>
      <p>In our study, a neural ASR model was trained to process and recognize phonological syllables, integrating them into word structures. Our goal is to conduct a linguistic analysis of the output of syllabic processing by the speech recognition system. Through fine-tuning a large acoustic model, the study mapped speech signals onto phonological transcriptions segmented into syllables and words. The primary objective of our linguistic analysis is to test the effect of syllable token frequency and lexical stress on the accuracy of the output neural representation. To understand how the ASR processes syllables and words differently, we developed a fine-grained linguistic annotation system. This approach was essential to move beyond the limitations of purely numerical metrics such as Word Error Rate or, in our context, Token Error Rate. By employing this system, we could accurately categorize prediction types and link them to specific linguistic aspects of speech. We used Multiple Correspondence Analysis and Multinomial Logistic Regression to explore and uncover patterns that relate the neural network's output behavior to the linguistic factors.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. * Corresponding author. † These authors contributed equally. Alessandro.Vietti@unibz.it (A. Vietti); dodecristofaro@unibz.it (D. De Cristofaro); sapicciau@unibz.it (S. Picciau). ORCID 0000-0002-4166-540X (A. Vietti). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <sec id="sec-1-m">
        <title>2. Methodology</title>
        <sec id="sec-1-m-1">
          <title>2.1. Data preparation and experimental setup</title>
          <p>The preparation of the experiment started with the collection of the data to fine-tune the pre-trained Microsoft model WavLM-large [10]. Our dataset consists of approximately 30 hours of Italian data from the crowd-sourced corpus Common Voice [11], using 6,500 samples (5,000 for training, 500 for testing, and 1,000 for validation). The total Italian subset in Common Voice 13.0 comprises 6,881 speakers and spans approximately 343 hours of recorded speech. Since we are interested in observing the role that some phonological aspects might play in the recognition process, we used WebMAUS [12] to obtain X-SAMPA transcriptions of the corpus. In addition, we forced the model to recognize phonological syllables as tokens, instead of automatically generated subwords based on probability, frequency and likelihood [13]. We designed a custom tokenizer that relies on the Maximal Onset Principle [14] and the Sonority Sequencing Principle [15] and that, exceptionally, considers /s/+stop clusters and geminates as part of the syllable onset [16, 17]. In order to observe the placement of the recognized tokens and word boundaries in detail, we set the output format of the model so that tokens are separated by blank spaces and words are separated by pipes, as can be seen in example (1):</p>
          <p>(1) il | vwO to | a sso lu to |</p>
        </sec>
        <sec id="sec-1-m-2">
          <title>2.2. Creation of the database</title>
          <p>Once we tested the model and obtained the predictions, we extracted a sample of 300 pairs of reference and predicted sentences (Rs and Ps, respectively). The detailed observation of the pairs allowed us to define a set of prediction types. Word-level prediction types are those that affect canonical word boundaries and consist of three categories: merged words, meaning two reference words recognized as one; divided words, consisting of a single reference word recognized as two or more words; and token movement, namely the change of a reference token's position across adjacent word boundaries. At a token level, prediction types represent deviations in terms of token insertion, substitution and deletion, as well as correctly recognized tokens. We then designed a set of labels (prediction tags, PT; see Appendix A.1) representing the prediction types to annotate the tokens of our dataset.</p>
        </sec>
      </sec>
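      <p>As an illustration of the tokenization strategy described in Section 2.1, the following sketch is a toy re-implementation, not the paper's actual tokenizer: it applies the Maximal Onset Principle with the /s/+stop and geminate exceptions to an X-SAMPA-like string. The vowel and onset inventories are simplified assumptions.</p>
      <preformat>
```python
# Illustrative sketch, not the authors' tokenizer: a toy syllabifier in the
# spirit of Section 2.1, applying the Maximal Onset Principle with the
# /s/+stop and geminate exceptions. The vowel and onset inventories below
# are simplified assumptions, not the paper's full X-SAMPA grammar.
VOWELS = "aeiouEO"

def legal_onset(cl):
    """Rough onset legality: single consonants, geminates, /s/+stop and
    obstruent+liquid clusters count as licit onsets (cf. [14-17])."""
    if len(cl) in (0, 1):
        return True
    if len(cl) == 2:
        a, b = cl
        if a == "s" and b in "pbtdkg":    # /s/+stop exception
            return True
        if a == b:                        # geminate kept in the onset
            return True
        return a in "pbtdkgfv" and b in "rl"
    if len(cl) == 3:
        return (cl[0] == "s" or cl[0] == cl[1]) and legal_onset(cl[1:])
    return False

def syllabify(word):
    """Place each intervocalic boundary so that the following syllable
    receives the longest legal onset (Maximal Onset Principle)."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not nuclei:
        return [word]
    sylls, start = [], 0
    for a, b in zip(nuclei, nuclei[1:]):
        cluster = word[a + 1:b]
        k = len(cluster)                  # default: whole cluster to the coda
        for j in range(len(cluster) + 1):
            if legal_onset(cluster[j:]):
                k = j                     # longest legal onset found
                break
        boundary = a + 1 + k
        sylls.append(word[start:boundary])
        start = boundary
    sylls.append(word[start:])
    return sylls
```
      </preformat>
      <p>Applied to "assoluto", this toy grammar reproduces the segmentation "a sso lu to" shown in example (1), with the geminate /ss/ assigned to the onset.</p>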
      <p>The labels consist of a sequence of affixes indicating the detected recognition events. Word-level affixes are mer, div and mv, plus, in the case of token movement, forw or back to mark the direction of the shift; token-level affixes are ins, sub, del and eq. Lastly, the suffix syl or word indicates whether the phenomenon regards an individual token or the whole word. Examples of our annotation scheme are given in Appendix A.1.</p>
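      <p>The semi-automated PT labeling described in Section 2.2 hinges on a Levenshtein-based word similarity check. A minimal illustrative sketch follows (our code, not the authors'; the 0.5 threshold is an invented placeholder, as the paper does not report its value):</p>
      <preformat>
```python
# Illustrative sketch: Levenshtein-based similarity used to confirm or
# dismiss matches between reference words (Rw) and predicted words (Pw).
# The threshold value is a placeholder assumption.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def similarity(rw, pw):
    """Normalized similarity in [0, 1] between Rw and Pw."""
    longest = max(len(rw), len(pw)) or 1
    return 1 - levenshtein(rw, pw) / longest

def word_match(rw, pw, threshold=0.5):
    """Confirm a word match when similarity reaches the threshold."""
    return similarity(rw, pw) >= threshold
```
      </preformat>
      <p>When a pair falls below the threshold, the algorithm compares Rw against the adjacent Pws (and vice versa) before labelling a word as inserted or deleted.</p>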
      <p>Given our dataset size of approximately 5,900 tokens, a manual annotation of each entry would have been extremely time-consuming. We therefore designed an algorithm that compares reference and predicted tokens (Rt and Pt, respectively) in order to obtain semi-automated PT labeling. The algorithm works as follows: first, it attempts to identify the correspondences between reference and predicted words (Rw, Pw) despite potential mismatches caused by prediction types affecting word boundaries. Each pair of sentences is split into words, and a similarity function based on Levenshtein distance is used to confirm or dismiss word matches. If the similarity score is lower than the established threshold, it indicates a mismatch. When this occurs, similarity is calculated between Rw and the adjacent Pws, and vice versa. If a (partial) match is found, the word-level PT is appended to the corresponding tokens; otherwise, unmatched words are labelled as inserted (when not found in Rs) or deleted (when not found in Pt). Once word-level matches are identified, the algorithm proceeds to compare each Rt and Pt within Rw and Pw respectively, and then assigns the corresponding PT at the token level. The mechanism for finding token matches within words and assigning token-level PTs is analogous to the one described above. The implementation of this algorithm allowed us to annotate most of the dataset automatically. However, many entries required manual intervention, as in cases of assimilation or of predictions of very low quality, which resulted in significant mismatches. Lastly, we added some phonological information about each token to our dataset in order to conduct our linguistic analysis. We included the relative frequency of Rt in the whole training dataset and lexical stress, as well as the presence of the token in the training vocabulary, the POS of Rw, and the speech rate of Rs. However, only the first two variables were taken into consideration for the statistical analysis in this work.</p>
      <sec id="sec-1-r">
        <title>3. Results</title>
        <sec id="sec-1-r-1">
          <title>3.1. Explorative analysis</title>
          <p>To analyze our prediction database, we first looked at the distribution of prediction types. Next, we used Multiple Correspondence Analysis (MCA) to explore the relationships between prediction types, token frequency, presence in the training vocabulary, and lexical stress.</p>
          <p>The syllable-based fine-tuned ASR model showed a high degree of accuracy in prediction, with only 28% of tokens showing notable recognition errors, making eq_syl the most frequent category.</p>
          <p>The following figures show the detailed distribution of marked prediction types. Our structured labeling system allows us to separately examine token-level phenomena and those affecting sentence structure through word-boundary errors. Figure 1 highlights that substitution is the most common token-level operation, followed by deletion and insertion. This means that most incorrectly recognized tokens still appear in the model's hypothesized transcription. However, token deletions and insertions (including those of entire words, such as prepositions, determiners, or auxiliary verbs) lead to more significant recognition discrepancies. It should be noted that the use of automatically generated phonological transcriptions as references increases the number of substitutions, owing to speech variability in the corpus.</p>
          <p>Figure 2 shows the distribution of operation/equality tags affecting canonical word boundaries. Merging is the most frequent process, involving 401 tokens, followed by divided words with 206 occurrences and movement of single tokens with 48 instances. Unlike the other categories, the movement label applies to single tokens. Tokens in merged and divided words were mostly recognized correctly, with substitution being the second most common operation. Token deletion occurs more often in merged words, while token insertion is higher in divided words. For moved tokens, the distribution of equal and substituted tokens is nearly identical. Deletions and insertions do not apply to moved tokens, since these can be neither missing from nor added to the prediction.</p>
          <p>Figure 3 shows the Multiple Correspondence Analysis results obtained with the FactoMineR R package. This analysis reveals patterns between prediction types (event_syllable), token frequency (freq_tok_R_cat), presence in the training vocabulary (in_vocab_R), and lexical stress (stress_R). The relative frequency of tokens in the dataset was discretized into three levels using quantiles, to obtain a uniform distribution of tokens across the three categories: from zero to one-third of tokens is "low frequency" (0-0.5%), from one-third to two-thirds is "mid frequency" (0.5-2.23%), and from two-thirds to one is "high frequency" (2.23-6.87%). Part of speech (POS) and syllable type (tok_type_R) were added later as supplementary variables to guide the linguistic interpretation of the analysis. Insertion, being the least frequent operation, and complex syllable types (like CCVCC) were excluded due to their low frequency.</p>
        </sec>
      </sec>
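      <p>The tercile-based discretization of token frequency can be sketched as follows (illustrative only; the function name and data layout are our assumptions, mirroring the low/mid/high split of freq_tok_R_cat rather than the authors' actual code):</p>
      <preformat>
```python
# Illustrative sketch (assumed data layout, not the authors' pipeline):
# relative frequency of each token over the training tokens, discretized
# into tercile-based low/mid/high categories as done for freq_tok_R_cat.
from collections import Counter

def frequency_categories(train_tokens, cuts=(1 / 3, 2 / 3)):
    counts = Counter(train_tokens)
    total = sum(counts.values())
    rel = {tok: 100.0 * c / total for tok, c in counts.items()}  # percent
    ranked = sorted(rel, key=rel.get)          # tokens by rising frequency
    cats = {}
    for rank, tok in enumerate(ranked):
        q = (rank + 1) / len(ranked)           # empirical quantile of the token
        if q > cuts[1]:
            cats[tok] = "high"
        elif q > cuts[0]:
            cats[tok] = "mid"
        else:
            cats[tok] = "low"
    return rel, cats
```
      </preformat>
      <p>Splitting at the terciles of the ranked token list, rather than at fixed frequency values, is what yields a roughly uniform number of token types per category.</p>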
      <p>MCA is a dimensionality reduction technique for
categorical variables, so the significance of the dimensions
is derived from the distribution of the levels of the
variables projected onto the plane. Interestingly, the top
section shows that unstressed high-frequency tokens (over
2.23%), mainly subordinating conjunctions and
determiners, are associated with deletion. The bottom-left section
includes mid-frequency items (0.5% - 2.23%) with
simple syllabic structures (CV) that are typically recognized
correctly. Tokens with low frequency or which are
absent from the training vocabulary are on the right side
of the MCA chart. These less frequent, complex syllable
tokens, often occurring in proper nouns and numerals,
are typically handled with substitution.</p>
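      <p>The paper runs MCA with FactoMineR in R. As a language-neutral illustration, a bare-bones MCA over the one-hot indicator matrix can be obtained via an SVD of the standardized residuals; this is a stand-in sketch under our own assumptions, not the authors' pipeline, and the variable tuples are invented:</p>
      <preformat>
```python
# Hedged sketch: minimal multiple correspondence analysis via SVD of the
# centered indicator (one-hot) table, standing in for FactoMineR's MCA().
import numpy as np

def mca_coordinates(rows, n_dims=2):
    """rows: list of tuples of categorical levels (one tuple per token).
    Returns principal coordinates of the row points."""
    # build the complete disjunctive (one-hot) table
    levels = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    col = {lv: k for k, lv in enumerate(levels)}
    Z = np.zeros((len(rows), len(levels)))
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            Z[i, col[(j, v)]] = 1.0
    # correspondence matrix and standardized residuals
    P = Z / Z.sum()
    r = P.sum(axis=1, keepdims=True)           # row masses
    c = P.sum(axis=0, keepdims=True)           # column masses
    S = (P - r @ c) / np.sqrt(r @ c)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # row principal coordinates: D_r^(-1/2) U Sigma, truncated to n_dims
    return (U * sv)[:, :n_dims] / np.sqrt(r)
```
      </preformat>
      <p>Tokens with identical category profiles land on the same point of the plane, which is why levels such as "high frequency" and "deletion" can be read off by proximity in Figure 3.</p>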
      <sec id="sec-1-1">
        <title>3.2. Multinomial analysis</title>
        <p>To statistically validate the findings from the MCA (Figure 3), we conducted a multinomial logistic regression analysis using the nnet R library. The model examines the interaction between token frequency and lexical stress and, in this analysis, expresses the regression coefficients as odds (instead of logits) (see Appendix A.2). By looking at the plots of the model predictions and jointly evaluating the pairwise comparisons from the two tables (see Appendix A.4 and A.3), we can obtain a clearer interpretation of the results of the regression analysis. In Figure 4, we notice that when the prediction is equal to the reference, token frequency has a significant effect in the case of stressed syllables, whereas it appears to be less statistically relevant for unstressed syllables. Additionally, the difference between the presence and the absence of lexical accent becomes significant as the frequency increases from low to mid to high. Regarding substitution, the patterns seem complementary to those observed in the matching of reference and prediction (i.e., in the equal plot). When syllables have a low frequency in the dataset, the probability that they are replaced with other syllabic tokens significantly increases. Although we have not explored which syllabic tokens or types they are replaced with, and based on what criteria, it is safe to assume that this may be due to phonetic similarity. Specifically, there is a significant difference only between low frequency and the combined mid and high frequencies, for both stressed and unstressed syllables. As for deletion, the regression coefficients reveal that the probability of deletion of unstressed syllables increases with frequency, but only in the transition from low to mid frequency, with no further increase from mid to high frequency. For stressed syllables, the neutralization of the frequency effect is confirmed by the analysis of the coefficients. A quick exploration of the most frequently deleted mid-frequency syllables shows that the preposition 'a' and V syllables in word-initial position are the most likely to be deleted.</p>
        <sec id="sec-1-c">
          <title>4. Conclusions and future work</title>
          <p>This study provides insights into the role of syllables in ASR performance, particularly when integrating phonological information into the recognition process. By fine-tuning a neural ASR model to process and recognize phonological syllables, we were able to conduct a detailed linguistic analysis of its output. Our findings indicate that syllable frequency and lexical stress significantly impact ASR accuracy. Specifically, stressed syllables are more accurately recognized than unstressed ones, especially as frequency increases. Contrary to our expectation, among the low-frequency syllables, stressed tokens are more prone to substitution, whereas mid-frequency unstressed ones are more susceptible to deletion. This demonstrates the neural model's sensitivity both to distributional information in the dataset and to phonological information, and highlights the model's ability to detect varying syllabic prominence at the lexical level within the signal. As future work, we plan to include other linguistic factors as independent variables to refine our analysis. An interesting approach is to evaluate the impact of unstressed syllables and specific parts of speech by conducting an analysis exclusively on content words. Furthermore, we aim to investigate syllable substitution in detail, in relation to token frequency and phonetic similarity, to compare the weight of each factor whenever this strategy is adopted to deal with low-frequency tokens. In conclusion, our study showed the influence of token frequency and prominence on ASR predictions, while demonstrating that complex computational tools, like modern neural networks, can be effectively utilized by linguists to simulate and test linguistically relevant hypotheses.</p>
        </sec>
        <sec id="sec-1-refs">
          <title>References</title>
          <p>[1] M. E. Beckman, The parsing of prosody, Language and Cognitive Processes 11 (1996) 17-68. URL: https://doi.org/10.1080/016909696387213. doi:10.1080/016909696387213.</p>
          <p>[2] S. Hawkins, R. Smith, Polysp: A polysystemic, phonetically-rich approach to speech understanding, Italian Journal of Linguistics 13 (2001) 99-189.</p>
          <p>[3] J. M. McQueen, L. Dilley, Prosody and spoken-word recognition, in: C. Gussenhoven, A. Chen (Eds.), The Oxford Handbook of Language Prosody, 2021, pp. 508-521.</p>
          <p>[4] S. Greenberg, Speaking in shorthand: a syllable-centric perspective for understanding pronunciation variation, Speech Communication 29 (1999) 159-176.</p>
          <p>[5] N. Morgan, H. Bourlard, H. Hermansky, Automatic speech recognition: An auditory perspective, in: S. Greenberg, W. A. Ainsworth, A. N. Popper, R. R. Fay (Eds.), Speech Processing in the Auditory System, Springer, New York, 2004, pp. 309-338.</p>
          <p>[6] G. Coro, F. V. Massoli, A. Origlia, F. Cutugno, Psycho-acoustics inspired automatic speech recognition, Computers &amp; Electrical Engineering 93 (2021) 107238. URL: https://doi.org/10.1016/j.compeleceng.2021.107238. doi:10.1016/j.compeleceng.2021.107238.</p>
          <p>[7] C. S. Anoop, A. G. Ramakrishnan, Suitability of syllable-based modeling units for end-to-end speech recognition in Sanskrit and other Indian languages, Expert Systems with Applications 220 (2023) 119722. URL: https://doi.org/10.1016/j.eswa.2023.119722. doi:10.1016/j.eswa.2023.119722.</p>
          <p>[8] C. J. Cho, A. Mohamed, S.-W. Li, A. W. Black, G. K. Anumanchipalli, SD-HuBERT: Sentence-level self-distillation induces syllabic organization in HuBERT, arXiv (2024). URL: http://arxiv.org/abs/2310.10803.</p>
          <p>[9] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique, Neural Computing and Applications 36 (2024) 6875-6901. URL: https://doi.org/10.1007/s00521-024-09435-1. doi:10.1007/s00521-024-09435-1.</p>
          <p>[10] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, M. Zeng, X. Yu, F. Wei, WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1-14. doi:10.1109/JSTSP.2022.3188113.</p>
          <p>[11] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, G. Weber, Common Voice: A massively-multilingual speech corpus, arXiv (2020). URL: https://doi.org/10.48550/arXiv.1912.06670.</p>
          <p>[12] F. Schiel, A statistical model for predicting pronunciation, in: Proceedings of the ICPhS 2015, Glasgow, UK, 2015, paper 195.</p>
          <p>[13] T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 66-75. URL: http://arxiv.org/abs/1804.10959.</p>
          <p>[14] D. Kahn, Syllable-based generalizations in English phonology, Ph.D. thesis, Massachusetts Institute of Technology, 1976. URL: https://dspace.mit.edu/handle/1721.1/16397.</p>
          <p>[15] G. N. Clements, The role of the sonority cycle in core syllabification, in: J. Kingston, M. E. Beckman (Eds.), Papers in Laboratory Phonology: Volume 1: Between the Grammar and Physics of Speech, Cambridge University Press, 1990, pp. 283-333. URL: https://doi.org/10.1017/CBO9780511627736.017. doi:10.1017/CBO9780511627736.017.</p>
          <p>[16] G. Marotta, L. Vanelli, Fonologia e prosodia dell'italiano, Carocci editore, 2021.</p>
          <p>[17] M. Krämer, The Phonology of Italian, Oxford University Press, Oxford, New York, 2009.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>A. Appendix</title>
      <sec id="sec-2-1">
        <title>A.1. Prediction types (PT)</title>
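        <p>The labels below follow the affix scheme of Section 2.2 (word-level mer/div/mv, movement direction forw/back, token-level ins/sub/del/eq, and the syl/word suffix). As an illustration only, such tags can be decomposed programmatically:</p>
        <preformat>
```python
# Illustrative sketch: decompose a PT label into its affixes, following the
# scheme described in Section 2.2. Field names are our own choices.
WORD_LEVEL = {"mer", "div", "mv"}
DIRECTION = {"forw", "back"}
TOKEN_LEVEL = {"ins", "sub", "del", "eq"}

def parse_pt(label):
    parts = label.split("_")
    out = {"word_event": None, "direction": None,
           "token_event": None, "scope": parts[-1]}  # "syl" or "word"
    for p in parts[:-1]:
        if p in WORD_LEVEL:
            out["word_event"] = p
        elif p in DIRECTION:
            out["direction"] = p
        elif p in TOKEN_LEVEL:
            out["token_event"] = p
    return out
```
        </preformat>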
        <p>Each prediction tag (PT) is listed with an example prediction and the corresponding reference transcription.</p>
        <table-wrap id="tab-a1">
          <table>
            <thead>
              <tr><th>Label</th><th>Prediction</th><th>Reference</th></tr>
            </thead>
            <tbody>
              <tr><td>eq_syl</td><td>do po | al ku ni |</td><td>do po | al ku ni |</td></tr>
              <tr><td>sub_syl</td><td>mO do | ve tSo |</td><td>mO do | de tSo |</td></tr>
              <tr><td>ins_syl</td><td>i | lo ro | a bi ta tta |</td><td>i | lo ro | a bi tat |</td></tr>
              <tr><td>del_syl</td><td>kom ple ta men te | sO - |</td><td>kom ple ta men te | so lo |</td></tr>
              <tr><td>sub_syl_word</td><td>kon | E | di ven ta to |</td><td>non | E | di ven ta to |</td></tr>
              <tr><td>ins_syl_word</td><td>te | i |</td><td>ti |</td></tr>
              <tr><td>del_syl_word</td><td>so pra ttu tto | - | ma ssa ka tSe ts |</td><td>so pra ttu tto | in | ma ssa tSu se tts |</td></tr>
              <tr><td>mv_eq_forw_syl</td><td>o ri dZi | ni mi ti ke |</td><td>o ri dZi ni | mi ti ke |</td></tr>
              <tr><td>mv_sub_forw_syl</td><td>E stre | ro u ma no |</td><td>E sse re | u ma no |</td></tr>
              <tr><td>mv_eq_back_syl</td><td>da ve | tra te |</td><td>da | ve tra te |</td></tr>
              <tr><td>mv_sub_back_syl</td><td>tu tta vi a no |</td><td>tu tta vi a | non |</td></tr>
              <tr><td>div_eq_syl</td><td>a | pu ddZa | da</td><td>a ppo ddZa ta |</td></tr>
              <tr><td>div_sub_syl</td><td>a | pu ddZa | da</td><td>a ppo ddZa ta |</td></tr>
              <tr><td>div_ins_syl</td><td>fra | zi i |</td><td>fra zi |</td></tr>
              <tr><td>mer_eq_syl</td><td>kwa ttro po sti |</td><td>kwa ttro | po sti |</td></tr>
              <tr><td>mer_sub_syl</td><td>sE | la u re a to |</td><td>si | E | la u re a to |</td></tr>
              <tr><td>mer_ins_syl</td><td>pu kwe stE ro no | kO lle</td><td>kwe stEr mo | ko lle |</td></tr>
              <tr><td>mer_del_syl</td><td>fi nO - tto |</td><td>fi no | ad | O tto |</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-2">
          <title>A.2. Summary of the model</title>
          <p>Model terms by outcome level (y.level). Deletion: (Intercept), freq_tok_R_catmid, freq_tok_R_cathigh, stress_Runstr, freq_tok_R_catmid:stress_Runstr, freq_tok_R_cathigh:stress_Runstr. Substitution: (Intercept), freq_tok_R_catmid, freq_tok_R_cathigh, stress_Runstr. (Estimate values omitted.)</p>
        </sec>
        <sec id="sec-2-3">
          <title>A.3. Pairwise comparison by stress</title>
          <p>Contrasts of stress_R (str vs. unstr) on the term freq_tok_R_cat for each prediction type (equal, deletion, substitution). (Contrast values omitted.)</p>
        </sec>
        <sec id="sec-2-4">
          <title>A.4. Pairwise comparison by frequency</title>
          <p>Contrasts across the levels of freq_tok_R_cat (low, mid, high) within each stress_R level (str, unstr) for each prediction type (equal, deletion, substitution). (Contrast values omitted.)</p>
        </sec>
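        <p>The model summarized above was fitted with R's nnet library. As an illustrative stand-in only, a minimal multinomial logistic regression can be written from scratch (toy gradient-ascent fit; the data, class count and learning settings below are invented for demonstration). Exponentiating the fitted coefficients, np.exp(W), moves them to the odds scale used in this appendix:</p>
        <preformat>
```python
# Illustrative stand-in for the multinomial regression of Section 3.2: a toy
# multinomial logistic regression trained by gradient ascent. Data and
# hyperparameters are invented assumptions, not the paper's fit.
import numpy as np

def fit_multinomial(X, y, n_classes, lr=0.5, steps=5000):
    """X: (n, d) design matrix with an intercept column of ones;
    y: integer class labels in range(n_classes). Returns (d, n_classes)."""
    W = np.zeros((X.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                              # one-hot targets
    for _ in range(steps):
        logits = X @ W
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        P = np.exp(logits)
        P = P / P.sum(axis=1, keepdims=True)              # softmax
        W = W + lr * (X.T @ (Y - P)) / len(X)             # ascend log-likelihood
    return W

def predict(W, X):
    return np.argmax(X @ W, axis=1)
```
        </preformat>
        <p>In the paper's setting, the outcome classes would be the prediction types (equal, substitution, deletion) and the design matrix would encode freq_tok_R_cat, stress_R and their interaction.</p>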
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>