<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sensitivity of Syllable-Based ASR Predictions to Token Frequency and Lexical Stress</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Vietti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico De Cristofaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Picciau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano, Libera Università di Bolzano</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Automatic Speech Recognition (ASR) systems based on neural networks achieve strong results, but it remains unclear which linguistic features and representations the models leverage to perform recognition. In our study, we used phonological syllables as tokens to fine-tune an end-to-end ASR model, owing to their relevance as linguistic units. Furthermore, this strategy allowed us to keep track of different types of linguistic features characterizing the tokens. The analysis of the transcriptions generated by the model reveals that factors such as token frequency and lexical stress have a variable impact on the prediction strategies adopted by the ASR system.</p>
      </abstract>
      <kwd-group>
        <kwd>Automatic Speech Recognition</kwd>
        <kwd>Syllable</kwd>
        <kwd>Phonology</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The syllable is crucial in the process of spoken word recognition. It serves as an integral component within the prosodic system because it encompasses both the traditional segmental and the suprasegmental levels, facilitating the extraction of lexical and syntactic structures from acoustic information [1, 2]. Specifically, the syllable is the linguistic unit in which crucial information for speech segmentation, rhythmic patterns, and lexical access is encoded [3]. In the field of Automatic Speech Recognition (ASR), the graphemic segment has traditionally been the primary unit of processing. However, recent studies endorse the use of syllables, or phonetic units of similar duration, as an alternative strategy [4, 5, 6]. In the latest ASR research employing Transformer-based neural models, the role of syllables is investigated both as tokens for word recognition and as components influencing internal speech representations within neural networks [7, 8, 9].</p>
      <p>In our study, a neural ASR model was trained to process and recognize phonological syllables, integrating them into word structures. Our goal is to conduct a linguistic analysis of the output of syllabic processing by the speech recognition system. Through fine-tuning a large acoustic model, the study mapped speech signals onto phonological transcriptions segmented into syllables and words. The primary objective of our linguistic analysis is to test the effect of syllable token frequency and lexical stress on the accuracy of the output neural representation. To understand how the ASR processes syllables and words differently, we developed a fine-grained linguistic annotation system. This approach was essential to move beyond the limitations of purely numerical metrics such as Word Error Rate or, in our context, Token Error Rate. By employing this system, we could accurately categorize prediction types and link them to specific linguistic aspects of speech. We used Multiple Correspondence Analysis and Multinomial Logistic Regression to explore and uncover patterns that relate the neural network's output behavior to the linguistic factors.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. * Corresponding author. † These authors contributed equally. Alessandro.Vietti@unibz.it (A. Vietti); dodecristofaro@unibz.it (D. De Cristofaro); sapicciau@unibz.it (S. Picciau). ORCID 0000-0002-4166-540X (A. Vietti). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <sec id="sec-1-m">
        <title>2. Methodology</title>
        <sec id="sec-1-m-1">
          <title>2.1. Data preparation and experimental setup</title>
          <p>The preparation of the experiment started with the collection of the data to fine-tune the pre-trained Microsoft model WavLM-large [10]. Our dataset consists of approximately 30 hours of Italian data from the crowd-sourced corpus Common Voice [11], using 6,500 samples (5,000 for training, 500 for testing, and 1,000 for validation). The total Italian subset in Common Voice 13.0 comprises 6,881 speakers and spans approximately 343 hours of recorded speech. Since we are interested in observing the role that some phonological aspects might play in the recognition process, we used WebMAUS [12] to obtain X-SAMPA transcriptions of the corpus. In addition, we forced the model to recognize phonological syllables as tokens, instead of automatically generated subwords based on probability, frequency and likelihood [13]. We designed a custom tokenizer that relies on the Maximal Onset Principle [14] and the Sonority Sequencing Principle [15] and that, exceptionally, considers /s/+stop clusters and geminates as part of the syllable onset [16, 17]. In order to observe the placement of the recognized tokens and word boundaries in detail, we set the output format of the model so that tokens are separated by blank spaces and words are separated by pipes, as can be seen in example (1):</p>
          <p>(1) il | vwO to | a sso lu to |</p>
        </sec>
        <sec id="sec-1-m-2">
          <title>2.2. Creation of the database</title>
          <p>Once we tested the model and obtained the predictions, we extracted a sample of 300 pairs of reference and predicted sentences (Rs and Ps, respectively). The detailed observation of the pairs allowed us to define a set of prediction types. Word-level prediction types are those that affect canonical word boundaries and consist of three categories: merged words, meaning two reference words recognized as one; divided words, consisting of a single reference word recognized as two or more words; and token movement, namely the change of a reference token's position across adjacent word boundaries. At a token level, prediction types represent deviations in terms of token insertion, substitution and deletion, as well as correctly recognized tokens. We then designed a set of labels (prediction tags, PT; see Appendix A.1) representing the prediction types to annotate the tokens of our dataset.</p>
        </sec>
      </sec>
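      <p>As an illustration of the tokenization strategy described in Section 2.1, the following sketch is a toy re-implementation, not the paper's actual tokenizer: it applies the Maximal Onset Principle with the /s/+stop and geminate exceptions to an X-SAMPA-like string. The vowel and onset inventories are simplified assumptions.</p>
      <preformat>
```python
# Illustrative sketch, not the authors' tokenizer: a toy syllabifier in the
# spirit of Section 2.1, applying the Maximal Onset Principle with the
# /s/+stop and geminate exceptions. The vowel and onset inventories below
# are simplified assumptions, not the paper's full X-SAMPA grammar.
VOWELS = "aeiouEO"

def legal_onset(cl):
    """Rough onset legality: single consonants, geminates, /s/+stop and
    obstruent+liquid clusters count as licit onsets (cf. [14-17])."""
    if len(cl) in (0, 1):
        return True
    if len(cl) == 2:
        a, b = cl
        if a == "s" and b in "pbtdkg":    # /s/+stop exception
            return True
        if a == b:                        # geminate kept in the onset
            return True
        return a in "pbtdkgfv" and b in "rl"
    if len(cl) == 3:
        return (cl[0] == "s" or cl[0] == cl[1]) and legal_onset(cl[1:])
    return False

def syllabify(word):
    """Place each intervocalic boundary so that the following syllable
    receives the longest legal onset (Maximal Onset Principle)."""
    nuclei = [i for i, ch in enumerate(word) if ch in VOWELS]
    if not nuclei:
        return [word]
    sylls, start = [], 0
    for a, b in zip(nuclei, nuclei[1:]):
        cluster = word[a + 1:b]
        k = len(cluster)                  # default: whole cluster to the coda
        for j in range(len(cluster) + 1):
            if legal_onset(cluster[j:]):
                k = j                     # longest legal onset found
                break
        boundary = a + 1 + k
        sylls.append(word[start:boundary])
        start = boundary
    sylls.append(word[start:])
    return sylls
```
      </preformat>
      <p>Applied to "assoluto", this toy grammar reproduces the segmentation "a sso lu to" shown in example (1), with the geminate /ss/ assigned to the onset.</p>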
      <p>The labels consist of a sequence of affixes indicating the detected recognition events. Word-level affixes are mer, div and mv, plus, in the case of token movement, forw or back to mark the direction of the shift; token-level affixes are ins, sub, del and eq. Lastly, the suffix syl or word indicates whether the phenomenon regards an individual token or the whole word. Examples of our annotation scheme are given in Appendix A.1.</p>
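      <p>The semi-automated PT labeling described in Section 2.2 hinges on a Levenshtein-based word similarity check. A minimal illustrative sketch follows (our code, not the authors'; the 0.5 threshold is an invented placeholder, as the paper does not report its value):</p>
      <preformat>
```python
# Illustrative sketch: Levenshtein-based similarity used to confirm or
# dismiss matches between reference words (Rw) and predicted words (Pw).
# The threshold value is a placeholder assumption.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def similarity(rw, pw):
    """Normalized similarity in [0, 1] between Rw and Pw."""
    longest = max(len(rw), len(pw)) or 1
    return 1 - levenshtein(rw, pw) / longest

def word_match(rw, pw, threshold=0.5):
    """Confirm a word match when similarity reaches the threshold."""
    return similarity(rw, pw) >= threshold
```
      </preformat>
      <p>When a pair falls below the threshold, the algorithm compares Rw against the adjacent Pws (and vice versa) before labelling a word as inserted or deleted.</p>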
      <p>Given our dataset size of approximately 5,900 tokens, a manual annotation of each entry would have been extremely time-consuming. We therefore designed an algorithm that compares reference and predicted tokens (Rt and Pt, respectively) in order to obtain semi-automated PT labeling. The algorithm works as follows: first, it attempts to identify the correspondences between reference and predicted words (Rw, Pw) despite potential mismatches caused by prediction types affecting word boundaries. Each pair of sentences is split into words, and a similarity function based on Levenshtein distance is used to confirm or dismiss word matches. If the similarity score is lower than the established threshold, it indicates a mismatch. When this occurs, similarity is calculated between Rw and the adjacent Pws, and vice versa. If a (partial) match is found, the word-level PT is appended to the corresponding tokens; otherwise, unmatched words are labelled as inserted (when not found in Rs) or deleted (when not found in Pt). Once word-level matches are identified, the algorithm proceeds to compare each Rt and Pt within Rw and Pw respectively, and then assigns the corresponding PT at the token level. The mechanism for finding token matches within words and assigning token-level PTs is analogous to the one described above. The implementation of this algorithm allowed us to annotate most of the dataset automatically. However, many entries required manual intervention, as in cases of assimilation or of predictions of very low quality, which resulted in significant mismatches. Lastly, we added some phonological information about each token to our dataset in order to conduct our linguistic analysis. We included the relative frequency of Rt in the whole training dataset and lexical stress, as well as the presence of the token in the training vocabulary, the POS of Rw, and the speech rate of Rs. However, only the first two variables were taken into consideration for the statistical analysis in this work.</p>
      <sec id="sec-1-r">
        <title>3. Results</title>
        <sec id="sec-1-r-1">
          <title>3.1. Explorative analysis</title>
          <p>To analyze our prediction database, we first looked at the distribution of prediction types. Next, we used Multiple Correspondence Analysis (MCA) to explore the relationships between prediction types, token frequency, presence in the training vocabulary, and lexical stress.</p>
          <p>The syllable-based fine-tuned ASR model showed a high degree of accuracy in prediction, with only 28% of tokens showing notable recognition errors, making eq_syl the most frequent category.</p>
          <p>The following figures show the detailed distribution of marked prediction types. Our structured labeling system allows us to separately examine token-level phenomena and those affecting sentence structure through word-boundary errors. Figure 1 highlights that substitution is the most common token-level operation, followed by deletion and insertion. This means that most incorrectly recognized tokens still appear in the model's hypothesized transcription. However, token deletions and insertions (including those of entire words, such as prepositions, determiners, or auxiliary verbs) lead to more significant recognition discrepancies. It should be noted that the use of automatically generated phonological transcriptions as references increases the number of substitutions, owing to speech variability in the corpus.</p>
          <p>Figure 2 shows the distribution of operation/equality tags affecting canonical word boundaries. Merging is the most frequent process, involving 401 tokens, followed by divided words with 206 occurrences and movement of single tokens with 48 instances. Unlike the other categories, the movement label applies to single tokens. Tokens in merged and divided words were mostly recognized correctly, with substitution being the second most common operation. Token deletion occurs more often in merged words, while token insertion is higher in divided words. For moved tokens, the distribution of equal and substituted tokens is nearly identical. Deletions and insertions do not apply to moved tokens, since these can be neither missing from nor added to the prediction.</p>
          <p>Figure 3 shows the Multiple Correspondence Analysis results obtained with the FactoMineR R package. This analysis reveals patterns between prediction types (event_syllable), token frequency (freq_tok_R_cat), presence in the training vocabulary (in_vocab_R), and lexical stress (stress_R). The relative frequency of tokens in the dataset was discretized into three levels using quantiles, to obtain a uniform distribution of tokens across the three categories: from zero to one-third of tokens is "low frequency" (0-0.5%), from one-third to two-thirds is "mid frequency" (0.5-2.23%), and from two-thirds to one is "high frequency" (2.23-6.87%). Part of speech (POS) and syllable type (tok_type_R) were added later as supplementary variables to guide the linguistic interpretation of the analysis. Insertion, being the least frequent operation, and complex syllable types (like CCVCC) were excluded due to their low frequency.</p>
        </sec>
      </sec>
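      <p>The tercile-based discretization of token frequency can be sketched as follows (illustrative only; the function name and data layout are our assumptions, mirroring the low/mid/high split of freq_tok_R_cat rather than the authors' actual code):</p>
      <preformat>
```python
# Illustrative sketch (assumed data layout, not the authors' pipeline):
# relative frequency of each token over the training tokens, discretized
# into tercile-based low/mid/high categories as done for freq_tok_R_cat.
from collections import Counter

def frequency_categories(train_tokens, cuts=(1 / 3, 2 / 3)):
    counts = Counter(train_tokens)
    total = sum(counts.values())
    rel = {tok: 100.0 * c / total for tok, c in counts.items()}  # percent
    ranked = sorted(rel, key=rel.get)          # tokens by rising frequency
    cats = {}
    for rank, tok in enumerate(ranked):
        q = (rank + 1) / len(ranked)           # empirical quantile of the token
        if q > cuts[1]:
            cats[tok] = "high"
        elif q > cuts[0]:
            cats[tok] = "mid"
        else:
            cats[tok] = "low"
    return rel, cats
```
      </preformat>
      <p>Splitting at the terciles of the ranked token list, rather than at fixed frequency values, is what yields a roughly uniform number of token types per category.</p>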
      <p>MCA is a dimensionality reduction technique for
categorical variables, so the significance of the dimensions
is derived from the distribution of the levels of the
variables projected onto the plane. Interestingly, the top
section shows that unstressed high-frequency tokens (over
2.23%), mainly subordinating conjunctions and
determiners, are associated with deletion. The bottom-left section
includes mid-frequency items (0.5% - 2.23%) with
simple syllabic structures (CV) that are typically recognized
correctly. Tokens with low frequency or which are
absent from the training vocabulary are on the right side
of the MCA chart. These less frequent, complex syllable
tokens, often occurring in proper nouns and numerals,
are typically handled with substitution.</p>
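      <p>The paper runs MCA with FactoMineR in R. As a language-neutral illustration, a bare-bones MCA over the one-hot indicator matrix can be obtained via an SVD of the standardized residuals; this is a stand-in sketch under our own assumptions, not the authors' pipeline, and the variable tuples are invented:</p>
      <preformat>
```python
# Hedged sketch: minimal multiple correspondence analysis via SVD of the
# centered indicator (one-hot) table, standing in for FactoMineR's MCA().
import numpy as np

def mca_coordinates(rows, n_dims=2):
    """rows: list of tuples of categorical levels (one tuple per token).
    Returns principal coordinates of the row points."""
    # build the complete disjunctive (one-hot) table
    levels = sorted({(j, v) for row in rows for j, v in enumerate(row)})
    col = {lv: k for k, lv in enumerate(levels)}
    Z = np.zeros((len(rows), len(levels)))
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            Z[i, col[(j, v)]] = 1.0
    # correspondence matrix and standardized residuals
    P = Z / Z.sum()
    r = P.sum(axis=1, keepdims=True)           # row masses
    c = P.sum(axis=0, keepdims=True)           # column masses
    S = (P - r @ c) / np.sqrt(r @ c)
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    # row principal coordinates: D_r^(-1/2) U Sigma, truncated to n_dims
    return (U * sv)[:, :n_dims] / np.sqrt(r)
```
      </preformat>
      <p>Tokens with identical category profiles land on the same point of the plane, which is why levels such as "high frequency" and "deletion" can be read off by proximity in Figure 3.</p>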
      <sec id="sec-1-1">
        <title>3.2. Multinomial analysis</title>
        <p>To statistically validate the findings from the MCA (Figure 3), we conducted a multinomial logistic regression analysis using the nnet R library. The model examines the interaction between token frequency and lexical stress and, in this analysis, expresses the regression coefficients as odds (instead of logits) (see Appendix A.2). By looking at the plots of the model predictions and jointly evaluating the pairwise comparisons from the two tables (see Appendix A.4 and A.3), we can obtain a clearer interpretation of the results of the regression analysis. In Figure 4, we notice that when the prediction is equal to the reference, token frequency has a significant effect in the case of stressed syllables, whereas it appears to be less statistically relevant for unstressed syllables. Additionally, the difference between the presence and the absence of lexical accent becomes significant as the frequency increases from low to mid to high. Regarding substitution, the patterns seem complementary to those observed in the matching of reference and prediction (i.e., in the equal plot). When syllables have a low frequency in the dataset, the probability that they are replaced with other syllabic tokens significantly increases. Although we have not explored which syllabic tokens or types they are replaced with, and based on what criteria, it is safe to assume that this may be due to phonetic similarity. Specifically, there is a significant difference only between low frequency and the combined mid and high frequencies, for both stressed and unstressed syllables. As for deletion, the regression coefficients reveal that the probability of deletion of unstressed syllables increases with frequency, but only in the transition from low to mid frequency, with no further increase from mid to high frequency. For stressed syllables, the neutralization of the frequency effect is confirmed by the analysis of the coefficients. A quick exploration of the most frequently deleted mid-frequency syllables shows that the preposition 'a' and V syllables in word-initial position are the most likely to be deleted.</p>
        <sec id="sec-1-c">
          <title>4. Conclusions and future work</title>
          <p>This study provides insights into the role of syllables in ASR performance, particularly when integrating phonological information into the recognition process. By fine-tuning a neural ASR model to process and recognize phonological syllables, we were able to conduct a detailed linguistic analysis of its output. Our findings indicate that syllable frequency and lexical stress significantly impact ASR accuracy. Specifically, stressed syllables are more accurately recognized than unstressed ones, especially as frequency increases. Contrary to our expectation, among the low-frequency syllables, stressed tokens are more prone to substitution, whereas mid-frequency unstressed ones are more susceptible to deletion. This demonstrates the neural model's sensitivity both to distributional information in the dataset and to phonological information, and highlights the model's ability to detect varying syllabic prominence at the lexical level within the signal. As future work, we plan to include other linguistic factors as independent variables to refine our analysis. An interesting approach is to evaluate the impact of unstressed syllables and specific parts of speech by conducting an analysis exclusively on content words. Furthermore, we aim to investigate syllable substitution in detail, in relation to token frequency and phonetic similarity, to compare the weight of each factor whenever this strategy is adopted to deal with low-frequency tokens. In conclusion, our study showed the influence of token frequency and prominence on ASR predictions, while demonstrating that complex computational tools, like modern neural networks, can be effectively utilized by linguists to simulate and test linguistically relevant hypotheses.</p>
        </sec>
        <sec id="sec-1-refs">
          <title>References</title>
          <p>[1] M. E. Beckman, The parsing of prosody, Language and Cognitive Processes 11 (1996) 17-68. URL: https://doi.org/10.1080/016909696387213. doi:10.1080/016909696387213.</p>
          <p>[2] S. Hawkins, R. Smith, Polysp: A polysystemic, phonetically-rich approach to speech understanding, Italian Journal of Linguistics 13 (2001) 99-189.</p>
          <p>[3] J. M. McQueen, L. Dilley, Prosody and spoken-word recognition, in: C. Gussenhoven, A. Chen (Eds.), The Oxford Handbook of Language Prosody, 2021, pp. 508-521.</p>
          <p>[4] S. Greenberg, Speaking in shorthand: a syllable-centric perspective for understanding pronunciation variation, Speech Communication 29 (1999) 159-176.</p>
          <p>[5] N. Morgan, H. Bourlard, H. Hermansky, Automatic speech recognition: An auditory perspective, in: S. Greenberg, W. A. Ainsworth, A. N. Popper, R. R. Fay (Eds.), Speech Processing in the Auditory System, Springer, New York, 2004, pp. 309-338.</p>
          <p>[6] G. Coro, F. V. Massoli, A. Origlia, F. Cutugno, Psycho-acoustics inspired automatic speech recognition, Computers &amp; Electrical Engineering 93 (2021) 107238. URL: https://doi.org/10.1016/j.compeleceng.2021.107238. doi:10.1016/j.compeleceng.2021.107238.</p>
          <p>[7] C. S. Anoop, A. G. Ramakrishnan, Suitability of syllable-based modeling units for end-to-end speech recognition in Sanskrit and other Indian languages, Expert Systems with Applications 220 (2023) 119722. URL: https://doi.org/10.1016/j.eswa.2023.119722. doi:10.1016/j.eswa.2023.119722.</p>
          <p>[8] C. J. Cho, A. Mohamed, S.-W. Li, A. W. Black, G. K. Anumanchipalli, SD-HuBERT: Sentence-level self-distillation induces syllabic organization in HuBERT, arXiv (2024). URL: http://arxiv.org/abs/2310.10803.</p>
          <p>[9] V. N. Vitale, F. Cutugno, A. Origlia, G. Coro, Exploring emergent syllables in end-to-end automatic speech recognizers through model explainability technique, Neural Computing and Applications 36 (2024) 6875-6901. URL: https://doi.org/10.1007/s00521-024-09435-1. doi:10.1007/s00521-024-09435-1.</p>
          <p>[10] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, M. Zeng, X. Yu, F. Wei, WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing 16 (2022) 1-14. doi:10.1109/JSTSP.2022.3188113.</p>
          <p>[11] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, G. Weber, Common Voice: A massively-multilingual speech corpus, arXiv (2020). URL: https://doi.org/10.48550/arXiv.1912.06670.</p>
          <p>[12] F. Schiel, A statistical model for predicting pronunciation, in: Proceedings of the ICPhS 2015, Glasgow, UK, 2015, paper 195.</p>
          <p>[13] T. Kudo, Subword regularization: Improving neural network translation models with multiple subword candidates, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 66-75. URL: http://arxiv.org/abs/1804.10959.</p>
          <p>[14] D. Kahn, Syllable-based generalizations in English phonology, Ph.D. thesis, Massachusetts Institute of Technology, 1976. URL: https://dspace.mit.edu/handle/1721.1/16397.</p>
          <p>[15] G. N. Clements, The role of the sonority cycle in core syllabification, in: J. Kingston, M. E. Beckman (Eds.), Papers in Laboratory Phonology: Volume 1: Between the Grammar and Physics of Speech, Cambridge University Press, 1990, pp. 283-333. URL: https://doi.org/10.1017/CBO9780511627736.017. doi:10.1017/CBO9780511627736.017.</p>
          <p>[16] G. Marotta, L. Vanelli, Fonologia e prosodia dell'italiano, Carocci editore, 2021.</p>
          <p>[17] M. Krämer, The Phonology of Italian, Oxford University Press, Oxford, New York, 2009.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>A. Appendix</title>
      <sec id="sec-2-1">
        <title>A.1. Prediction types (PT)</title>
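        <p>The labels below follow the affix scheme of Section 2.2 (word-level mer/div/mv, movement direction forw/back, token-level ins/sub/del/eq, and the syl/word suffix). As an illustration only, such tags can be decomposed programmatically:</p>
        <preformat>
```python
# Illustrative sketch: decompose a PT label into its affixes, following the
# scheme described in Section 2.2. Field names are our own choices.
WORD_LEVEL = {"mer", "div", "mv"}
DIRECTION = {"forw", "back"}
TOKEN_LEVEL = {"ins", "sub", "del", "eq"}

def parse_pt(label):
    parts = label.split("_")
    out = {"word_event": None, "direction": None,
           "token_event": None, "scope": parts[-1]}  # "syl" or "word"
    for p in parts[:-1]:
        if p in WORD_LEVEL:
            out["word_event"] = p
        elif p in DIRECTION:
            out["direction"] = p
        elif p in TOKEN_LEVEL:
            out["token_event"] = p
    return out
```
        </preformat>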
        <p>Each prediction tag (PT) is listed with an example prediction and the corresponding reference transcription.</p>
        <table-wrap id="tab-a1">
          <table>
            <thead>
              <tr><th>Label</th><th>Prediction</th><th>Reference</th></tr>
            </thead>
            <tbody>
              <tr><td>eq_syl</td><td>do po | al ku ni |</td><td>do po | al ku ni |</td></tr>
              <tr><td>sub_syl</td><td>mO do | ve tSo |</td><td>mO do | de tSo |</td></tr>
              <tr><td>ins_syl</td><td>i | lo ro | a bi ta tta |</td><td>i | lo ro | a bi tat |</td></tr>
              <tr><td>del_syl</td><td>kom ple ta men te | sO - |</td><td>kom ple ta men te | so lo |</td></tr>
              <tr><td>sub_syl_word</td><td>kon | E | di ven ta to |</td><td>non | E | di ven ta to |</td></tr>
              <tr><td>ins_syl_word</td><td>te | i |</td><td>ti |</td></tr>
              <tr><td>del_syl_word</td><td>so pra ttu tto | - | ma ssa ka tSe ts |</td><td>so pra ttu tto | in | ma ssa tSu se tts |</td></tr>
              <tr><td>mv_eq_forw_syl</td><td>o ri dZi | ni mi ti ke |</td><td>o ri dZi ni | mi ti ke |</td></tr>
              <tr><td>mv_sub_forw_syl</td><td>E stre | ro u ma no |</td><td>E sse re | u ma no |</td></tr>
              <tr><td>mv_eq_back_syl</td><td>da ve | tra te |</td><td>da | ve tra te |</td></tr>
              <tr><td>mv_sub_back_syl</td><td>tu tta vi a no |</td><td>tu tta vi a | non |</td></tr>
              <tr><td>div_eq_syl</td><td>a | pu ddZa | da</td><td>a ppo ddZa ta |</td></tr>
              <tr><td>div_sub_syl</td><td>a | pu ddZa | da</td><td>a ppo ddZa ta |</td></tr>
              <tr><td>div_ins_syl</td><td>fra | zi i |</td><td>fra zi |</td></tr>
              <tr><td>mer_eq_syl</td><td>kwa ttro po sti |</td><td>kwa ttro | po sti |</td></tr>
              <tr><td>mer_sub_syl</td><td>sE | la u re a to |</td><td>si | E | la u re a to |</td></tr>
              <tr><td>mer_ins_syl</td><td>pu kwe stE ro no | kO lle</td><td>kwe stEr mo | ko lle |</td></tr>
              <tr><td>mer_del_syl</td><td>fi nO - tto |</td><td>fi no | ad | O tto |</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-2">
          <title>A.2. Summary of the model</title>
          <p>Model terms by outcome level (y.level). Deletion: (Intercept), freq_tok_R_catmid, freq_tok_R_cathigh, stress_Runstr, freq_tok_R_catmid:stress_Runstr, freq_tok_R_cathigh:stress_Runstr. Substitution: (Intercept), freq_tok_R_catmid, freq_tok_R_cathigh, stress_Runstr. (Estimate values omitted.)</p>
        </sec>
        <sec id="sec-2-3">
          <title>A.3. Pairwise comparison by stress</title>
          <p>Contrasts of stress_R (str vs. unstr) on the term freq_tok_R_cat for each prediction type (equal, deletion, substitution). (Contrast values omitted.)</p>
        </sec>
        <sec id="sec-2-4">
          <title>A.4. Pairwise comparison by frequency</title>
          <p>Contrasts across the levels of freq_tok_R_cat (low, mid, high) within each stress_R level (str, unstr) for each prediction type (equal, deletion, substitution). (Contrast values omitted.)</p>
        </sec>
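        <p>The model summarized above was fitted with R's nnet library. As an illustrative stand-in only, a minimal multinomial logistic regression can be written from scratch (toy gradient-ascent fit; the data, class count and learning settings below are invented for demonstration). Exponentiating the fitted coefficients, np.exp(W), moves them to the odds scale used in this appendix:</p>
        <preformat>
```python
# Illustrative stand-in for the multinomial regression of Section 3.2: a toy
# multinomial logistic regression trained by gradient ascent. Data and
# hyperparameters are invented assumptions, not the paper's fit.
import numpy as np

def fit_multinomial(X, y, n_classes, lr=0.5, steps=5000):
    """X: (n, d) design matrix with an intercept column of ones;
    y: integer class labels in range(n_classes). Returns (d, n_classes)."""
    W = np.zeros((X.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                              # one-hot targets
    for _ in range(steps):
        logits = X @ W
        logits = logits - logits.max(axis=1, keepdims=True)  # stability
        P = np.exp(logits)
        P = P / P.sum(axis=1, keepdims=True)              # softmax
        W = W + lr * (X.T @ (Y - P)) / len(X)             # ascend log-likelihood
    return W

def predict(W, X):
    return np.argmax(X @ W, axis=1)
```
        </preformat>
        <p>In the paper's setting, the outcome classes would be the prediction types (equal, substitution, deletion) and the design matrix would encode freq_tok_R_cat, stress_R and their interaction.</p>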
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>