<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Domenico De Cristofaro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Vietti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marianne Pouplier</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aleese Block</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ALPS, Alpine Laboratory of Phonetic Sciences</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LMU Munich</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors: cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.</p>
      </abstract>
      <kwd-group>
<kwd>Speech Recognition</kwd>
        <kwd>Low-Resourced Languages</kwd>
        <kwd>Logit Lens</kwd>
        <kwd>Interpretability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Recent research in multilingual speech foundation models has revealed that intermediate representations often encode richer phonetic information than the final output layer. Using Logit Lens-style probing across encoder layers, studies such as Shim et al. [1] and Langedijk et al. [2] have shown that earlier layers in transformer-based models such as Whisper yield lower Word Error Rates (WER) and Character Error Rates (CER).</p>
      <p>Building on this line of work, we investigate whether removing upper transformer layers in a pretrained multilingual ASR model influences its phoneme-level decoding behavior. Our hypothesis is grounded in prior findings—particularly those of Shim et al. [1]—which demonstrate that applying a Logit Lens probing strategy to intermediate encoder layers results in lower CER for low-resource languages unseen during training. However, this raises a crucial question: what kinds of errors are actually reduced when decoding from intermediate layers instead of the full model? More specifically, are the mistakes made by the final layer already resolved in earlier layers? To answer this, we perform a systematic layer-wise decoding analysis using the pretrained facebook/wav2vec2-xlsr-53-espeak-cv-ft model on Sardinian audio data. We progressively truncate the encoder by removing a varying number of top transformer layers before decoding. For each configuration, we decode phoneme sequences and compare the output to gold-standard phonemic transcriptions, measuring overall Phoneme Error Rate (PER) and analyzing error types (insertions, deletions, substitutions).</p>
      <p>Our contributions:</p>
      <list list-type="bullet">
        <list-item><p>we present a phoneme-level layer-wise analysis of Wav2Vec2 on a low-resource Sardinian dataset;</p></list-item>
        <list-item><p>we introduce the notion of regressive errors in ASR layer-wise decoding;</p></list-item>
        <list-item><p>we show that intermediate layers (e.g., Layer 22) yield more phonetically accurate hypotheses than the final layer.</p></list-item>
      </list>
      <sec id="sec-1-1">
        <title>Interpretability has become a central concern in the anal</title>
        <p>ysis of deep learning models for NLP and speech,
particularly when it comes to understanding how linguistic
representations emerge across network layers. In ASR,
probing techniques such as Singular Vector Canonical
Correlation Analysis (SVCCA) [3] and layer-wise probing
classifiers [ 4] have been used to assess the presence of
phonetic and phonological features in hidden
representaproperties can be selectively removed from representa- as further reduction leads to a substantial degradation in
tions, suggesting that such information is not uniformly performance, with PER increasing sharply beyond this
distributed across layers. A particularly efective method point, reaching over 70% of PER at Layer 16. Decoded
for layer-wise interpretability is the logit lens [6]. Early phoneme sequences are aligned to the gold phonemic
exiting strategies are grounded in the observation that transcriptions using a phoneme-level alignment
algointermediate layers of deep neural models often sufice rithm based on SequenceMatcher. This allows us
for accurate predictions, allowing for more eficient com- to categorize each prediction as a correct match (hit),
putation and improved robustness [7, 8, 9]. More re- substitution, insertion, or deletion. Note that insertions
cently, this idea has been extended beyond eficiency: in are rarely observed in embedding-level decoding with
interpretability research, intermediate predictions have CTC models, as output units are selected frame-wise.
become a powerful tool for analyzing representational dy- Many deletion errors may instead reflect phoneme
namics. The logit lens approach [6], for example, projects mergers or coarticulation phenomena. To quantify
hidden states into output space to visualize how predic- the impact of layer removal on ASR performance, we
tions evolve across layers. Subsequent refinements [ 9, 10] compute the PER at each truncation level. In addition,
have made these projections more faithful by learning we track phoneme-level alignment patterns and analyze
layer-specific transformations, revealing how informa- the disappearance or emergence of specific error types
tion is incrementally constructed. While these methods as the number of removed layers increases.
have mostly been explored in the context of decoder-only
language models, some recent work has adapted them to 3.1. Dataset
speech systems. Langedijk et al. [2] extend the logit lens
to encoder-decoder architectures such as Whisper, while
Shim et al. [1] demonstrate that early-layer
representations in multilingual speech models may better capture
phonetic distinctions—particularly in under-represented
languages. In this work, we extend this line of research
by investigating why intermediate-layer decoding leads
to improved performance, and whether this strategy is
truly efective for low-resource languages. Rather than
using early exits purely for eficiency, we treat them as a
probing tool to examine how phoneme representations
emerge and evolve across layers in a multilingual speech
model.</p>
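      <p>As a concrete illustration of the logit-lens idea in the decoder-only setting, the following minimal sketch (our own, using GPT-2 as a stand-in rather than any model analyzed in this paper) projects each layer's hidden state through the final layer norm and the unembedding head to read off an intermediate next-token prediction:</p>
      <preformat><![CDATA[
# Logit-lens sketch for a decoder-only LM (illustrative; GPT-2 stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (n_layers + 1) tensors [batch, seq, hidden].
for i, h in enumerate(out.hidden_states):
    # Apply the final layer norm and unembedding to the last position,
    # turning each intermediate state into a next-token distribution.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    print(i, tok.decode(logits.argmax(dim=-1)))
]]></preformat>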
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We analyze the layer-wise phoneme decoding behavior of a pretrained multilingual ASR model, facebook/wav2vec2-xlsr-53-espeak-cv-ft [11], a wav2vec2-based model fine-tuned on phonemic transcriptions from the Common Voice dataset [12] using a CTC loss. The model has 25 transformer encoder layers stacked on top of a 7-layer convolutional feature encoder. To probe the phonetic content across layers, we apply a truncation-based decoding strategy: for each utterance, we progressively remove the top k transformer layers (where k ∈ {0, 1, ..., 5}) and perform greedy decoding on the logits computed from the last remaining layer. This is possible because all transformer layers share the same hidden dimension, allowing the model's final projection head to be applied to intermediate-layer outputs without architectural modification. As a result, we can decode phoneme sequences from any encoder layer using the same decoding pipeline. We limit the truncation to a maximum of 5 removed layers, as further reduction leads to a substantial degradation in performance, with PER increasing sharply beyond this point and exceeding 70% at Layer 16.</p>
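      <p>A minimal sketch of this truncation-based decoding is shown below. It relies on Wav2Vec2ForCTC exposing all encoder hidden states via output_hidden_states=True, so the CTC projection head can be reapplied to any layer; the audio file path is a placeholder:</p>
      <preformat><![CDATA[
# Truncation-based layer-wise greedy decoding (sketch; path is a placeholder).
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

name = "facebook/wav2vec2-xlsr-53-espeak-cv-ft"
processor = Wav2Vec2Processor.from_pretrained(name)
model = Wav2Vec2ForCTC.from_pretrained(name).eval()

speech, _ = librosa.load("utterance.wav", sr=16000)
inputs = processor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(inputs.input_values, output_hidden_states=True)

# out.hidden_states[-1] is the top encoder layer; indexing back from the
# end emulates removing the top k transformer layers (k = 0 .. 5).
for k in range(6):
    hidden = out.hidden_states[-1 - k]
    logits = model.lm_head(hidden)   # reuse the model's CTC projection head
    ids = logits.argmax(dim=-1)      # greedy, frame-wise decoding
    # batch_decode collapses repeated units and strips the CTC blank.
    print(f"{k} top layers removed:", processor.batch_decode(ids)[0])
]]></preformat>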
      <p>Decoded phoneme sequences are aligned to the gold phonemic transcriptions using a phoneme-level alignment algorithm based on SequenceMatcher. This allows us to categorize each prediction as a correct match (hit), substitution, insertion, or deletion. Note that insertions are rarely observed in embedding-level decoding with CTC models, as output units are selected frame-wise. Many deletion errors may instead reflect phoneme mergers or coarticulation phenomena. To quantify the impact of layer removal on ASR performance, we compute the PER at each truncation level. In addition, we track phoneme-level alignment patterns and analyze the disappearance or emergence of specific error types as the number of removed layers increases.</p>
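      <p>A sketch of this alignment and scoring step, assuming the reference and hypothesis are given as sequences of phoneme symbols (the helpers below are our own illustration, not a published implementation):</p>
      <preformat><![CDATA[
# Phoneme-level alignment and PER (sketch) using difflib.SequenceMatcher.
from difflib import SequenceMatcher

def align_phonemes(ref, hyp):
    """Count hits, substitutions, insertions, and deletions."""
    counts = {"hit": 0, "sub": 0, "ins": 0, "del": 0}
    sm = SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            counts["hit"] += i2 - i1
        elif tag == "replace":
            # Substitutions, plus any length mismatch as deletions/insertions.
            counts["sub"] += min(i2 - i1, j2 - j1)
            counts["del"] += max(0, (i2 - i1) - (j2 - j1))
            counts["ins"] += max(0, (j2 - j1) - (i2 - i1))
        elif tag == "delete":
            counts["del"] += i2 - i1
        elif tag == "insert":
            counts["ins"] += j2 - j1
    return counts

def per(ref, hyp):
    """Phoneme Error Rate in percent: (S + D + I) / N."""
    c = align_phonemes(ref, hyp)
    return 100 * (c["sub"] + c["del"] + c["ins"]) / max(len(ref), 1)
]]></preformat>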
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>The data used in this study consists of spontaneous speech recordings in Campidanese Sardinian, a variety spoken in the southern part of Sardinia. The recordings were collected during fieldwork as part of the DID project in the municipality of Sinnai. The dataset includes 48 short utterances produced by four native speakers (two female, two male), selected from longer recordings based on linguistic relevance and clarity. The mean duration of the utterances is approximately 4.06 seconds. All utterances were manually transcribed at the phonemic level by a trained phonetician who is also a native speaker of Campidanese. The resulting dataset provides a high-quality phonemic reference for evaluating model predictions in a low-resource, under-represented language context [13, 14, 15].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>As shown in Table 1, removing the top layers of the encoder leads to a consistent reduction in PER, with the best performance observed when two layers are removed. This result supports the hypothesis that intermediate transformer layers perform better also on unseen low-resourced languages.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Phoneme Error Rate (PER) for different truncation levels.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Layer</th><th>PER (%)</th></tr>
          </thead>
          <tbody>
            <tr><td>24</td><td>36.73</td></tr>
            <tr><td>23</td><td>36.50</td></tr>
            <tr><td>22</td><td>35.40</td></tr>
            <tr><td>21</td><td>38.92</td></tr>
            <tr><td>20</td><td>50.03</td></tr>
            <tr><td>19</td><td>66.07</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-4-1">
        <title>4.1. Global Trends and Error Type Evolution</title>
        <p>Figure 2 provides a global view of how the model's phoneme-level predictions evolve as top layers are removed. As expected, the number of correctly predicted phonemes (labeled as "hit") steadily decreases as more layers are removed. At the same time, deletion errors increase sharply, particularly from Layer 21 backward, eventually dominating the error profile at Layer 19. This shows that, when layers are removed, the model lacks informative representations and tends to prefer skipping a prediction rather than producing an incorrect one. In contrast, substitution errors remain relatively stable across Layers 24-22 and begin to decline slightly in deeper layers. This pattern suggests that intermediate layers may retain more accurate segment-level information, minimizing confusion between phonetically similar units. However, the sharp increase in deletions at lower layers should not be interpreted as a simple reclassification of previous substitutions. Instead, it indicates that the model is increasingly unable to resolve a segmental identity at all, perhaps especially for shorter or acoustically reduced segments. At deeper layers, the model may attempt to recover some of these missing elements by assigning them a plausible phonemic category, potentially relying more on contextual or phonotactic patterns than on local acoustic evidence. This supports a view of hierarchical processing, where early layers encode fine-grained phonetic detail, while later layers abstract away from it, integrating higher-level dependencies that can both resolve and distort the original signal. However, this notion of hierarchical abstraction is model-dependent and assumes a certain architectural behavior. Since we do not impose constraints on the model design, further work is needed to test whether this abstraction emerges consistently across architectures.</p>
        <p>To better understand these dynamics, we examine which phonemes are most frequently involved in deletion and substitution errors. As shown in Figure 1, vowel phonemes such as /i/, /u/, and /a/ are among the most frequently deleted and substituted segments—especially as the number of removed layers increases. Interestingly, these three vowels are the only ones that commonly appear in unstressed final position in Campidanese Sardinian. While the model is not explicitly aware of word boundaries, its predictions appear sensitive to acoustic cues associated with prosodic prominence. These vowels are more likely to be reduced in duration and formant clarity when unstressed, and the model's tendency to delete them may reflect a broader difficulty in segmenting low-prominence units—an effect we also observed in our previous analysis of stress and frequency in phoneme recognition [16]. Some vowel deletions may also be explained by the mismatch between phoneme duration and the convolutional receptive field of the model's encoder. Since input frames are processed with overlapping windows, short vowels may be underrepresented or merged, leading to systematic omissions during decoding. Most of the substitutions involve phonetically close phoneme pairs, differing by a single articulatory feature such as voicing, manner, or vowel height. For instance, one of the most frequent substitutions is /E/ → /e/, a mid-front vowel contrast distinguished primarily by height. Similarly, /O/ → /o/ reflects a rounded back vowel pair with a similar height difference. Another recurrent case is /G/ → /g/, where a velar fricative is replaced by a voiced plosive, suggesting the model struggles with fine-grained place and manner distinctions in lower layers. These patterns support the hypothesis that, while intermediate layers reduce substitution errors, the model's phonological representations remain coarse. Segment identity is preserved at a broad class level, but phonetic resolution weakens as contextual information is reduced. Overall, the observed substitution patterns are not random, but structured according to articulatory proximity, as further confirmed in Figure 1.</p>
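        <p>The layer-wise error profile summarized above can be tallied directly from the alignment counts; a sketch, reusing the align_phonemes helper from Section 3 and assuming decoded[k] maps each utterance to its hypothesis with the top k layers removed (names ours):</p>
        <preformat><![CDATA[
# Tally hits / substitutions / deletions / insertions per truncation level.
# Assumes: refs[utt] is the gold phoneme sequence and decoded[k][utt] the
# hypothesis with the top k layers removed (see the decoding sketch above).
from collections import Counter

def error_profile(refs, decoded, max_removed=5, top_layer=24):
    profile = {}
    for k in range(max_removed + 1):
        totals = Counter()
        for utt, ref in refs.items():
            totals.update(align_phonemes(ref, decoded[k][utt]))
        profile[top_layer - k] = dict(totals)  # keyed by layer index 24..19
    return profile

# e.g. {24: {'hit': ..., 'sub': ..., 'del': ..., 'ins': ...}, 23: {...}, ...}
]]></preformat>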
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Regressive Errors: When Hits Become Mistakes</title>
        <p>While final-layer predictions often improve overall accuracy, we also observe notable exceptions where the opposite occurs—cases in which the correct phoneme is already identified at an intermediate layer but becomes an error at the final layer. We refer to these as regressive errors: instances where a phoneme is correctly predicted (a hit) at Layer 22 or 23, but turns into a substitution or deletion at Layer 24. More generally, we define a regressive error as a case where a correct prediction (hit) at an intermediate layer ℓ is replaced by a substitution or deletion at a deeper layer ℓ + k (with k &gt; 0). In total, we identify 53 such regressions across the dataset: 39 cases of hit → substitution and 14 cases of hit → deletion. These regressions indicate that the full encoder may in some cases “overprocess” the input, replacing a correct low-level prediction with a less accurate one as more layers are added. Crucially, most regressions involve substitutions, suggesting that deeper layers may introduce abstractions that distort fine-grained segmental information—trading off phonetic precision for contextual generalization. This may reflect a dual mechanism: (a) the re-integration of previously deleted segments, particularly those corresponding to short or hard-to-classify frames, and (b) the remapping of rare or marked phonemes onto broader, more frequent categories. In this sense, earlier layers (e.g., Layer 19) may in fact produce transcriptions that are more faithful to the phonetic input, while later layers enforce higher-level regularities at the cost of segmental detail. This challenges a common assumption: that improved overall error rates necessarily reflect more accurate linguistic representations. Instead, our findings suggest that intermediate layers may better preserve phoneme identity in certain cases, while the final layer smooths over or collapses distinctions that are phonologically relevant.</p>
        <p>To better understand the nature of these regressions, we analyze which phonemes are most frequently affected. Among the 53 cases, the high back rounded vowel /u/ is the most common (13 instances), followed by the alveolar approximant /r/ (7 instances), and others such as /n/, /i/, and /a/. Notably, many of the regressive substitutions involving /u/ involve replacement with acoustically similar vowels like /o/ or /U/ in the final layer—a pattern aligned with known vowel confusions in Sardinian phonology [17].</p>
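        <p>Operationally, regressions can be detected by marking each reference position with its alignment status at a given layer and comparing layers; a sketch (helper names ours, example strings from Table 3):</p>
        <preformat><![CDATA[
# Detect regressive errors: reference positions that are hits at an
# intermediate layer but substitutions or deletions at the final layer.
from difflib import SequenceMatcher

def ref_status(ref, hyp):
    """Per reference position: 'hit', 'sub', or 'del'."""
    status = ["del"] * len(ref)   # untouched positions stay deletions
    sm = SequenceMatcher(None, ref, hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        for i in range(i1, i2):
            if tag == "equal":
                status[i] = "hit"
            elif tag == "replace":
                status[i] = "sub"
    return status

def regressions(ref, hyp_mid, hyp_final):
    s_mid = ref_status(ref, hyp_mid)
    s_fin = ref_status(ref, hyp_final)
    return [(i, ref[i], s_fin[i]) for i in range(len(ref))
            if s_mid[i] == "hit" and s_fin[i] in ("sub", "del")]

ref = list("ensudwamillaundiZi")                      # reference (Table 3)
print(regressions(ref, list("e:ntsutamla:u5tiSi"),    # Layer 22 hypothesis
                  list("iE5ntsu:tVmla:u5Nti:S")))     # Layer 24 hypothesis
]]></preformat>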
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Utterances with Largest PER Reduction</title>
        <p>To explore whether layer truncation improves phoneme decoding in a linguistically meaningful way, we identify the five utterances that show the greatest PER reduction between Layer 24 and Layer 22 (Table 5). A qualitative inspection reveals that intermediate-layer outputs more closely approximate the reference transcriptions—not only in terms of segmental identity but also in overall sequence structure. While final-layer predictions sometimes exhibit phoneme insertions or reduplications that inflate the hypothesis length, the intermediate outputs tend to be more balanced and structurally coherent. This observation suggests that improvements in PER at intermediate layers are not merely an artifact of shorter sequences, but reflect more accurate segmental parsing and alignment. Rather than underpredicting, these layers appear to produce hypotheses that better capture the linguistic and prosodic shape of the input, avoiding overgeneration without compromising coverage. These improvements are quantitatively confirmed in Table 2, where PER consistently decreases when decoding from Layer 22 compared to the full model. The most dramatic case is 03_F_extract_01, with a 50% relative reduction in PER, followed by 30_F_extract_04, which improves by nearly 28 absolute percentage points. In both cases, the intermediate-layer output avoids spurious insertions and better aligns with the prosodic structure of the utterance. Even for more moderate improvements (e.g., 46_M_extract_04 and 29_M_extract_03), we observe a shift toward more plausible segmental structures and reduced redundancy. These findings reinforce the idea that intermediate representations strike a favorable balance between acoustic faithfulness and contextual abstraction—preserving enough low-level detail to make accurate segmental decisions while avoiding the overgeneralization seen in later layers.</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>PER (%) when decoding from the full model (Layer 24) and from Layer 22 for the five utterances with the largest PER reduction.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Audio File</th><th>Layer 24</th><th>Layer 22</th></tr>
            </thead>
            <tbody>
              <tr><td>03_F_extract_01</td><td>14.29</td><td>7.14</td></tr>
              <tr><td>30_F_extract_04</td><td>83.33</td><td>55.56</td></tr>
              <tr><td>46_M_extract_04</td><td>44.74</td><td>36.84</td></tr>
              <tr><td>30_F_extract_02</td><td>46.15</td><td>38.46</td></tr>
              <tr><td>29_M_extract_03</td><td>40.00</td><td>33.33</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>As illustrated in Table 3, the final-layer output includes several critical errors: an initial vowel /i/ (in red) that does not appear in the reference, and an incorrect final segment /S/ (also in red) that replaces the true voiced fricative /Z/. Interestingly, at Layer 22, the model predicts a more plausible onset sequence /e:ntsu/ (in blue), which is closer to the expected /ensu/, suggesting a better alignment with the reference. Additionally, the final segment /i/ is still present at both Layers 22 and 23, but is ultimately deleted at Layer 24. This suggests that the full model may over-generalize phonetic detail, leading to the omission of segments that were correctly predicted in earlier layers. The evidence supports our broader claim: improvements in PER at intermediate layers are not merely a side-effect of overgeneralization, but reflect a more faithful alignment to the input acoustics. In this case, Layer 22 preserves both the segmental identity and the sequence structure more reliably than the full encoder.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption>
            <p>Layer-wise phoneme predictions for utterance 30_F_extract_04.</p>
          </caption>
          <table>
            <tbody>
              <tr><td>Layer 24</td><td>iE5ntsu:tVmla:u5Nti:S</td></tr>
              <tr><td>Layer 23</td><td>iE5ntsu:tamla:u5Nti:Si:</td></tr>
              <tr><td>Layer 22</td><td>e:ntsutamla:u5tiSi</td></tr>
              <tr><td>Reference</td><td>ensudwamillaundiZi</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>A similar phenomenon is observed in Table 4, where the utterance 03_F_extract_01 demonstrates how the final layer introduces segmental distortions not present in earlier representations. At Layer 22, the model produces a concise and well-aligned output that accurately captures the alveolar flap /4/ ([ɾ]) and avoids inserting extraneous phonetic material. Notably, the vowel preceding /4/ is realized as a short /e/ in the prediction from Layer 22, closely matching the reference transcription. In contrast, Layers 23 and 24 both produce an elongated /e:/ vowel. While this lengthening is not annotated in the reference, a manual inspection of the spectrogram reveals that the vowel is indeed phonetically long (approximately 297 ms), possibly due to prosodic or pragmatic factors. This suggests that vowel duration is a feature that only emerges at higher layers, where the model integrates broader contextual information. Rather than being an error, the elongation may reflect the model's sensitivity to prosodic prominence, which is not explicitly captured in the phonemic gold standard but is present in the acoustic signal. In this case, then, the intermediate layer offers a segmentally accurate representation aligned with the reference, while the deeper layers introduce prosodically informed variation. This highlights how different layers may prioritize different levels of linguistic abstraction, with earlier layers preserving phonemic detail and later ones encoding broader discourse or prosodic cues.</p>
        <table-wrap id="tab4">
          <label>Table 4</label>
          <caption>
            <p>Layer-wise phoneme predictions for utterance 03_F_extract_01.</p>
          </caption>
          <table>
            <tbody>
              <tr><td>Layer 24</td><td>ekambjadame:4a</td></tr>
              <tr><td>Layer 23</td><td>ekambjadame:4a</td></tr>
              <tr><td>Layer 22</td><td>ekambjadame4a</td></tr>
              <tr><td>Reference</td><td>eekambjadame4a</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <table-wrap id="tab5">
          <label>Table 5</label>
          <caption>
            <p>Layer-wise SAMPA predictions and reference for the utterances with the largest PER improvement.</p>
          </caption>
        </table-wrap>
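        <p>Table 2 can be reproduced by scoring each utterance at both layers and ranking by the drop; a short sketch reusing the per helper from Section 3 (names ours):</p>
        <preformat><![CDATA[
# Rank utterances by PER reduction between the full model (Layer 24)
# and the best truncated configuration (Layer 22); assumes refs/decoded
# as in the earlier sketches.
def top_improvements(refs, decoded, n=5):
    rows = []
    for utt, ref in refs.items():
        p24 = per(ref, decoded[0][utt])   # no layers removed
        p22 = per(ref, decoded[2][utt])   # two top layers removed
        rows.append((utt, p24, p22, p24 - p22))
    rows.sort(key=lambda r: r[3], reverse=True)
    return rows[:n]
]]></preformat>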
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Our findings challenge a widespread assumption in speech modeling: that improvements in error metrics like PER necessarily reflect more accurate or linguistically meaningful predictions. While intermediate layers of the Wav2Vec2 model often yield lower PER, a closer analysis reveals that this improvement is not uniformly distributed across all phoneme classes or error types. This relates to an ongoing open question in speech modeling: why do higher layers often decrease WER while increasing PER? The answer may lie in how deeper layers prioritize lexical or orthographic consistency over phonetic detail, leading to better word-level predictions at the cost of segmental precision. We observe that intermediate layers (particularly Layer 22) reduce overgeneration and avoid certain errors—such as spurious insertions or phoneme duplications—that become more frequent at deeper layers. In several cases, these intermediate predictions better align with the gold transcription both in structure and content, despite being produced with less contextual depth. Interestingly, we also identify cases of regressive errors, where correct predictions made at intermediate layers are degraded at the final layer. These typically involve deletions or substitutions of phonemes like /u/ and /E/, often replaced with acoustically similar segments. This suggests that deeper layers may generalize over segmental contrasts.</p>
      <p>Taken together, these results indicate that error metrics like PER or CER, while useful at a high level, may obscure critical model behaviors. Intermediate representations may contain more faithful segmental information than the final output layer, particularly in under-represented or low-resource language settings. The fact that intermediate layers retain phoneme-level precision while later layers smooth over distinctions aligns with a view of hierarchical abstraction in neural models. From a phonological perspective, this might suggest that neural encoders learn generalizable phonemic categories early on and gradually shift toward context-dependent or prosodically conditioned outputs. Future work could explore whether this abstraction follows typologically consistent patterns across languages.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>This study explored the use of layer truncation as a probing strategy for understanding phoneme-level decoding behavior in a multilingual ASR model. By applying a Logit Lens-style analysis to Wav2Vec2, we show that intermediate layers can outperform the final layer in terms of Phoneme Error Rate—particularly for a low-resource language like Sardinian. Beyond aggregate improvements, our fine-grained error analysis reveals two key insights: (1) intermediate predictions tend to avoid certain types of phonological errors, and (2) in some cases, deeper layers actually degrade performance by transforming previously correct phonemes into errors. These findings suggest that the final output of a model may not always be the most linguistically faithful, especially in scenarios involving limited training data or typologically divergent phonemes. We argue that future work on speech recognition in low-resource settings should move beyond traditional evaluation metrics and incorporate layer-wise analysis as a standard interpretability tool. Doing so can provide deeper insight into how models represent phonological information—and where they fail.</p>
      <p>Future work. While our analysis focused on Campidanese Sardinian, applying this strategy across typologically diverse low-resource languages would help determine whether the benefits of intermediate-layer decoding generalize. Additionally, attention dynamics across layers may provide further insight into which representations are retained, distorted, or lost as contextual depth increases. While the model is optimized for phoneme transcription, it is not trained on force-aligned phoneme segmentation. Future work could investigate whether fine-tuning on time-aligned phoneme labels or segmentation tasks improves final-layer predictions and reduces regressive errors. It would also be valuable to replicate this analysis on a language that was part of the model's pretraining or fine-tuning data (e.g., English) to assess whether intermediate-layer advantages persist even in high-resource settings.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <sec id="sec-5-1">
        <title>Funded by the European Social Fund Plus Project code</title>
        <p>ESF2_f3_0003 “Excellence Scholarships for PhD students
on topics of strategic relevance for South Tyrol”. Work
funded by the New Perspectives on Diphthong Dynamics
(DID) project I83C22000390005.
tation tasks improves final-layer predictions and reduces
regressive errors. It would also be valuable to replicate
this analysis on a language that was part of the model’s
pretraining or fine-tuning data (e.g., English) to assess
whether intermediate layer advantages persist even in
high-resource settings.
Audio File
Layer-wise SAMPA predictions and reference for utterances
with the largest PER improvement.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) and DeepL
Write /
DeepL Translate in order to: Drafting content, Text translation, Paraphrase and reword, and
Improve writing style. After using these tool(s)/service(s), the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>