=Paper=
{{Paper
|id=Vol-3361/paper6
|storemode=property
|title=Improved Dialect Recognition by Adaptation to a Single Speaker
|pdfUrl=https://ceur-ws.org/Vol-3361/paper6.pdf
|volume=Vol-3361
|authors=Manuel Vogel,Guido Kniesel,Alberto Calatroni,Andrew Paice
|dblpUrl=https://dblp.org/rec/conf/swisstext/VogelKCP22
}}
==Improved Dialect Recognition by Adaptation to a Single Speaker==
Manuel Vogel, Guido Kniesel, Alberto Calatroni and Andrew Paice
iHomeLab Think Tank and Research Centre for Building Intelligence, Lucerne University of Applied Sciences and Arts (HSLU), Technikumstrasse 21, CH-6048 Horw, Switzerland
manuel.vogel@hslu.ch (M. Vogel); guido.kniesel@hslu.ch (G. Kniesel); alberto.calatroni@hslu.ch (A. Calatroni); andrew.paice@hslu.ch (A. Paice)

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract

Voice assistants that understand dialects would especially help elderly people. Automatic Speech Recognition (ASR) performs poorly on dialects due to the lack of sizeable datasets. We propose three adaptation strategies that improve an ASR model trained for German so that it understands Swiss German spoken by a target speaker, using as little as 1.5 hours of speaker data. Our best result was a word error rate (WER) of 0.27 for one individual.

1. Introduction

Automatic Speech Recognition (ASR) refers to the task of converting an audio signal into its written transcription and finds application, among others, in voice assistants. ASR performs well on so-called well-resourced languages (i.e., languages with abundant labeled data corpora for training machine learning algorithms), while results on dialects, specifically Swiss German, are poorer. This is particularly inconvenient for the acceptance of applications involving smart assistants for elderly people, for whom it can be a big nuisance to switch to Standard German. ASR for Swiss German is challenging for several reasons:

1. Swiss German has no standardized written form and Standard German is the output of choice, meaning that the system must perform speech translation (ST) rather than mere recognition. For example, the German expression «wollen wir» can be pronounced and written in several different variants in Swiss German, e.g. «wömmer», «wemmer», «wemmr» or «wämmer».
2. Swiss German dialects are diverse and not geographically well confined. Creating regional models would therefore be challenging.
3. The publicly available Swiss German datasets are few and small compared to the corpora for other languages. Training on thousands of hours of data to account for variability is not possible.

The contribution of this work is an exploration of how person-specific data can be used to tailor known models towards better performance for a specific individual, instead of trying to learn several different dialects with a single model. We investigated different training and fine-tuning approaches and assessed the performance of a pre-trained model adapted to a single speaker.

2. Related Work

In recent studies, the application of end-to-end ASR models (from raw audio to words) based on deep neural networks has shown a considerable performance boost. To achieve good results, a considerable amount of training data is needed [1]. In the case of Swiss German, there is a lack of sufficient data, variability and appropriate ground truth. An exception is the recently published Swiss Parliaments Corpus (SPC), which we use in our work [2].

The two most noteworthy end-to-end architectures are Conformer [3] and wav2vec2 [4]. The latter achieved the best WER to date in German ASR (WER 0.057). Therefore, we chose wav2vec2 as our starting point.

ASR systems for low-resource dialects have lately attracted some attention [5, 6, 7, 8, 9]. For Swiss German, we find the work of Plüss et al. [2], who report a WER of 0.289 using a Conformer model on the SPC dataset. Other researchers combined the SPC with a proprietary dataset to train various ST systems and achieved a WER of 0.5 when using only the SPC dataset [10]. A further approach achieved a WER of 0.39 on the SPC by training a model on a German dataset, transfer learning to the SPC enhanced with a proprietary internal dataset, and refining the classification with a re-scoring step [11]. Our evaluation yields results similar to Plüss et al. [2], even if a direct comparison is not possible, and gives interesting insights into different fine-tuning strategies.

3. Materials and Methods

Here we describe the baseline model, the datasets and the adaptation approaches.

3.1. Model

We base our work on a pre-trained wav2vec2 model (wav2vec2-large-xlsr-53-german) available from the HuggingFace (https://huggingface.co/) AI community. We denote this model as baseline. The model topology consists of convolutional layers which map the raw audio to latent quantized speech representations, and a Transformer structure which maps these to context representations. The first pre-training stage involves self-supervised learning and therefore does not need labeled data [4]. For the baseline model, pre-training is done with multilingual data (53 languages) to learn language-independent speech units, followed by supervised training with German data, since Swiss German has strong similarities with German.

3.2. Dataset

In our experiments we used the Swiss Parliaments Corpus (SPC), a Swiss German dataset that contains recordings and transcriptions of the cantonal parliament of Bern (Grosser Rat Kanton Bern) [2]. It contains 293 hours of audio by 198 speakers and represents the biggest Swiss German speech recognition dataset to date. We use a subset of the SPC containing only samples with a high alignment between text and audio (Intersection over Union (IoU) > 0.9, the train_0.9 split). The audio files contain mostly Swiss German speech, whereas the labels (transcriptions) are in Standard German. We chose the SPC mainly because of its size: in comparison to existing Swiss German datasets, such as ArchiMob [12], the SPC has more audio data, which allows us to experiment with various sizes of the single-speaker datasets. In addition, we recorded a new small dataset from a speaker unrelated to the SPC, which allows us to test our approaches in another context. It is based on utterances of the Voxforge speech corpora (http://www.voxforge.org/de/downloads). We call this the external speaker (shortened: «ext»).

3.2.1. Dataset Partitioning

From the SPC dataset we created convenient partitions for the experiments. In the original corpus we identified the five speakers with the largest amount of data: these are the speakers with IDs 82, 145, 177, 186 and 207. Together with our own small external dataset, this yields six datasets, which we refer to as Single Speaker Corpora (SSC). For the approaches which involve a training step with multiple speakers, we extracted a subset of the SPC which excludes the speakers identified above (SPC-without-top5). Among the single speakers, the one with the least amount of data has around 1.5 hours of audio; therefore, for a fair comparison, we limited all six SSCs to 1.5 hours.

3.3. Approaches to Single Speaker Adaptation

Our goal is to adapt the baseline model to perform ASR satisfactorily with a single target speaker. We propose three approaches, which we evaluated on six SSCs:

1. Supervised training of the baseline model on the SPC, excluding all target SSCs.
2. Fine-tuning the baseline model with data from the target SSC.
3. Combining the two previous approaches by training the baseline model on the SPC excluding all target SSCs, then fine-tuning with the target SSC.

In addition, we also evaluate the baseline model on all target SSCs. The three approaches are visualized in Figure 1.

4. Results

We conducted several experiments in line with the approaches described in Section 3.3 and evaluated the models against held-out test data, reporting the word error rate (WER). We show the results in Table 1. The best approach for adapting to a single speaker is the last one, i.e., to first train the model on Swiss German data from several speakers and then on the corresponding target speaker dataset. Interestingly, fine-tuning the baseline model only with the target single speaker datasets gives worse results than training a model on multiple speakers (SPC-without-top5). However, it is important to note that the SSCs contain only 1.5 hours of data, whereas SPC-without-top5 contains around 176 hours.

The individual improvement of the adaptation depends on the speaker: it varies between 1% and 4% on the five SPC speakers and reaches a notable 14% on the external speaker. Speaker 82 has the highest WER when evaluated with the base model but the lowest WER when fine-tuning with multiple Swiss German speakers and/or the single speaker dataset of speaker 82. In contrast, speaker 207 has the lowest WER when evaluated with the base model, but the highest WER under the other three approaches. The reasons for this behaviour could not be fully determined; further investigations are future work.

Figure 1: Visualization of the three adaptation approaches. SPC-without-top5 denotes the SPC without all five target speakers and X denotes a speaker ID.
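The partitioning described in Section 3.2.1 can be sketched in a few lines. This is only an illustration, not the authors' code: the representation of samples as (speaker_id, duration) pairs, the greedy way of enforcing the 1.5-hour cap, and the handling of the «ext» speaker are our assumptions.

```python
from collections import defaultdict

TOP5 = {"82", "145", "177", "186", "207"}  # SPC speakers with the most data
SSC_LIMIT_S = 1.5 * 3600                   # every SSC is capped at 1.5 hours

def partition(samples):
    """Split (speaker_id, duration_s) pairs into the multi-speaker training
    pool (SPC-without-top5) and per-speaker SSCs capped at 1.5 h of audio."""
    spc_without_top5 = []
    ssc = defaultdict(list)
    used_s = defaultdict(float)  # audio already assigned to each SSC speaker
    for speaker, dur in samples:
        if speaker in TOP5 or speaker == "ext":
            # Greedily fill the speaker's SSC until the 1.5-hour cap is hit.
            if used_s[speaker] + dur <= SSC_LIMIT_S:
                ssc[speaker].append((speaker, dur))
                used_s[speaker] += dur
        else:
            # All remaining speakers form the multi-speaker training pool.
            spc_without_top5.append((speaker, dur))
    return spc_without_top5, dict(ssc)
```

Any sample of a top-5 speaker that would push that speaker's corpus past 1.5 hours is simply dropped, so all six SSCs stay comparable in size.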
Approach   Fine-Tuning   82     186    207    177    145    ext
Baseline   -             0.90   0.89   0.82   0.89   0.86   0.82
1          SSC           0.56   0.67   0.60   0.64   0.60   0.50
2          SPC           0.31   0.42   0.44   0.35   0.38   0.44
3          SPC+SSC       0.27   0.39   0.41   0.34   0.36   0.30

Table 1: Word error rates (WER) for different speakers and training strategies. The numeric column headers are the SPC speaker IDs and ext denotes the external speaker. «SSC» and «SPC» stand for Single Speaker Corpus and Swiss Parliaments Corpus respectively. The first row shows the results of the pre-trained model.

4.1. Influence of Training Data Amount

Increasing the amount of data for single speaker training has not led to a significant reduction of the WER. When training the model resulting from the second approach with six hours of audio from speaker 82 instead of only 1.5 hours, the WER decreases by only 2%. An identical improvement is observed when training with four hours of speaker 207 instead of 1.5 hours. Decreasing the time used for single speaker training does increase the WER: when training with a third of the external single speaker dataset, the WER increases by 4%, and when training with a sixth, it increases by 5%.

4.2. Impact of Multi-Speaker Fine-Tuning

A remarkable result is the impact of Swiss German fine-tuning before the single speaker adaptation. When the baseline model is trained on the full SSC of speaker 82 (six hours), it achieves a WER of only 0.44. In comparison, training the model first with SPC-without-top5 and then fine-tuning with 1.5 hours of speaker 82 achieves a WER of 0.27 on the same test set, an improvement of 17%. Training the baseline model on SPC-without-top5 and then on only 0.25 hours of data from the external speaker still performs better than both the model trained only on SPC-without-top5 and the model trained only with 1.5 hours of data from the external speaker.

4.3. Limitations

One limitation is the prevalence of one specific dialect (Bernese) in the SPC. Furthermore, the SPC was recorded in a parliament and therefore has a certain bias in terms of content. The results can also be influenced by the combination of the chosen metric and the ground truth.
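The metric itself makes this interplay concrete: WER is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, normalized by the reference length. A minimal self-contained implementation, our illustration rather than the authors' evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of word-level edit distances.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Because WER is normalized by the reference length, a semantically correct paraphrase can be punished heavily: against the one-word label «septembersession», the prediction «session vom september» counts as one substitution plus two insertions, so wer("septembersession", "session vom september") evaluates to 3.0.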
For instance, if the audio contains the phrase «session vom september» and the label is «septembersession», the WER increases if the model predicts the former phrase, even though the two options are semantically identical. In addition, Swiss German does not have a past simple tense. Consequently, if the label is written in the past simple, there is a significant difference between the structure of the spoken sentence and the ground truth.

5. Conclusion

We presented three possible strategies to adapt a pre-trained ASR model based on wav2vec2 to enhance recognition for a single Swiss-German-speaking individual. The best strategy appears to be training a baseline model with multiple Swiss German speakers and, in a second phase, fine-tuning with a small amount of data from the target speaker. With this strategy, the WER for the six speakers ranges between 0.27 and 0.41. The improvements of each approach on the external speaker and the five SPC speakers are similar.

6. Outlook

Our adaptation approaches were tested on only one kind of model. It would be interesting to extend the evaluation to different models and examine whether the behaviour and results are similar. Furthermore, the evaluation sample size of n=6 is not representative considering the diversity of Swiss German. An evaluation containing more speakers will allow more solid claims.

References

[1] D. Jurafsky, J. H. Martin, Speech and Language Processing (3rd Edition Draft), USA, 2021. URL: https://web.stanford.edu/~jurafsky/slp3/ed3book_jan122022.pdf
[2] M. Plüss, L. Neukom, C. Scheller, M. Vogel, Swiss Parliaments Corpus, an automatically aligned Swiss German speech to Standard German text corpus, 2021. URL: http://arxiv.org/abs/2010.02810. arXiv:2010.02810.
[3] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, R. Pang, Conformer: Convolution-augmented Transformer for speech recognition, 2020. URL: https://arxiv.org/abs/2005.08100. doi:10.48550/ARXIV.2005.08100.
[4] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020. URL: http://arxiv.org/abs/2006.11477. arXiv:2006.11477.
[5] J. Sun, G. Zhou, H. Yang, M. Wang, End-to-end Tibetan Ando dialect speech recognition based on hybrid CTC/attention architecture, in: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 628–632. doi:10.1109/APSIPAASC47483.2019.9023130.
[6] R. Imaizumi, R. Masumura, S. Shiota, H. Kiya, Dialect-aware modeling for end-to-end Japanese dialect speech recognition, in: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020, pp. 297–301.
[7] Y. Zhao, J. Yue, X. Xu, L. Wu, X. Li, End-to-end-based Tibetan multitask speech recognition, IEEE Access 7 (2019) 162519–162529. doi:10.1109/ACCESS.2019.2952406.
[8] Y. Zhang, M. Ablimit, A. Hamdulla, Error correction based on Transformer LM in Uyghur speech recognition, in: 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), 2021, pp. 204–207. doi:10.1109/PRML52754.2021.9520740.
[9] L. Pan, S. Li, L. Wang, J. Dang, Effective training of end-to-end ASR systems for the low-resource Lhasa dialect of Tibetan, in: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 1152–1156. doi:10.1109/APSIPAASC47483.2019.9023100.
[10] A. Khosravani, P. N. Garner, A. Lazaridis, Learning to translate low-resourced Swiss German dialectal speech into Standard German text, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 817–823. doi:10.1109/ASRU51503.2021.9688249.
[11] Y. Arabskyy, A. Agarwal, S. Dey, O. Koller, Dialectal speech recognition and translation of Swiss German speech to Standard German text: Microsoft's submission to SwissText 2021, 2021. URL: https://arxiv.org/abs/2106.08126. doi:10.48550/ARXIV.2106.08126.
[12] T. Samardžić, Y. Scherrer, E. Glaser, ArchiMob - a corpus of spoken Swiss German, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), European Language Resources Association (ELRA), 2016, pp. 4061–4066. URL: https://aclanthology.org/L16-1641.