=Paper=
{{Paper
|id=Vol-3361/paper6
|storemode=property
|title=Improved Dialect Recognition by Adaptation to a Single Speaker
|pdfUrl=https://ceur-ws.org/Vol-3361/paper6.pdf
|volume=Vol-3361
|authors=Manuel Vogel,Guido Kniesel,Alberto Calatroni,Andrew Paice
|dblpUrl=https://dblp.org/rec/conf/swisstext/VogelKCP22
}}
==Improved Dialect Recognition by Adaptation to a Single Speaker==
Manuel Vogel, Guido Kniesel, Alberto Calatroni and Andrew Paice
iHomeLab Think Tank and Research Centre for Building Intelligence, Lucerne University of Applied Sciences and Arts (HSLU), Technikumstrasse 21, CH-6048 Horw, Switzerland
manuel.vogel@hslu.ch (M. Vogel); guido.kniesel@hslu.ch (G. Kniesel); alberto.calatroni@hslu.ch (A. Calatroni); andrew.paice@hslu.ch (A. Paice)

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract

Voice assistants that understand dialects would especially help elderly people. Automatic Speech Recognition (ASR) performs poorly on dialects due to the lack of sizeable datasets. We propose three adaptation strategies that improve an ASR model trained for German so that it understands Swiss German spoken by a target speaker, using as little as 1.5 hours of speaker data. Our best result was a word error rate (WER) of 0.27 for one individual.

1. Introduction

Automatic Speech Recognition (ASR) refers to the task of converting an audio signal into its written transcription and finds application, among others, in voice assistants. ASR performs well on so-called well-resourced languages (i.e., languages with abundant labeled data corpora for training machine learning algorithms), while results on dialects, specifically Swiss German, are poorer. This is particularly inconvenient for the acceptance of applications involving smart assistants for elderly people, for whom it can be a big nuisance to switch to Standard German. ASR for Swiss German is challenging for several reasons:

1. Swiss German has no standardized written form and Standard German is the output of choice, meaning that the system must perform speech translation (ST) rather than mere recognition. For example, the German expression «wollen wir» can be pronounced and written in several different variants in Swiss German, e.g. «wömmer», «wemmer», «wemmr» or «wämmer».
2. Swiss German dialects are diverse and not geographically well confined. Creating regional models would therefore be challenging.
3. The publicly available Swiss German datasets are few and small compared to the corpora for other languages. Training on thousands of hours of data to account for variability is not possible.

The contribution of this work is an exploration of how person-specific data can be used to tailor known models towards better performance for a specific individual, instead of trying to learn several different dialects with a single model. We investigated different training and fine-tuning approaches and assessed the performance of a pre-trained model adapted to a single speaker.

2. Related Work

In recent studies, the application of end-to-end ASR models (from raw audio to words) based on deep neural networks has shown a considerable performance boost. To achieve good results, a considerable amount of training data is needed [1]. In the case of Swiss German, there is a lack of sufficient data, variability and appropriate ground truth. An exception is the recently published Swiss Parliaments Corpus (SPC), which we use in our work [2].

The two most noteworthy end-to-end architectures are Conformer [3] and wav2vec2 [4]. The latter achieved the best WER to date in German ASR (WER 0.057). Therefore, we chose wav2vec2 as our starting point.

ASR systems for low-resource dialects have lately attracted some attention [5, 6, 7, 8, 9]. For Swiss German, we find the work of Plüss et al. [2], who report a WER of 0.289 using a Conformer model on the SPC dataset. Other researchers combined the SPC with a proprietary dataset to train various ST systems and achieved a WER of 0.5 when using only the SPC dataset [10]. A further approach achieved a WER of 0.39 on the SPC by training a model on a German dataset, transfer learning to the SPC enhanced with a proprietary internal dataset, and refining the classification with a re-scoring step [11]. Our evaluation yields results similar to Plüss et al. [2], even if a direct comparison is not possible, and gives interesting insights into different fine-tuning strategies.

3. Materials and Methods

Here we describe the baseline model, the datasets and the adaptation approaches.

3.1. Model

We base our work on a pre-trained wav2vec2 model (wav2vec2-large-xlsr-53-german) available from the HuggingFace (https://huggingface.co/) AI community. We denote this model as baseline. The model topology consists of convolutional layers which map the raw audio to latent quantized speech representations, and a Transformer structure which maps these to context representations. The first pre-training stage involves self-supervised learning and therefore does not need labeled data [4]. For the baseline model, pre-training is done with multilingual data (53 languages) to learn language-independent speech units, followed by supervised training with German data, since Swiss German has strong similarities with German.

3.2. Dataset

In our experiments we used the Swiss Parliaments Corpus (SPC), a Swiss German dataset that contains recordings and transcriptions of the cantonal parliament of Bern (Grosser Rat Kanton Bern) [2]. It contains 293 hours of audio by 198 speakers and represents the biggest Swiss German speech recognition dataset to date. We use a subset of the SPC containing only samples with a high alignment between text and audio (Intersection over Union (IoU) > 0.9, the train_0.9 split). The audio files contain mostly Swiss German speech, whereas the labels (transcriptions) are in Standard German. We chose the SPC mainly because of its size: in comparison to existing Swiss German datasets, such as ArchiMob [12], the SPC has more audio data, which allows us to experiment with various sizes of the single-speaker datasets. In addition, we recorded a new small dataset from a speaker unrelated to the SPC, which allows us to test our approaches in another context. It is based on utterances of the Voxforge speech corpora (http://www.voxforge.org/de/downloads). We call this the external speaker (shortened: «ext»).

3.2.1. Dataset Partitioning

From the SPC dataset we created convenient partitions for the experiments. In the original corpus we identified the five speakers with the largest amount of data: these are the speakers with IDs 82, 145, 177, 186 and 207. Together with our own small external dataset, this yields six datasets, which we refer to as Single Speaker Corpora (SSC). For the approaches which involve a training step with multiple speakers, we extracted a subset of the SPC which excludes the speakers identified above (SPC-without-top5). Among the single speakers, the one with the least amount of data has around 1.5 hours of audio; therefore, for a fair comparison, we limited all six SSCs to 1.5 hours.

3.3. Approaches to Single Speaker Adaptation

Our goal is to adapt the baseline model to perform ASR satisfactorily with a single target speaker. We propose three approaches, which we evaluated on six SSCs:

1. Supervised training of the baseline model on the SPC, excluding all target SSCs.
2. Fine-tuning the baseline model with data from the target SSC.
3. Combining the two previous approaches by training the baseline model on the SPC excluding all target SSCs, then fine-tuning with the target SSC.

In addition, we also evaluate the baseline model on all target SSCs. The three approaches are visualized in Figure 1.

4. Results

We conducted several experiments in line with the approaches described in Section 3.3 and evaluated the models against held-out test data, reporting the word error rate (WER). We show the results in Table 1. The best approach for adapting to a single speaker is the last one, i.e., to first train the model on Swiss German data from several speakers and then on the corresponding target speaker dataset. Interestingly, fine-tuning the baseline model only with the target single speaker datasets gives worse results than training a model on multiple speakers (SPC-without-top5). However, it is important to note that the SSCs contain only 1.5 hours of data, whereas SPC-without-top5 contains around 176 hours.

The individual improvement of the adaptation depends on the speaker: it varies between 1% and 4% on the five SPC speakers and reaches a notable 14% on the external speaker. Speaker 82 has the highest WER when evaluated with the base model but the lowest WER when fine-tuning with multiple Swiss German speakers and/or the single speaker dataset of speaker 82. In contrast, speaker 207 has the lowest WER when evaluated with the base model, but the highest WER under the other three approaches. The reasons for this behaviour could not be fully determined; further investigations are future work.

Figure 1: Visualization of the three adaptation approaches. SPC-without-top5 denotes the SPC without all five target speakers and X denotes a speaker ID.
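The partitioning described in Section 3.2.1 can be sketched in a few lines. This is only an illustration, not the authors' code: the representation of samples as (speaker_id, duration) pairs, the greedy way of enforcing the 1.5-hour cap, and the handling of the «ext» speaker are our assumptions.

```python
from collections import defaultdict

TOP5 = {"82", "145", "177", "186", "207"}  # SPC speakers with the most data
SSC_LIMIT_S = 1.5 * 3600                   # every SSC is capped at 1.5 hours

def partition(samples):
    """Split (speaker_id, duration_s) pairs into the multi-speaker training
    pool (SPC-without-top5) and per-speaker SSCs capped at 1.5 h of audio."""
    spc_without_top5 = []
    ssc = defaultdict(list)
    used_s = defaultdict(float)  # audio already assigned to each SSC speaker
    for speaker, dur in samples:
        if speaker in TOP5 or speaker == "ext":
            # Greedily fill the speaker's SSC until the 1.5-hour cap is hit.
            if used_s[speaker] + dur <= SSC_LIMIT_S:
                ssc[speaker].append((speaker, dur))
                used_s[speaker] += dur
        else:
            # All remaining speakers form the multi-speaker training pool.
            spc_without_top5.append((speaker, dur))
    return spc_without_top5, dict(ssc)
```

Any sample of a top-5 speaker that would push that speaker's corpus past 1.5 hours is simply dropped, so all six SSCs stay comparable in size.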
Approach   Fine-Tuning   82     186    207    177    145    ext
Baseline   -             0.90   0.89   0.82   0.89   0.86   0.82
1          SSC           0.56   0.67   0.60   0.64   0.60   0.50
2          SPC           0.31   0.42   0.44   0.35   0.38   0.44
3          SPC+SSC       0.27   0.39   0.41   0.34   0.36   0.30

Table 1: Word error rates (WER) for different speakers and training strategies. The numeric column headers are the SPC speaker IDs and ext denotes the external speaker. «SSC» and «SPC» stand for Single Speaker Corpus and Swiss Parliaments Corpus respectively. The first row shows the results of the pre-trained model.

4.1. Influence of Training Data Amount

Increasing the amount of data for single speaker training has not led to a significant reduction of the WER. When training the model resulting from the second approach with six hours of audio from speaker 82 instead of only 1.5 hours, the WER decreases by only 2%. An identical improvement is observed when training with four hours of speaker 207 instead of 1.5 hours. Decreasing the time used for single speaker training does increase the WER: when training with a third of the external single speaker dataset, the WER increases by 4%, and when training with a sixth, it increases by 5%.

4.2. Impact of Multi-Speaker Fine-Tuning

A remarkable result is the impact of Swiss German fine-tuning before the single speaker adaptation. When the baseline model is trained on the full SSC of speaker 82 (six hours), it achieves a WER of only 0.44. In comparison, training the model first with SPC-without-top5 and then fine-tuning with 1.5 hours of speaker 82 achieves a WER of 0.27 on the same test set, an improvement of 17%. Training the baseline model on SPC-without-top5 and then on only 0.25 hours of data from the external speaker still performs better than both the model trained only on SPC-without-top5 and the model trained only with 1.5 hours of data from the external speaker.

4.3. Limitations

One limitation is the prevalence of one specific dialect (Bernese) in the SPC. Furthermore, the SPC was recorded in a parliament and therefore has a certain bias in terms of content. The results can also be influenced by the combination of the chosen metric and the ground truth.
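The metric itself makes this interplay concrete: WER is the word-level edit distance (substitutions + deletions + insertions) between hypothesis and reference, normalized by the reference length. A minimal self-contained implementation, our illustration rather than the authors' evaluation code:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of word-level edit distances.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Because WER is normalized by the reference length, a semantically correct paraphrase can be punished heavily: against the one-word label «septembersession», the prediction «session vom september» counts as one substitution plus two insertions, so wer("septembersession", "session vom september") evaluates to 3.0.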
For instance, if the audio contains the phrase «session vom september» and the label is «septembersession», the WER increases if the model predicts the former phrase, even though the two options are semantically identical. In addition, Swiss German does not have a past simple tense. Consequently, if the label is written in the past simple, there is a significant difference between the structure of the spoken sentence and the ground truth.

5. Conclusion

We presented three possible strategies to adapt a pre-trained ASR model based on wav2vec2 to enhance recognition for a single Swiss-German-speaking individual. The best strategy appears to be training a baseline model with multiple Swiss German speakers and, in a second phase, fine-tuning with a small amount of data from the target speaker. With this strategy, the WER for the six speakers ranges between 0.27 and 0.41. The improvements of each approach on the external speaker and the five SPC speakers are similar.

6. Outlook

Our adaptation approaches were tested on only one kind of model. It would be interesting to extend the evaluation to different models and examine whether the behaviour and results are similar. Furthermore, the evaluation sample size of n=6 is not representative considering the diversity of Swiss German. An evaluation containing more speakers will allow more solid claims.

References

[1] D. Jurafsky, J. H. Martin, Speech and Language Processing (3rd Edition Draft), USA, 2021. URL: https://web.stanford.edu/~jurafsky/slp3/ed3book_jan122022.pdf
[2] M. Plüss, L. Neukom, C. Scheller, M. Vogel, Swiss Parliaments Corpus, an automatically aligned Swiss German speech to Standard German text corpus, 2021. URL: http://arxiv.org/abs/2010.02810. arXiv:2010.02810.
[3] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, R. Pang, Conformer: Convolution-augmented Transformer for speech recognition, 2020. URL: https://arxiv.org/abs/2005.08100. doi:10.48550/ARXIV.2005.08100.
[4] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020. URL: http://arxiv.org/abs/2006.11477. arXiv:2006.11477.
[5] J. Sun, G. Zhou, H. Yang, M. Wang, End-to-end Tibetan Ando dialect speech recognition based on hybrid CTC/attention architecture, in: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 628–632. doi:10.1109/APSIPAASC47483.2019.9023130.
[6] R. Imaizumi, R. Masumura, S. Shiota, H. Kiya, Dialect-aware modeling for end-to-end Japanese dialect speech recognition, in: 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2020, pp. 297–301.
[7] Y. Zhao, J. Yue, X. Xu, L. Wu, X. Li, End-to-end-based Tibetan multitask speech recognition, IEEE Access 7 (2019) 162519–162529. doi:10.1109/ACCESS.2019.2952406.
[8] Y. Zhang, M. Ablimit, A. Hamdulla, Error correction based on Transformer LM in Uyghur speech recognition, in: 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), 2021, pp. 204–207. doi:10.1109/PRML52754.2021.9520740.
[9] L. Pan, S. Li, L. Wang, J. Dang, Effective training of end-to-end ASR systems for the low-resource Lhasa dialect of Tibetan, in: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2019, pp. 1152–1156. doi:10.1109/APSIPAASC47483.2019.9023100.
[10] A. Khosravani, P. N. Garner, A. Lazaridis, Learning to translate low-resourced Swiss German dialectal speech into Standard German text, in: 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 817–823. doi:10.1109/ASRU51503.2021.9688249.
[11] Y. Arabskyy, A. Agarwal, S. Dey, O. Koller, Dialectal speech recognition and translation of Swiss German speech to Standard German text: Microsoft's submission to SwissText 2021, 2021. URL: https://arxiv.org/abs/2106.08126. doi:10.48550/ARXIV.2106.08126.
[12] T. Samardžić, Y. Scherrer, E. Glaser, ArchiMob - a corpus of spoken Swiss German, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), European Language Resources Association (ELRA), 2016, pp. 4061–4066. URL: https://aclanthology.org/L16-1641.