<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft&apos;s Submission to SwissText 2021</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Yuriy</forename><surname>Arabskyy</surname></persName>
							<email>yuarabsk@microsoft.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Microsoft, Munich</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Aashish</forename><surname>Agarwal</surname></persName>
							<email>t-aagarwal@microsoft.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Microsoft, Munich</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Subhadeep</forename><surname>Dey</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Microsoft, Munich</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Oscar</forename><surname>Koller</surname></persName>
							<email>oskoller@microsoft.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Microsoft, Munich</orgName>
								<address>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft&apos;s Submission to SwissText 2021</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">ECD3B3F0A0FF326D84565AEF426DB27F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T19:36+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper describes the winning approach in Shared Task 3 at SwissText 2021 on Swiss German Speech to Standard German Text, a public competition on dialect recognition and translation. Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland. Swiss German differs significantly from standard German in pronunciation, word inventory and grammar. It is mostly incomprehensible to native German speakers. Moreover, it lacks a standardized written script. To solve the challenging task, we propose a hybrid automatic speech recognition system with a lexicon that incorporates translations, a 1st pass language model that deals with Swiss German particularities, a transfer-learned acoustic model and a strong neural language model for 2nd pass rescoring. Our submission reaches 46.04% BLEU on a blind conversational test set and outperforms the second-best competitor by a 12% relative margin.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>While general speech recognition has matured to a point where it surpasses human performance on specific datasets <ref type="bibr" target="#b21">(Xiong et al., 2016)</ref>, dialectal recognition as in the case of Swiss German <ref type="bibr" target="#b15">(Nigmatulina et al., 2020)</ref> or Arabic dialects <ref type="bibr" target="#b1">(Ali et al., 2021;</ref><ref type="bibr" target="#b11">Hussein et al., 2021;</ref><ref type="bibr" target="#b2">Ali, 2018)</ref> still represents a major challenge. Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland. It hence comprises dialects that differ significantly from standard (or high) German in pronunciation, word inventory and grammar. Moreover, it lacks a standardized writing system. High German is used for the large majority of written communication and in the media of German-speaking Switzerland, while in informal chats and text messages Swiss people may write in a non-standardized transliteration of their dialect. Transcribing Swiss German into high German text therefore requires speech recognition with an inherent translation step. Moreover, the task can be considered low-resource, as available data remains extremely scarce.</p><p>In previous studies, Garner et al. <ref type="bibr" target="#b8">(2014)</ref> tackled this challenge by training hybrid models (HMM-GMM, HMM-DNN, and KL-HMM) to transcribe Walliserdeutsch, a Swiss German dialect spoken in the south-western alpine canton of Switzerland, and further used a phrase-based machine translation model to translate it to standard German. Following this, other researchers explored techniques to add the translation step in the lexicon by directly mapping Swiss German pronunciation to standard German. 
<ref type="bibr" target="#b19">Stadtschnitzer and Schmidt (2018)</ref> estimated Swiss German pronunciations from a standard German speech recognition model using a data-driven technique, and trained stronger TDNN-LSTM based acoustic models. <ref type="bibr" target="#b12">Kew et al. (2020)</ref> and <ref type="bibr" target="#b15">Nigmatulina et al. (2020)</ref> trained transformer-based G2P models from standard German to Swiss pronunciations and trained a Kaldi-based TDNN+ivector system using the WSJ recipe<ref type="foot" target="#foot_0">1</ref>. A third approach is to directly apply end-to-end deep learning models. <ref type="bibr" target="#b5">Büchi et al. (2020)</ref> and <ref type="bibr" target="#b0">Agarwal and Zesch (2020)</ref> at SwissText 2020 used the Jasper architecture <ref type="bibr" target="#b14">(Li et al., 2019)</ref> and Mozilla DeepSpeech <ref type="bibr" target="#b9">(Hannun et al., 2014)</ref>, respectively. In both cases the system was first trained on high German data and then transfer-learned to Swiss German.</p><p>In this paper, we describe our approach to the challenging task of transcribing Swiss German speech to standard German text. It won the SwissText 2021 competition by a large margin over the other competing systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">System Overview</head><p>In this section, we present our changes to a conventional hybrid <ref type="bibr" target="#b4">(Bourlard and Dupont, 1996)</ref> automatic speech recognition (ASR) system, which relies on a lexicon and alignments for good performance, and describe the details that enable it to perform dialectal speech recognition and translation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Data</head><p>To train our proposed model, we utilized a selection of publicly available and internal datasets. Our starting point was the Swiss Parliament Corpus V2 dataset <ref type="bibr" target="#b17">(Plüss et al., 2020)</ref> shared as part of the SwissText 2021 competition. It covers 293 hours and contains recordings from the cantonal parliament of Bern. Its transcripts are in standard German while the audio covers Swiss German (predominantly the Bernese dialect). The dataset has been preprocessed by the publishers to clean its annotations and ensure a good match between audio content and transcription. It is provided with a choice of different preprocessing flavors; we used the train_all split. In addition, we used a 493-hour internal dataset representing a media domain, encompassing conversational speech from interviews, discussions, podcasts and other sources. A subset (around 50 hours) of the data is annotated with both Swiss transliterations and standard German; the remaining data has only been annotated with standard German. Additionally, we used an internal high German dataset encompassing around 10k hours to pre-train our model.</p><p>In terms of test data, the SwissText 2021 competition was accompanied by a 13-hour conversational test set covering Swiss German speakers from all German-speaking parts of Switzerland. The encountered dialectal distribution is claimed to closely match the real distribution in Switzerland. The set was not disclosed to the participants. Hence, for the analysis in this paper, we report our numbers on a publicly available test set that is part of the dataset from the Bernese parliament <ref type="bibr" target="#b17">(Plüss et al., 2020)</ref>. It comprises 6 hours of dialectal speech.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Lexicon</head><p>We propose to incorporate the translation from Swiss German to standard German as part of the lexicon. However, this leads to a complex and often ambiguous mapping between grapheme and phoneme sequences, which is very different from languages with a direct relation between writing scheme and pronunciation (e.g. English or standard German). Consequently, statistical models that map graphemes to phonemes (G2P), when trained on Swiss German data incorporating such translations, yield much noisier output with significantly higher phone error rates than G2Ps for standard languages. To mitigate this problem, we construct the lexicon in several stages.</p><p>In a first step, we make use of parallel corpora encompassing Swiss and standard German annotations to extract word mappings between Swiss and standard German. Sophisticated filtering methods help to ensure a high quality of these mappings. We opt for frequency filtering and filtering based on vicinity in a word embedding space <ref type="bibr" target="#b3">(Bojanowski et al., 2017)</ref> of Swiss German words, taking the most frequent mapping as center point.</p><p>In a second step, a standard German G2P model is applied to convert Swiss German transliterations into corresponding phone sequences. This results in a dictionary that maps standard German words to Swiss pronunciations. Jointly with existing Swiss German lexicon resources <ref type="bibr" target="#b18">(Schmidt et al., 2020)</ref>, the previously generated mappings are then used to train a dedicated Swiss German G2P model.</p><p>We evaluate the quality of the resulting G2P model on a manually labeled test set. It covers mappings from standard German words to Swiss German phone sequences and encompasses a variety of relevant categories such as diminution, shortening or translation. 
Refer to Table <ref type="table" target="#tab_0">1</ref> for samples of the assessed categories.</p><p>The Swiss G2P model allows us to find suitable pronunciations for the relevant word inventory present in the acoustic and language model training corpora. However, to further increase the quality of the given pronunciations, data-driven lexicon learning techniques <ref type="bibr" target="#b22">(Zhang et al., 2017)</ref> are applied. These help to identify and correct noisy lexicon entries.</p></div>
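The mapping extraction and filtering described above (frequency filtering plus vicinity in a word embedding space around the most frequent variant) can be sketched as follows. This is a toy illustration, not the actual pipeline: the word pairs, the 2-dimensional embeddings and the thresholds are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def filter_mappings(pairs, embeddings, min_count=2, min_sim=0.5):
    # Keep Swiss->standard word mappings that occur often enough and whose
    # Swiss-side embedding lies close to the most frequent Swiss variant
    # of the same standard German word (the "center point").
    by_standard = {}
    for swiss, std in pairs:
        by_standard.setdefault(std, Counter())[swiss] += 1
    kept = {}
    for std, counts in by_standard.items():
        center_word = counts.most_common(1)[0][0]
        center = embeddings[center_word]
        kept[std] = [s for s, c in counts.items()
                     if c >= min_count and cosine(embeddings[s], center) >= min_sim]
    return kept

# Toy parallel annotations: (Swiss transliteration, standard German word).
pairs = [("chopf", "kopf"), ("chopf", "kopf"),
         ("grind", "kopf"), ("grind", "kopf"), ("xyz", "kopf")]
# Hypothetical 2-d embeddings, for illustration only.
emb = {"chopf": [1.0, 0.1], "grind": [0.9, 0.3], "xyz": [-1.0, 0.0]}
print(filter_mappings(pairs, emb))  # 'xyz' is dropped: rare and distant
```

The surviving mappings would then feed the G2P training step described above.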
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Language Model</head><p>Incorporating the translation from Swiss German to standard German as part of the lexicon introduces significant ambiguity in the decoding process. To counteract this, we use a strong standard German language model (LM) which helps to produce accurate hypotheses. We employ a first-pass count-based LM to output up to 100 sentence hypotheses and a second-pass neural LSTM (long short-term memory) LM <ref type="bibr" target="#b20">(Sundermeyer et al., 2012)</ref> for rescoring <ref type="bibr" target="#b7">(Deoras et al., 2011)</ref>. The first-pass model is a 5-gram LM trained on large amounts of standard German text corpora totalling over 100 billion words. We apply Kneser-Ney smoothing <ref type="bibr" target="#b13">(Kneser and Ney, 1995)</ref>.</p><p>Furthermore, we make some adjustments to better deal with Swiss German particularities, as described in the following paragraphs.</p><p>Compounds: German is a compounding language and tends to compose words (particularly nouns) of several smaller subwords. The resulting chains of word stems can lead to an unbounded vocabulary with words that occur very infrequently throughout the corpus. This spreading of probability mass weakens the LM. We hence decompound all compounded words in the training corpus and split them into subwords.</p><p>Clitics: Swiss German tends to merge words beyond compounding, not preserving word stems <ref type="bibr" target="#b10">(Hollenstein and Aepli, 2014)</ref>. For instance, the Swiss German 'hemmer' is the translation of 'haben wir' in standard German (English: 'have we'). We identified approximately 8000 clitics in our corpus. We incorporate them in the decoding process by updating the lexicon and LM. Following the example above, the translated clitic 'haben#wir' with the corresponding Swiss pronunciation is added to the lexicon. 
As for the LM, we merge occurrences of relevant word pairs and interpolate with the unmerged LM.</p></div>
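The clitic handling on the LM side can be sketched as below. The merging of 'haben wir' into 'haben#wir' follows the example in the text; the function itself is a simplified illustration of how LM training text could be rewritten before estimation, and the single-pair clitic list is an assumption (the real system identified roughly 8000 such pairs).

```python
def merge_clitics(text, clitic_pairs):
    # Rewrite known standard German word pairs that surface as clitics in
    # Swiss German into single merged tokens, e.g. 'haben wir' -> 'haben#wir'
    # (spoken as the Swiss German clitic 'hemmer').
    merged = set(clitic_pairs)
    words = text.split()
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in merged:
            out.append(words[i] + "#" + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)

clitics = [("haben", "wir")]
print(merge_clitics("das haben wir gesehen", clitics))  # das haben#wir gesehen
```

An LM estimated on the rewritten text can then be interpolated with the LM estimated on the original text, as described above.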
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Acoustic model</head><p>The acoustic model is trained with 80-dimensional log-mel filterbank features, computed with a 25ms processing window and a 10ms frame shift. The feature vector from the previous frame is concatenated with the current frame to obtain a 160-dimensional vector. We used an LC-BLSTM (latency-controlled bidirectional long short-term memory) based acoustic model, which is widely used in speech recognition to limit decoding latency to a few frames <ref type="bibr" target="#b6">(Chen and Huo, 2016)</ref>. The model was trained with alignments from a feed-forward network with context-dependent tied states (senones). The model has ∼9k senone units. The LC-BLSTM has 6 hidden layers with 512 units each. The hidden vectors from the forward and backward passes are concatenated and then projected to a 512-dimensional vector. The model is trained with a cross-entropy loss function. For training, the decoding lexicon is extended with Swiss German words. The transliterations are used during forced alignment whenever possible. This helps to reduce the pronunciation ambiguity in the alignment phase and is especially helpful in the early training stages, when no strong model is available for alignment.</p><p>The results are reported in terms of BLEU <ref type="bibr" target="#b16">(Papineni et al., 2002)</ref> and word error rate (WER) on the Swiss Parliament test set described in Section 2.1.</p></div>
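The input stacking step (concatenating each frame with its predecessor to form 160-dimensional vectors) can be sketched as follows; how the very first frame is padded is our assumption, as the text does not specify it.

```python
def stack_frames(features):
    # Concatenate each 80-dim log-mel frame with its predecessor: 80 + 80 = 160.
    # For the first frame we duplicate the frame itself (assumed padding choice).
    stacked = []
    for t, frame in enumerate(features):
        prev = features[t - 1] if t > 0 else frame
        stacked.append(prev + frame)  # list concatenation
    return stacked

feats = [[0.1] * 80, [0.2] * 80, [0.3] * 80]  # three dummy 80-dim frames
out = stack_frames(feats)
print(len(out), len(out[0]))  # 3 160
```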
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Results and Discussions</head><p>An ablation study of the proposed approaches is presented in Table <ref type="table" target="#tab_1">2</ref>. All of the performance gains in this section are reported as relative percentage improvements, while the table contains absolute numbers.</p><p>We first evaluate the effect of transfer learning on the results with the Swiss Parliament training set. It can be observed that it significantly helps to improve both WER and BLEU. In particular, the transfer-learned model (row 2, Table <ref type="table" target="#tab_1">2</ref>) improves over the model trained from scratch (row 1, Table <ref type="table" target="#tab_1">2</ref>), suggesting that a well-trained German model can effectively boost the limited resources of Swiss German. Further adding internal training data shows additional gains in performance: the WER improves by 2.3% and BLEU by 2.5%.</p><p>Finally, 2nd pass rescoring is applied as described in Section 2.3 to reorder the top 100 hypotheses. It can be observed from row 4, Table <ref type="table" target="#tab_1">2</ref> that rescoring helps to improve the performance by 5.8% WER and 8.9% BLEU.</p><p>Our submission to SwissText 2021 achieves 46.04% BLEU on the official SwissText blind test set. This corresponds to a 12% relative margin in BLEU over the second-best competitor, which reached 40.99%.</p><p>The acoustic models have been trained using 8 GPUs for 25 epochs. This results in a total training time of around 400 GPU-hours when training on Swiss Parliament only and about 1200 GPU-hours when adding the internal data.</p></div>
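Since the gains above are quoted as relative improvements while the table holds absolute numbers, the conversion can be reproduced from the blind-test BLEU scores given in the text:

```python
def relative_improvement(new, baseline):
    # Relative gain of `new` over `baseline`, in percent.
    return 100.0 * (new - baseline) / baseline

# BLEU of this submission vs. the second-best SwissText 2021 entry.
margin = relative_improvement(46.04, 40.99)
print(f"{margin:.1f}% relative BLEU margin")  # 12.3% relative BLEU margin
```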
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusion and Future Work</head><p>In this paper, we described a speech recognition system that achieves strong results on the task of recognizing Swiss German dialect and translating it into standard German text. We proposed a hybrid ASR system with a lexicon that incorporates translations, a 1st pass language model that deals with Swiss German word compounding and clitics, an acoustic model that is transfer-learned from standard German resources and a strong neural language model for 2nd pass rescoring to smooth translation artifacts. Furthermore, we provided an ablation study that allows us to infer the effect of adding training data, performing transfer learning and 2nd pass rescoring. Our submission reached 46.04% BLEU on a challenging conversational test set and outperformed all competing approaches by a large margin.</p><p>In terms of future work, we would like to investigate word re-orderings as part of the translation, which our current model does not actively support. For instance, Swiss German frequently moves verbs in relative clauses to positions that differ from the standard German word order. Furthermore, sequence discriminative training is a promising route for exploration, as is the use of unsupervised data for acoustic model training.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Example words and pronunciations from each G2P test condition</figDesc><table><row><cell cols="2">2nd person plural 2nd person sing</cell><cell>Diminution</cell><cell>Shortening</cell><cell cols="2">Translation Variability</cell></row><row><cell>fragt</cell><cell>fragst</cell><cell>erdmännchen</cell><cell>gymnasium</cell><cell>kopf</cell><cell>kannst</cell></row><row><cell>f hr a_ g ax t</cell><cell>f hr a_ k sh</cell><cell>e_r t m eh n l i_</cell><cell>g ih m i_</cell><cell>g hr ih n t</cell><cell>k a sh</cell></row><row><cell>riecht</cell><cell>riechst</cell><cell>gläschen</cell><cell>schwimmbad</cell><cell>kneipe</cell><cell>zweites</cell></row><row><cell>sh m oe k c ax t</cell><cell>sh m oe k sh</cell><cell>g l e_ s l i_</cell><cell>b a_ d ih</cell><cell>b ai ts</cell><cell>ts v ai t</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Performance in [%] of different system configurations evaluated on the Swiss Parliament test set.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://github.com/kaldi-asr/kaldi/ tree/master/egs/wsj</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Aashish</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Torsten</forename><surname>Zesch</surname></persName>
		</author>
		<title level="m">LTL-UDE at low-resource speech-to-text shared task: Investigating Mozilla DeepSpeech in a low-resource setting</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Connecting Arabs: Bridging the gap in dialectal speech recognition</title>
		<author>
			<persName><forename type="first">Ahmed</forename><surname>Ali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shammur</forename><surname>Chowdhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohamed</forename><surname>Afify</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wassim</forename><surname>El-Hajj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hazem</forename><surname>Hajj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mourad</forename><surname>Abbas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Amir</forename><surname>Hussein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nada</forename><surname>Ghneim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mohammad</forename><surname>Abushariah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Assal</forename><surname>Alqudah</surname></persName>
		</author>
		<idno type="DOI">10.1145/3451150</idno>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">64</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="124" to="129" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Multi-Dialect Arabic Broadcast Speech Recognition</title>
		<author>
			<persName><forename type="first">Ahmed</forename><forename type="middle">Mohamed Abdel Maksoud</forename><surname>Ali</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
			<pubPlace>Edinburgh, UK</pubPlace>
		</imprint>
		<respStmt>
			<orgName>University of Edinburgh</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Ph.D. thesis</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Enriching word vectors with subword information</title>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Bojanowski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edouard</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Armand</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomas</forename><surname>Mikolov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="135" to="146" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A new ASR approach based on independent processing and recombination of partial frequency bands</title>
		<author>
			<persName><forename type="first">Hervé</forename><surname>Bourlard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stéphane</forename><surname>Dupont</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Int. Conf. on Spoken Language Processing (ICSLP)</title>
				<meeting>Int. Conf. on Spoken Language Processing (ICSLP)</meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="426" to="429" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">Matthias</forename><surname>Büchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Malgorzata</forename><forename type="middle">Anna</forename><surname>Ulasik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manuela</forename><surname>Hürlimann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fernando</forename><surname>Benites</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pius</forename><surname>Von Däniken</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mark</forename><surname>Cieliebak</surname></persName>
		</author>
		<title level="m">ZHAW-InIT at GermEval 2020 task 4: Low-resource speech-to-text</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Training deep bidirectional LSTM acoustic model for LVCSR by a context-sensitive-chunk BPTT approach</title>
		<author>
			<persName><forename type="first">Kai</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qiang</forename><surname>Huo</surname></persName>
		</author>
		<idno type="DOI">10.1109/TASLP.2016.2539499</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="1185" to="1193" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A Fast Re-scoring Strategy to Capture Long-Distance Dependencies</title>
		<author>
			<persName><forename type="first">Anoop</forename><surname>Deoras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tomáš</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kenneth</forename><surname>Church</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2011 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Edinburgh, Scotland, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="1116" to="1127" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Automatic Speech Recognition and Translation of a Swiss German Dialect: Walliserdeutsch</title>
		<author>
			<persName><forename type="first">Philip</forename><forename type="middle">N</forename><surname>Garner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Imseng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Meyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech)</title>
				<meeting>of the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech)<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Deep Speech: Scaling up end-to-end speech recognition</title>
		<author>
			<persName><forename type="first">Awni</forename><surname>Hannun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carl</forename><surname>Case</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jared</forename><surname>Casper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bryan</forename><surname>Catanzaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Greg</forename><surname>Diamos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Erich</forename><surname>Elsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ryan</forename><surname>Prenger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjeev</forename><surname>Satheesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shubho</forename><surname>Sengupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Adam</forename><surname>Coates</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.5567</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Compilation of a Swiss German dialect corpus and its application to PoS tagging</title>
		<author>
			<persName><forename type="first">Nora</forename><surname>Hollenstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noëmi</forename><surname>Aepli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects</title>
				<meeting>the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="85" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">Amir</forename><surname>Hussein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shinji</forename><surname>Watanabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ahmed</forename><surname>Ali</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2101.08454</idno>
		<title level="m">Arabic Speech Recognition by End-to-End, Modular Systems and Human</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note>cs, eess</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">UZH TILT: A Kaldi recipe for Swiss German speech to standard German text</title>
		<author>
			<persName><forename type="first">Tannon</forename><surname>Kew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Iuliia</forename><surname>Nigmatulina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lorenz</forename><surname>Nagele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tanja</forename><surname>Samardzic</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Improved backing-off for m-gram language modeling</title>
		<author>
			<persName><forename type="first">Reinhard</forename><surname>Kneser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hermann</forename><surname>Ney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<meeting>IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)</meeting>
		<imprint>
			<date type="published" when="1995">1995</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="181" to="184" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">Jason</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vitaly</forename><surname>Lavrukhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Boris</forename><surname>Ginsburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ryan</forename><surname>Leary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oleksii</forename><surname>Kuchaiev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jonathan</forename><forename type="middle">M</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Huyen</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ravi</forename><forename type="middle">Teja</forename><surname>Gadde</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1904.03288</idno>
		<title level="m">Jasper: An end-to-end convolutional neural acoustic model</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">ASR for non-standardised languages with dialectal variation: the case of Swiss German</title>
		<author>
			<persName><forename type="first">Iuliia</forename><surname>Nigmatulina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tannon</forename><surname>Kew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tanja</forename><surname>Samardzic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects</title>
				<meeting>the 7th Workshop on NLP for Similar Languages, Varieties and Dialects<address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="15" to="24" />
		</imprint>
	</monogr>
	<note>International Committee on Computational Linguistics (ICCL)</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">Kishore</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Salim</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Todd</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei-Jing</forename><surname>Zhu</surname></persName>
		</author>
		<idno type="DOI">10.3115/1073083.1073135</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 40th Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">Michel</forename><surname>Plüss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lukas</forename><surname>Neukom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Manfred</forename><surname>Vogel</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2010.02810</idno>
		<title level="m">Swiss Parliaments Corpus, an automatically aligned Swiss German speech to Standard German text corpus</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">A Swiss German Dictionary: Variation in Speech and Writing</title>
		<author>
			<persName><forename type="first">Larissa</forename><surname>Schmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lucy</forename><surname>Linder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sandra</forename><surname>Djambazovska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Alexandros</forename><surname>Lazaridis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tanja</forename><surname>Samardžić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claudiu</forename><surname>Musat</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.00139</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Adaptation and training of a swiss german speech recognition system using data-driven pronunciation modelling</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Stadtschnitzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christoph</forename><surname>Schmidt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of DAGA-44</title>
				<meeting>DAGA-44<address><addrLine>München, Germany</addrLine></address></meeting>
		<imprint>
			<publisher>Jahrestagung für Akustik</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">LSTM neural networks for language modeling</title>
		<author>
			<persName><forename type="first">Martin</forename><surname>Sundermeyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ralf</forename><surname>Schlüter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hermann</forename><surname>Ney</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech)</title>
				<meeting>the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech)<address><addrLine>Portland, OR, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="194" to="197" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Achieving human parity in conversational speech recognition</title>
		<author>
			<persName><forename type="first">Wayne</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jasha</forename><surname>Droppo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xuedong</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Frank</forename><surname>Seide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mike</forename><surname>Seltzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andreas</forename><surname>Stolcke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dong</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Zweig</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1610.05256</idno>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Acoustic Data-Driven Lexicon Learning Based on a Greedy Pronunciation Selection Framework</title>
		<author>
			<persName><forename type="first">Xiaohui</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vimal</forename><surname>Manohar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Povey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjeev</forename><surname>Khudanpur</surname></persName>
		</author>
		<idno type="DOI">10.21437/Interspeech.2017-588</idno>
	</analytic>
	<monogr>
		<title level="m">Proc. of the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech)</title>
				<meeting>the Ann. Conf. of the Int. Speech Commun. Assoc. (Interspeech)</meeting>
		<imprint>
			<publisher>ISCA</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2541" to="2545" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
