=Paper=
{{Paper
|id=Vol-2086/AICS2017_paper_14
|storemode=property
|title=Pinyin as Subword Unit for Chinese-Sourced Neural Machine Translation
|pdfUrl=https://ceur-ws.org/Vol-2086/AICS2017_paper_14.pdf
|volume=Vol-2086
|authors=Jinhua Du,Andy Way
|dblpUrl=https://dblp.org/rec/conf/aics/DuW17
}}
==Pinyin as Subword Unit for Chinese-Sourced Neural Machine Translation==
Pinyin as Subword Unit for Chinese-Sourced Neural Machine Translation

Jinhua Du†‡, Andy Way†
† ADAPT Centre, School of Computing, Dublin City University, Ireland
‡ Accenture Labs, Dublin, Ireland
{jinhua.du, andy.way}@adaptcentre.ie

Abstract. The unknown word (UNK), or open-vocabulary, problem is a challenging one for neural machine translation (NMT). For alphabetic languages such as English, German and French, transforming a word into subwords, e.g. with the Byte Pair Encoding (BPE) algorithm, is an effective way to alleviate the UNK problem. However, for stroke-based languages such as Chinese, this method alone is not effective enough to improve translation quality. In this paper, we propose to utilise Pinyin, a romanization system for Chinese characters, to convert Chinese characters into subword units and thereby alleviate the UNK problem. We first investigate how Pinyin and its four diacritics denoting tones affect the translation performance of NMT systems, and then propose different strategies for utilising Pinyin and tones as input factors for Chinese–English NMT. Extensive experiments conducted on Chinese–English translation demonstrate that the proposed methods can remarkably improve translation quality and effectively alleviate the UNK problem for Chinese-sourced translation.

1 Introduction

In recent years, NMT has made impressive progress [8, 3, 20, 1, 4]. The state-of-the-art NMT model employs an encoder–decoder architecture with an attention mechanism, in which the encoder summarizes the source sentence into a vector representation, the decoder produces the target string word by word from vector representations, and the attention mechanism learns the soft alignment of a target word against source words [1]. NMT systems have outperformed state-of-the-art SMT models on various language pairs in terms of translation quality [13, 2, 7, 22, 21, 5].

The translation of rare words is an open problem not only for statistical machine translation (SMT), but also for NMT. Current NMT systems use a fixed vocabulary for the input and output sequences, and rare words in the data are denoted by a special "UNK" symbol, which makes the translation inaccurate and disfluent to some extent. The vocabulary of neural models is typically limited to 30,000–50,000 words, but translation is an open-vocabulary problem, especially for languages with productive word-formation processes, such as agglutination and compounding. In these cases, translation models require mechanisms that go below the word level [19].

Recent work has aimed at improving the generalisation capability of NMT for open vocabularies [14, 6, 11, 12, 19, 10]. For example, translation of out-of-vocabulary (OOV) words can be handled in a post-processing step as in SMT, i.e. keeping the OOVs in the hypothesis, and then using a bilingual dictionary to obtain translations of these OOVs. The deficiency of this back-off dictionary method is that it needs extra knowledge or resources to alleviate the OOV problem, which is not feasible for some low-resource languages or domains.

Byte pair encoding is an effective way to segment a word into subwords for alphabetic languages such as English, German and French, and it does not rely on external resources [19]. However, it is not as straightforward for stroke-based languages such as Chinese. For Chinese-sourced NMT systems, the word is often used as the basic unit in the input sequence.
However, compared to character-level and subword-level units, word-level units on a large-scale data set bring a data-sparsity problem in terms of rare words: many named entities, dates, times and numbers occur infrequently, resulting in a very large vocabulary. Accordingly, in NMT these infrequent words are represented by the "UNK" token. Intuitively, if we can transform Chinese characters into alphabetic compositions, then we can easily employ the BPE algorithm to convert Chinese words into subwords, which may alleviate the rare-word problem.

Chinese Pinyin, which literally means "spelled sounds", is the official romanization system for Standard Chinese in mainland China, Malaysia, Singapore and Taiwan.1 The system includes four diacritics denoting tones. Pinyin without tone marks is used to spell Chinese names and words in languages written with the Latin alphabet, and also in certain computer input methods to enter Chinese characters. An example of Chinese characters with their corresponding Pinyin and English translations is shown below.2

Pinyin:    mā      má      mǎ      mà      ma
Character: 妈      麻      马      骂      吗
Tone:      First   Second  Third   Fourth  Neutral
English:   mother  hemp    horse   scold   question particle

In this example, we can see that in the "Pinyin" row the letters are the same but the tones are different, which indicates that the pronunciation of each Chinese character in the "Character" row is different. Tones are essential for the correct pronunciation of Mandarin syllables.

Normally, the tone mark is placed over the letter that represents the syllable nucleus, except for the neutral tone. The tones are explained as follows:

– The first tone (Flat or High-Level Tone) is represented by a macron (ˉ) added to the Pinyin vowel;
– The second tone (Rising or High-Rising Tone) is denoted by an acute accent (´);
– The third tone (Falling-Rising or Low Tone) is marked by a caron (ˇ);
– The fourth tone (Falling or High-Falling Tone) is represented by a grave accent (`);
– The fifth tone (Neutral Tone) is represented by a normal vowel without any accent mark.

1 https://en.wikipedia.org/wiki/Pinyin
2 The source of the example is: https://en.wikipedia.org/wiki/Pinyin

From this example we can see that Chinese characters and words can be converted into alphabetic forms using Pinyin, with or without tones, so the BPE algorithm can be applied as for alphabetic languages. In this paper, we explore different ways to utilize Pinyin as the subword-unit converter for Chinese-sourced NMT, namely character-level Pinyin without tones (ChPy), character-level Pinyin with tones (ChPyT), word-level Pinyin without tones (WdPy), and word-level Pinyin with tones (WdPyT). Furthermore, we propose to use Pinyin as an input factor for a standard word-level NMT system (fac. NMT), and to use tones as an input factor for the "WdPy" NMT system (fac. WdPy), respectively. Extensive experiments conducted on the Chinese→English NIST translation task show that 1) using Pinyin to replace Chinese characters/words can significantly reduce the vocabulary size, resulting in a significant decrease of UNK symbols in translations; and 2) WdPyT and factor-based Pinyin NMT systems can significantly improve translation quality compared to the standard word-level NMT system.

The main contributions of this work include:

– We extensively investigate different uses of Pinyin as subword units for Chinese-sourced NMT systems.
– We propose to integrate Pinyin or tones as input factors to augment NMT systems.
– We provide a qualitative analysis of the translation results.
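As an illustration of the four conversion strategies introduced above, the sketch below derives ChPy, ChPyT, WdPy and WdPyT representations from a pre-segmented sentence. The paper does not specify its conversion tooling, so the use of the open-source pypinyin library, the numeric tone suffixes (instead of the diacritics shown above), and the underscore joining of word-level syllables are all illustrative assumptions.

```python
# A minimal sketch of the four Pinyin conversion strategies (assumed
# implementation; the paper does not name its conversion tool).
# Requires: pip install pypinyin
from pypinyin import pinyin, Style

def to_pinyin_units(words, word_level, with_tones):
    """Convert pre-segmented Chinese words into Pinyin subword units."""
    # TONE3 appends the tone as a digit (ma3); NORMAL drops tones entirely.
    style = Style.TONE3 if with_tones else Style.NORMAL
    units = []
    for word in words:
        # pinyin() returns one [syllable] list per character of the word
        syllables = [s[0] for s in pinyin(word, style=style)]
        if word_level:
            units.append("_".join(syllables))  # one unit per word (WdPy/WdPyT)
        else:
            units.extend(syllables)            # one unit per character (ChPy/ChPyT)
    return units

if __name__ == "__main__":
    sentence = ["我们", "喜欢", "马"]  # "we", "like", "horse"
    print("ChPy  :", to_pinyin_units(sentence, word_level=False, with_tones=False))
    print("ChPyT :", to_pinyin_units(sentence, word_level=False, with_tones=True))
    print("WdPy  :", to_pinyin_units(sentence, word_level=True,  with_tones=False))
    print("WdPyT :", to_pinyin_units(sentence, word_level=True,  with_tones=True))
```

Any of these alphabetic outputs can then be fed to a standard BPE segmenter, exactly as for English, German or French.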
The rest of the paper is organised as follows. In Section 2, work related to the open-vocabulary problem is introduced. Section 3 describes the attentional encoder–decoder framework for NMT and introduces factored NMT. In Section 4, we detail the proposed Pinyin-based NMT frameworks. In Section 5, we report the experimental results on the Chinese→English NIST task. Section 6 concludes and gives avenues for future work.

2 Related Work

The work on the open-vocabulary problem for NMT can be roughly divided into three categories:

– UNK in post-processing: this is a traditional way, usually used in SMT, to handle OOVs in the translation, e.g. using a back-off dictionary to translate OOVs. Different from SMT, NMT does not have a hard alignment between source and target words, so the UNKs in the translation are not strictly aligned to those in the source sequence.
– UNK in pre-processing: in this scenario, the unknown words in the source-side input are substituted by semantically similar words or paraphrases. However, it is not guaranteed that a proper substitution can be acquired from the limited in-vocabulary words. This method is applicable not only to alphabetic languages, but also to stroke-based languages. Splitting words into subwords, e.g. with BPE, is another effective way to pre-process source-side sentences.
– UNK in decoding: in this scenario, the UNKs are dynamically processed during decoding. For example, a word–character combined model can be used to recover a target UNK with a character-based model if the input word is an OOV. Another methodology is to manipulate a large-scale target vocabulary by selecting a subset to speed up decoding and alleviate the UNK problem.

Regarding the first category, Luong et al. propose a back-off dictionary method to handle OOVs in the translation [14]. They first train an NMT system augmented by the output of a word alignment algorithm, allowing the NMT system to emit, for each OOV word in the target sentence, the position of its corresponding word in the source sentence. A post-processing step then translates every OOV word using a dictionary. Their experiments on the WMT'14 English→French translation task show a substantial improvement of up to 2.8 BLEU points over a standard NMT system.

In terms of the second category, Li et al. propose a substitution–translation–restoration method [11]. In the substitution step, rare words in a test sentence are replaced with similar in-vocabulary words based on a similarity model learnt from monolingual data. In the translation and restoration steps, the sentence is translated with a model trained on new bilingual data in which rare words have been replaced, and finally the translations of the replaced words are substituted by those of the original ones. Experiments on Chinese-to-English translation demonstrate that the proposed method significantly outperforms the standard attentional NMT system.

Sennrich et al. propose a variant of byte pair encoding for word segmentation in the source sentences, which is capable of encoding open vocabularies with a compact symbol vocabulary of variable-length subword units [19]. This method is simpler and more effective than using a back-off translation model. Experiments on the WMT'15 English→German and English→Russian translation tasks show that the BPE-based subword models significantly outperform the back-off dictionary baseline.
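To make the merge-learning step of BPE concrete, the following is a minimal sketch that follows the pseudocode published by Sennrich et al. [19]: the most frequent adjacent symbol pair is repeatedly merged into a new symbol. The toy vocabulary and the fixed number of merge operations are illustrative.

```python
# Minimal sketch of BPE merge learning, following the pseudocode in [19].
import re
import collections

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the given symbol pair with its merge."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy vocabulary: words as space-separated symbols with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):  # the number of merge operations is the only hyperparameter
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_vocab(best, vocab)
    print(best)
```

The learned merge operations are then applied, in order, to segment unseen words at test time, which is what makes the symbol vocabulary closed while the word vocabulary remains open.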
With respect to the third category, extensive work has been done on using a very large target vocabulary for NMT [6, 15, 10]. The basic idea is to select a subset of a large-scale target vocabulary from which to produce target words during decoding. Experiments on different language pairs show that the proposed methods can not only speed up translation, but also alleviate the UNK problem. Luong and Manning propose a word–character solution to achieve open-vocabulary NMT [12]. A hybrid system is built that translates mostly at the word level and consults the character components for rare words, i.e. a character-based model is used to recover the target UNK if the input word is an OOV. On the WMT'15 English→Czech translation task, the proposed hybrid approach outperforms systems that already handle unknown words.

3 Neural Machine Translation

3.1 Attentional NMT

The basic principle of an NMT system is that it maps a source-side sentence x = (x_1, ..., x_m) to a target sentence y = (y_1, ..., y_n) in a continuous vector space, where all sentences are assumed to terminate with a special "end-of-sentence" token <eos>. Conceptually, an NMT system employs neural networks to model the conditional distribution in (1):

p(y|x) = \prod_{i=1}^{n} p(y_i \mid y_{<i}, x)    (1)
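As a worked illustration of the factorization in (1), the following sketch accumulates per-step log-probabilities to score a target sequence. The decoder network is abstracted behind a placeholder `step_prob` (a hypothetical name), since the equation itself is agnostic to how p(y_i | y_<i, x) is computed.

```python
import math

def sequence_log_prob(target_ids, step_prob):
    """Score a target sequence under the factorization in Eq. (1):
    log p(y|x) = sum_i log p(y_i | y_<i, x).
    `step_prob(prefix, y_i)` stands in for the decoder's conditional
    distribution p(y_i | y_<i, x); a real attentional NMT decoder computes
    it with a softmax over the target vocabulary at each step.
    """
    total = 0.0
    for i, y_i in enumerate(target_ids):
        total += math.log(step_prob(target_ids[:i], y_i))
    return total

# Toy usage with a uniform distribution over a 4-word target vocabulary.
uniform = lambda prefix, y: 0.25
print(sequence_log_prob([2, 0, 3, 1], uniform))  # equals 4 * log(0.25)
```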