    Pinyin as Subword Unit for Chinese-Sourced Neural
                  Machine Translation

                                  Jinhua Du†‡ , Andy Way†
           †
               ADAPT Centre, School of Computing, Dublin City University, Ireland
                              ‡
                                Accenture Labs, Dublin, Ireland
                    {jinhua.du, andy.way}@adaptcentre.ie



       Abstract. Unknown words (UNK), or the open-vocabulary problem, remain a
       challenge for neural machine translation (NMT). For alphabetic languages
       such as English, German and French, transforming a word into subwords,
       e.g. with the byte pair encoding (BPE) algorithm, is an effective way to
       alleviate the UNK problem. However, for stroke-based languages such as
       Chinese, the aforementioned method is not effective enough in terms of
       translation quality. In this paper, we propose to utilize Pinyin, a
       romanization system for Chinese characters, to convert Chinese characters
       to subword units and thereby alleviate the UNK problem. We first investigate
       how Pinyin and its four diacritics denoting tones affect the translation
       performance of NMT systems, and then propose different strategies to utilise
       Pinyin and tones as input factors for Chinese–English NMT. Extensive
       experiments on Chinese–English translation demonstrate that the proposed
       methods can markedly improve translation quality and effectively alleviate
       the UNK problem for Chinese-sourced translation.


1   Introduction
In recent years, NMT has made impressive progress [8, 3, 20, 1, 4]. The state-of-the-art
NMT model employs an encoder–decoder architecture with an attention mechanism, in
which the encoder summarizes the source sentence into vector representations, the
decoder produces the target string word by word from these representations, and the
attention mechanism learns a soft alignment between each target word and the source
words [1]. NMT systems have outperformed state-of-the-art SMT models on various
language pairs in terms of translation quality [13, 2, 7, 22, 21, 5].
    The translation of rare words is an open problem not only for statistical machine
translation (SMT), but also for NMT. Current NMT systems use a fixed vocabulary for
the input and output sequences, and rare words in the data are denoted by a single
“UNK” symbol, which makes translations inaccurate and disfluent to some extent. The
vocabulary of neural models is typically limited to 30,000–50,000 words, but trans-
lation is an open-vocabulary problem, especially for languages with productive word
formation processes, such as agglutination and compounding. In these cases, transla-
tion models require mechanisms that go below the word level [19].
    Recent work has been done on improving the generalisation capability of NMT for
open vocabulary [14, 6, 11, 12, 19, 10]. For example, translation of the out-of-vocabulary
(OOV) words can be regarded as a post-processing step as in SMT, i.e. keeping the
OOVs in the hypothesis, and then using a bilingual dictionary to obtain translations
of these OOVs. The deficiency of this back-off dictionary method is that it needs extra
knowledge or resources to alleviate the OOV problem, which is not feasible for some
low-resource languages or domains.
    Byte pair encoding is an effective way to segment a word into subwords for alpha-
betic languages such as English, German and French, and it does not rely on external
resources [19]. However, it is not that straightforward for stroke-based languages such
as Chinese. For Chinese-sourced NMT systems, the word is often used as the basic unit
in the input sequence. However, compared to character-level and subword-level units,
word-level units on a large-scale data set cause a data sparsity problem with respect to
rare words: many named entities, dates, times and numbers occur infrequently, resulting
in a very large vocabulary. In NMT these infrequent words are accordingly represented
by an “UNK” token. Intuitively, if we can transform Chinese characters into alphabetic
compositions, then we can easily employ the BPE algorithm to convert Chinese words
into subwords, which may alleviate the rare-word problem.
    Chinese Pinyin, which literally means “spelled sounds”, is the official romanization
system for Standard Chinese in mainland China, Malaysia, Singapore and Taiwan.1 The
system includes four diacritics denoting tones. Pinyin without tone marks is used to
spell Chinese names and words in languages written with the Latin alphabet, and also
in certain computer input methods to enter Chinese characters.
    An example of Chinese characters with their corresponding Pinyin and English
translations is shown below.2

Pinyin:          mā         má         mǎ        mà          ma
Character:       妈           麻          马          骂           吗
Tone:            First       Second     Third      Fourth      Neutral
English:         mother      hemp       horse      scold       question particle
    In this example, we can see that in the “Pinyin” row the letters are the same, but the
tones are different, which indicates that the pronunciation of each Chinese character
in the “Character” row is different. Tones are essential for the correct pronunciation of
Mandarin syllables.
    Normally, the tone mark is placed over the letter that represents the syllable nucleus,
except for the neutral tone. The tones are explained as follows (a conversion sketch is
given after this list):

 – The first tone (Flat or High Level Tone) is represented by a macron (ˉ) added to
   the Pinyin vowel;
 – The second tone (Rising or High-Rising Tone) is denoted by an acute accent (´);
 – The third tone (Falling-Rising or Low Tone) is marked by a caron (ˇ);
 – The fourth tone (Falling or High-Falling Tone) is represented by a grave accent (`);
 – The fifth tone (Neutral Tone) is represented by a normal vowel without any accent
   mark.


 1
     https://en.wikipedia.org/wiki/Pinyin
 2
     The source of the example is: https://en.wikipedia.org/wiki/Pinyin
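    To make the conversion concrete, the following minimal sketch (ours, not part of
the original paper) converts the example characters to Pinyin with and without tone
marks. It assumes the third-party pypinyin library; the paper does not specify which
conversion tool was actually used.

    from pypinyin import pinyin, lazy_pinyin, Style

    chars = "妈麻马骂吗"

    # Pinyin with tone diacritics, as in the example table above.
    with_tones = [s[0] for s in pinyin(chars, style=Style.TONE)]
    print(with_tones)          # e.g. ['mā', 'má', 'mǎ', 'mà', 'ma']

    # Pinyin without tones: all five characters collapse to the same string.
    print(lazy_pinyin(chars))  # ['ma', 'ma', 'ma', 'ma', 'ma']

    # Tones as trailing digits 1-4 (the neutral tone carries no digit), which
    # is convenient when tones are treated as a separate input factor.
    tone_digits = [s[0] for s in pinyin(chars, style=Style.TONE3)]
    print(tone_digits)         # e.g. ['ma1', 'ma2', 'ma3', 'ma4', 'ma']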
    From this example we can see that Chinese characters and words can be converted
into alphabetic forms using Pinyin with or without tones, so the BPE algorithm can be
applied just as for alphabetic languages. In this paper, we explore different ways to
utilize Pinyin as the subword unit converter for Chinese-sourced NMT, namely character-
level Pinyin without tones (ChPy), character-level Pinyin with tones (ChPyT), word-level
Pinyin without tones (WdPy), and word-level Pinyin with tones (WdPyT). Furthermore,
we propose to use Pinyin as an input factor for a standard word-level NMT system (fac.
NMT), and to use tones as the input factor for the “WdPy” NMT system (fac. WdPy),
respectively. Extensive experiments conducted on the Chinese→English NIST translation
task show that 1) using Pinyin to replace Chinese characters/words can significantly
reduce the vocabulary size, resulting in a significant decrease of UNK symbols in trans-
lations; 2) WdPyT and factor-based Pinyin NMT systems can significantly improve
translation quality compared to the standard word-level NMT system.
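    As a hedged illustration of the four input variants, the sketch below builds each
representation for a toy word-segmented sentence. The exact tokenisation conventions
(tone digits, joining syllables within a word) are our assumptions; the paper defines the
variants only at this level of detail.

    from pypinyin import pinyin, lazy_pinyin, Style

    # A word-segmented Chinese sentence (toy example: "we love China").
    words = ["我们", "爱", "中国"]

    # ChPy: character-level Pinyin without tones, one token per character.
    chpy = [syl for w in words for syl in lazy_pinyin(w)]

    # ChPyT: character-level Pinyin with tone digits.
    chpyt = [s[0] for w in words for s in pinyin(w, style=Style.TONE3)]

    # WdPy: word-level Pinyin without tones, syllables joined within a word.
    wdpy = ["".join(lazy_pinyin(w)) for w in words]

    # WdPyT: word-level Pinyin with tone digits.
    wdpyt = ["".join(s[0] for s in pinyin(w, style=Style.TONE3)) for w in words]

    print(chpy)   # ['wo', 'men', 'ai', 'zhong', 'guo']
    print(chpyt)  # e.g. ['wo3', 'men', 'ai4', 'zhong1', 'guo2']
    print(wdpy)   # ['women', 'ai', 'zhongguo']
    print(wdpyt)  # e.g. ['wo3men', 'ai4', 'zhong1guo2']

BPE can then be learnt over these alphabetic tokens exactly as for English or German.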
    The main contributions of this work include:
    – We extensively investigate different uses of Pinyin as subword units for Chinese-
      sourced NMT systems.
    – We propose to integrate Pinyin or tones as input factors to augment NMT systems.
    – We provide a qualitative analysis of the translation results.
    The rest of the paper is organised as follows. Section 2 introduces related work
on the open-vocabulary problem. Section 3 describes the attentional encoder–decoder
framework for NMT, and introduces factored NMT. In Section 4, we detail the different
Pinyin-based NMT frameworks we propose. In Section 5, we report the experimental
results on the Chinese→English NIST task. Section 6 concludes and gives avenues
for future work.

2     Related Work
The work on the open vocabulary problem for NMT can be roughly divided into three
categories:
    – UNK in post-processing: This is the traditional approach, commonly used in SMT,
      to handle OOVs in the translation, e.g. using a back-off dictionary to translate OOVs.
      Unlike SMT, NMT does not have a hard alignment between the source and
      target words, so the UNKs in the translation are not strictly aligned to those in the
      source sequence.
    – UNK in pre-processing: in this scenario, the unknown words in the source-side
      input are substituted by semantically similar words or paraphrases. However, it
      is not guaranteed that a proper substitution can be acquired from the limited in-
      vocabulary words. This method is applicable not only to alphabetic languages, but
      also to stroke-based languages. Splitting words into subwords, e.g. with BPE, is
      another effective way to pre-process source-side sentences.
    – UNK in decoding: in this scenario, the UNKs are dynamically processed during de-
      coding. For example, a word-character combined model can be used to recover a
      target UNK with a character-based model if the input word is an OOV. Another
      methodology is to maintain a large-scale target vocabulary and select a subset from
      it to speed up decoding and alleviate the UNK problem.
    Regarding the first category, Luong et al. propose a back-off dictionary method to
handle OOVs in the translation [14]. They first train an NMT system augmented by
the output of a word alignment algorithm, allowing the NMT system to emit, for each
OOV word in the target sentence, the position of its corresponding word in the source
sentence. Then a post-processing step is used to translate every OOV word using a
dictionary. Their experiments on the WMT’14 English→French translation task show
a substantial improvement of up to 2.8 BLEU points over a standard NMT system.
    In terms of the second category, Li et al. propose a substitution-translation-restoration
method [11]. In the substitution step, rare words in a test sentence are replaced with
similar in-vocabulary words based on a similarity model learnt from monolingual data.
In the translation and restoration steps, the sentence is translated with a model trained
on new bilingual data in which rare words have been replaced, and finally the transla-
tions of the replaced words are substituted by those of the original words. Experiments
on Chinese-to-English translation demonstrate that the proposed method significantly
outperforms the standard attentional NMT system.
    Sennrich et al. propose a variant of byte pair encoding for word segmentation in the
source sentences, which is capable of encoding open vocabularies with a compact sym-
bol vocabulary of variable-length subword units [19]. This method is simpler and more
effective than using a back-off translation model. Experiments on the WMT’15 transla-
tion tasks English→German and English→Russian show that the BPE-based subword
models significantly outperform the back-off dictionary baseline.
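    For reference, the core merge-learning loop of BPE can be sketched as follows. This
is our simplified restatement of the public pseudo-code in [19], applied here to a toy
vocabulary of toneless Pinyin strings, not the authors' actual implementation.

    import re
    import collections

    def get_stats(vocab):
        # Count frequencies of adjacent symbol pairs across the vocabulary.
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_vocab(pair, vocab):
        # Merge every occurrence of the given symbol pair into one symbol.
        bigram = re.escape(' '.join(pair))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        return {pattern.sub(''.join(pair), w): f for w, f in vocab.items()}

    # Toy vocabulary: toneless Pinyin words as space-separated characters,
    # with '</w>' marking the word end, mapped to corpus frequencies.
    vocab = {'z h o n g g u o </w>': 5, 'z h o n g x i n </w>': 3,
             'g u o j i a </w>': 4}

    for _ in range(8):               # number of merges (a hyper-parameter)
        pairs = get_stats(vocab)
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)

Frequent sequences such as “zhong” and “guo” become single subword symbols after a
few merge iterations.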
    With respect to the third category, extensive work has been done on using a very
large target vocabulary for NMT [6, 15, 10]. The basic idea is to select a subset from
a large-scale target vocabulary to produce a target word during the decoding process.
Their experiments on different language pairs show that the proposed methods can not
only speed up the translation, but also alleviate the UNK problem. Luong and Manning
propose a word-character solution to achieving open vocabulary NMT [12]. A hybrid
system is built to translate mostly at the word level and consult the character compo-
nents for rare words, i.e. a character-based model will be used to recover the target
UNK if the input word is an OOV. On the WMT’15 English→Czech translation task,
the proposed hybrid approach outperforms systems that already handle unknown words.

3     Neural Machine Translation
3.1   Attentional NMT
The basic principle of an NMT system is to map a source-side sentence x =
(x1 , . . . , xm ) to a target sentence y = (y1 , . . . , yn ) via a continuous vector space, where
all sentences are assumed to terminate with a special “end-of-sentence” token <eos>.
Conceptually, an NMT system employs neural networks to model the conditional dis-
tribution in (1):
                     p(y | x) = \prod_{i=1}^{n} p(y_i | y_{<i}, x)                   (1)
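    For reference, each factor in (1) is computed in the standard attentional model of
[1] as follows (our restatement of that well-known formulation, not text from this paper),
with s_i the decoder hidden state, h_j the encoder annotations and c_i a context vector:

    p(y_i | y_{<i}, x) = g(y_{i-1}, s_i, c_i),    s_i = f(s_{i-1}, y_{i-1}, c_i)

    c_i = \sum_{j=1}^{m} \alpha_{ij} h_j,
    \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{ik})},
    e_{ij} = a(s_{i-1}, h_j)

where g is a softmax output layer, f is a recurrent unit, and a is a feed-forward
alignment model that scores how well the input around position j matches the output
at position i.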