<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Pinyin as Subword Unit for Chinese-Sourced Neural Machine Translation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jinhua Du</string-name>
          <email>jinhua.du@adaptcentre.ie</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andy Way</string-name>
          <email>andy.way@adaptcentre.ie</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Unknown words (UNK), or the open-vocabulary problem, are a challenge for neural machine translation (NMT). For alphabetic languages such as English, German and French, transforming a word into subwords, for example with the byte pair encoding (BPE) algorithm, is an effective way to alleviate the UNK problem. However, for stroke-based languages such as Chinese, this method is not effective enough for translation quality. In this paper, we propose to utilize Pinyin, a romanization system for Chinese characters, to convert Chinese characters to subword units to alleviate the UNK problem. We first investigate how Pinyin and its four diacritics denoting tones affect the translation performance of NMT systems, and then propose different strategies to utilise Pinyin and tones as input factors for Chinese–English NMT. Extensive experiments conducted on Chinese–English translation demonstrate that the proposed methods can significantly improve translation quality, and can effectively alleviate the UNK problem for Chinese-sourced translation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        In recent years, NMT has made impressive progress [
        <xref ref-type="bibr" rid="ref1 ref20 ref3 ref4 ref8">8, 3, 20, 1, 4</xref>
        ]. The state-of-the-art
NMT model employs an encoder–decoder architecture with an attention mechanism, in
which the encoder summarizes the source sentence into a vector representation, and the
decoder produces the target string word by word from vector representations, and the
attention mechanism learns the soft alignment of a target word against source words [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
NMT systems have outperformed the state-of-the-art SMT model on various language
pairs in terms of translation quality [
        <xref ref-type="bibr" rid="ref13 ref2 ref21 ref22 ref5 ref7">13, 2, 7, 22, 21, 5</xref>
        ].
      </p>
      <p>
        The translation of rare words is not only an open problem for statistical machine
translation (SMT), but also for NMT. Current NMT systems take a fixed vocabulary for
the input and output sequences, and the rare words in the data are denoted as a symbol
“UNK”, which will make the translation inaccurate and disfluent to some extent. The
vocabulary of neural models is typically limited to 30,000–50,000 words, but
translation is an open-vocabulary problem, especially for languages with productive word
formation processes, such as agglutination and compounding. In these cases,
translation models require mechanisms that go below the word level [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Recent work has been done in improving the generalisation capability of NMT for
open vocabulary [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref14 ref19 ref6">14, 6, 11, 12, 19, 10</xref>
        ]. For example, translation of the out-of-vocabulary
(OOV) words can be regarded as a post-processing step as in SMT, i.e. keeping the
OOVs in the hypothesis, and then using a bilingual dictionary to obtain translations
of these OOVs. The deficiency of this back-off dictionary method is that it needs extra
knowledge or resources to alleviate the OOV problem. However, for some low-resource
languages or domains, this is not feasible.
      </p>
      <p>
        Byte pair encoding is an effective way to segment a word into subwords for
alphabetic languages such as English, German and French, and it does not rely on external
resources [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, it is not that straightforward for stroke-based languages such
as Chinese. For Chinese-sourced NMT systems, the word is often used as the basic unit in
the input sequence. However, for a large-scale data set, word-level units, compared to
character-level and subword-level units, bring a data sparsity problem in terms of
rare words, i.e. many named entities, dates, times and numbers occur infrequently, resulting
in a very large vocabulary. In NMT, these infrequent words are therefore
represented as an “UNK” token. Intuitively, if we can transform Chinese characters
into alphabetic compositions, then we can easily employ the BPE algorithm to convert
Chinese words into subwords, which may alleviate the rare-word problem.
      </p>
      <p>Chinese Pinyin, literally “spelled sounds”, is the official romanization
system for Standard Chinese in mainland China, Malaysia, Singapore and Taiwan.<sup>1</sup> The
system includes four diacritics denoting tones. Pinyin without tone marks is used to
spell Chinese names and words in languages written with the Latin alphabet, and also
in certain computer input methods to enter Chinese characters.</p>
      <p>An example of Chinese characters with their corresponding Pinyin and English
translations is shown below.<sup>2</sup></p>
      <sec id="sec-1-1">
        <title>Pinyin:</title>
      </sec>
      <sec id="sec-1-2">
        <title>Character:</title>
      </sec>
      <sec id="sec-1-3">
        <title>Tone:</title>
      </sec>
      <sec id="sec-1-4">
        <title>English:</title>
        <p>In this example, we can see that in the “Pinyin” row the letters are the same but the
tones are different, which indicates that the pronunciation of each Chinese character
in the “Character” row is different. Tones are essential for the correct pronunciation of
Mandarin syllables.</p>
        <p>Normally, the tone mark is placed over the letter that represents the syllable nucleus,
except for the neutral tone. The tones are marked as follows:
– The first tone (Flat or High Level Tone) is represented by a macron (ˉ) added to
the pinyin vowel;
– The second tone (Rising or High-Rising Tone) is denoted by an acute accent (´);
– The third tone (Falling-Rising or Low Tone) is marked by a caron (ˇ);
– The fourth tone (Falling or High-Falling Tone) is represented by a grave accent (`);
– The fifth tone (Neutral Tone) is represented by a normal vowel without any accent
mark.</p>
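        <p>The mapping between tone marks and tone numbers described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name split_tone and the sample syllables are hypothetical choices, not part of the systems described in this paper. It relies on Unicode decomposition, in which each of the four tone diacritics becomes a separate combining character.</p>
        <preformat>
```python
import unicodedata

# Combining marks for the four marked tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def split_tone(syllable):
    """Return (toneless syllable, tone number); 5 stands for the neutral tone."""
    tone = 5
    kept = []
    # NFD separates a vowel from the diacritics stacked on top of it.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # NFC recomposes what is left, so the diaeresis of "ü" stays on its vowel.
    return unicodedata.normalize("NFC", "".join(kept)), tone


for example in ["mā", "má", "mǎ", "mà", "ma", "lǜ"]:
    print(example, split_tone(example))
```
</preformat>
        <p>A syllable that carries none of the four marks is treated as the neutral (fifth) tone, which matches the convention listed above.</p>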
        <sec id="sec-1-4-1">
          <p>Footnotes: (1) https://en.wikipedia.org/wiki/Pinyin (2) The source of the example is: https://en.wikipedia.org/wiki/Pinyin</p>
          <p>From this example we can see that Chinese characters and words can be converted
into alphabetic forms using Pinyin with or without tones, so the BPE algorithm can be
applied as it is for alphabetic languages. In this paper, we explore different ways to utilize
Pinyin as the subword unit converter for Chinese-sourced NMT, namely character-level
Pinyin without tones (ChPy), character-level Pinyin with tones (ChPyT), word-level
Pinyin without tones (WdPy), and word-level Pinyin with tones (WdPyT). Furthermore,
we propose to use Pinyin as an input factor for a standard word-level NMT system (fac.
NMT), and to use tones as the input factor for the “WdPy” NMT system (fac. WdPy).
Extensive experiments conducted on the Chinese→English NIST translation
task show that 1) using Pinyin to replace Chinese characters/words can significantly
reduce the vocabulary size, resulting in a significant decrease of UNK symbols in
translations; 2) WdPyT and factor-based Pinyin NMT systems can significantly improve
translation quality compared to the standard word-level NMT system.</p>
          <p>The main contributions of this work include:
– We extensively investigate different uses of Pinyin as subword units for
Chinese-sourced NMT systems.
– We propose to integrate Pinyin or tones as input factors to augment NMT systems.
– We provide a qualitative analysis of the translation results.</p>
          <p>The rest of the paper is organised as follows. In Section 2, related work on the open
vocabulary problem is introduced. Section 3 describes the attentional encoder–decoder
framework for NMT, and introduces factored NMT. In Section 4, we detail the
proposed Pinyin-based NMT frameworks. In Section 5, we report the
experimental results on the Chinese→English NIST task. Section 6 concludes and gives avenues
for future work.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>The work on the open vocabulary problem for NMT can be roughly categorised into
three categories:
– UNK in post-processing: This is a traditional way that is usually used in SMT to
handle OOVs in the translation, e.g. using a back-off dictionary to translate OOVs.
Different from SMT, NMT does not have a hard alignment between the source and
target words, so the UNKs in the translation are not strictly aligned to those in the
source sequence.
– UNK in pre-processing: in this scenario, the unknown words in the source-side
input are substituted by semantically similar words or paraphrases. However, it
is not guaranteed that a proper substitution can be acquired from the limited
in-vocabulary words. This method is not only applicable for alphabetic languages, but
also for stroke-based languages. Splitting words into subwords is another effective
way to pre-process source-side sentences, such as the BPE.
– UNK in decoding: in this scenario, the UNKs are dynamically processed during
decoding. For example, a word-character combined model can be used to recover
target UNK by a character-based model if the input word is an OOV. Another
methodology is to manipulate a large-scale target vocabulary by selecting a subset to speed
up the decoding and alleviate the UNK problem.</p>
      <p>
        Regarding the first category, Luong et al. propose a back-off dictionary method to
handle OOVs in the translation [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. They first train an NMT system augmented by
the output of a word alignment algorithm, allowing the NMT system to emit, for each
OOV word in the target sentence, the position of its corresponding word in the source
sentence. Then a post-processing step is used to translate every OOV word using a
dictionary. Their experiments on the WMT’14 English→French translation task show
a substantial improvement of up to 2.8 BLEU points over a standard NMT system.
      </p>
      <p>
        In terms of the second category, Li et al. propose a substitution-translation-restoration
method [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The rare words in a testing sentence are replaced with similar in-vocabulary
words based on a similarity model learnt from monolingual data in the substitution step.
In translation and restoration steps, the sentence will be translated with a model trained
on new bilingual data with rare words replaced, and finally the translations of the
replaced words will be substituted by those of the original ones. Experiments on
Chinese-to-English translation demonstrate that the proposed method can significantly outperform
the standard attentional NMT system.
      </p>
      <p>
        Sennrich et al. propose a variant of byte pair encoding for word segmentation in the
source sentences, which is capable of encoding open vocabularies with a compact
symbol vocabulary of variable-length subword units [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This method is simpler, and more
effective than using a back-off translation model. Experiments on the WMT’15
translation tasks English→German and English→Russian show that the BPE-based subword
models significantly outperform the back-off dictionary baseline.
      </p>
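      <p>To make the subword idea concrete, the following is a minimal sketch of learning and applying BPE merge operations on toneless Pinyin words. It is a simplified illustration rather than the implementation used in the cited work: it omits the end-of-word marker, breaks frequency ties alphabetically, and the names learn_bpe, merge_pair and segment, as well as the toy word frequencies, are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out = []
    i = 0
    while i != len(symbols):
        if symbols[i:i + 2] == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word -> frequency dictionary."""
    # Every word starts as a sequence of single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair first; ties are broken alphabetically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Apply the learned merges in order and mark non-final subwords with '@@'."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]


# Toy frequencies, chosen only for illustration.
toy_freqs = {"zuzhi": 5, "zhan": 4, "zuzhizhan": 1}
toy_merges = learn_bpe(toy_freqs, 6)
print(toy_merges)
print(segment("zuzhizhan", toy_merges))
```
</preformat>
      <p>Because frequent character sequences are merged first, frequent words end up as single symbols while rare words are split into several subwords, which is what keeps the symbol vocabulary compact.</p>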
      <p>
        With respect to the third category, extensive work has been done on using a very
large target vocabulary for NMT [
        <xref ref-type="bibr" rid="ref10 ref15 ref6">6, 15, 10</xref>
        ]. The basic idea is to select a subset from
a large-scale target vocabulary to produce a target word during the decoding process.
Their experiments on different language pairs show that the proposed methods can not
only speed up the translation, but also alleviate the UNK problem. Luong and Manning
propose a word-character solution to achieving open vocabulary NMT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A hybrid
system is built to translate mostly at the word level and consult the character
components for rare words, i.e. a character-based model will be used to recover the target
UNK if the input word is an OOV. On the WMT’15 English→Czech translation task,
the proposed hybrid approach outperforms systems that already handle unknown words.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Neural Machine Translation</title>
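      <p>3.1 Attentional NMT</p>
      <p>The basic principle of an NMT system is that it maps a source-side sentence <inline-formula><tex-math>x = (x_1, \ldots, x_m)</tex-math></inline-formula>
to a target sentence <inline-formula><tex-math>y = (y_1, \ldots, y_n)</tex-math></inline-formula> in a continuous vector space, where
all sentences are assumed to terminate with a special “end-of-sentence” token &lt;eos&gt;.
Conceptually, an NMT system employs neural networks to model the conditional
distribution in (1):</p>
      <disp-formula><label>(1)</label><tex-math><![CDATA[p(y \mid x) = \prod_{i=1}^{n} p(y_i \mid y_{<i}, x_{\le m})]]></tex-math></disp-formula>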
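      <p>As a small worked illustration of the factorisation in (1), the probability of a whole target sentence is the product of the per-word conditional probabilities, which is usually accumulated in log space. The per-step probabilities below are made-up numbers and the function name is hypothetical.</p>
      <preformat>
```python
import math


def sentence_log_prob(step_probs):
    """Sum the logs of the per-word conditional probabilities of one target sentence."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical probabilities of three target words followed by the end-of-sentence token.
steps = [0.5, 0.4, 0.8, 0.9]
log_p = sentence_log_prob(steps)
print(log_p, math.exp(log_p))
```
</preformat>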
      <p>
        We utilise the NMT architecture in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which is implemented as an attentional
encoder-decoder network with recurrent neural networks (RNN).
      </p>
      <p>
        In this framework, the encoder is a bidirectional neural network [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] with gated
recurrent units [
        <xref ref-type="bibr" rid="ref3">3</xref>
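        ] where a source-side sequence <inline-formula><tex-math>x</tex-math></inline-formula> is converted to one-hot vectors and
fed in as the input, and then a forward sequence of hidden states <inline-formula><tex-math>(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_m)</tex-math></inline-formula> and
a backward sequence of hidden states <inline-formula><tex-math>(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_m)</tex-math></inline-formula> are calculated and concatenated
to form the annotation vectors <inline-formula><tex-math>h_j</tex-math></inline-formula>. The decoder is also an RNN that predicts a target
sequence <inline-formula><tex-math>y</tex-math></inline-formula> word by word, where each word <inline-formula><tex-math>y_i</tex-math></inline-formula> is generated conditioned on the decoder
hidden state <inline-formula><tex-math>s_i</tex-math></inline-formula>, the previous target word <inline-formula><tex-math>y_{i-1}</tex-math></inline-formula>, and the source-side context vector <inline-formula><tex-math>c_i</tex-math></inline-formula>, as
in (2):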
      </p>
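      <disp-formula><label>(2)</label><tex-math><![CDATA[p(y_i \mid y_{<i}, x) = g(y_{i-1}, s_i, c_i)]]></tex-math></disp-formula>
      <p>where <inline-formula><tex-math>g</tex-math></inline-formula> is the activation function that outputs the probability of <inline-formula><tex-math>y_i</tex-math></inline-formula>, and <inline-formula><tex-math>c_i</tex-math></inline-formula> is calculated
as a weighted sum of the annotations <inline-formula><tex-math>h_j</tex-math></inline-formula>. The weight <inline-formula><tex-math>\alpha_{ij}</tex-math></inline-formula> is computed as in (3):</p>
      <disp-formula><label>(3)</label><tex-math>\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{ik})}</tex-math></disp-formula>
      <p>where</p>
      <disp-formula><tex-math>e_{ij} = a(s_{i-1}, h_j)</tex-math></disp-formula>
      <p>is an alignment model which models the probability that the inputs around position <inline-formula><tex-math>j</tex-math></inline-formula> are
aligned to the output at position <inline-formula><tex-math>i</tex-math></inline-formula>. The alignment model is a single-layer feedforward
neural network that is learned jointly through backpropagation.</p>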
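      <p>The attention computation described above can be sketched as follows: alignment scores are normalised over the source positions with a softmax, and the context vector is the resulting weighted sum of the annotation vectors. This is an illustrative sketch in plain Python with made-up scores and two-dimensional annotations; the alignment network itself is not modelled, and the function names are hypothetical.</p>
      <preformat>
```python
import math


def attention_weights(scores):
    """Softmax over the alignment scores of one target position."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(e - peak) for e in scores]
    total = sum(exps)
    return [v / total for v in exps]


def context_vector(weights, annotations):
    """Weighted sum of the annotation vectors, one weight per source position."""
    dim = len(annotations[0])
    return [sum(w * h[d] for w, h in zip(weights, annotations)) for d in range(dim)]


# Hypothetical scores for three source positions and their annotation vectors.
scores = [2.0, 1.0, 0.1]
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = attention_weights(scores)
print(weights)
print(context_vector(weights, annotations))
```
</preformat>
      <p>The weights are non-negative and sum to one, so the context vector always lies within the span of the annotations it averages.</p>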
      <p>3.2 Factored NMT</p>
      <p>
        Factored NMT, introduced in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], represents the encoder input as a combination of
features as in (4):
      </p>
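      <disp-formula><label>(4)</label><tex-math>\overrightarrow{h}_j = g\Big(\overrightarrow{W}\Big(\big\Vert_{k=1}^{|F|} E_k x_{jk}\Big) + \overrightarrow{U}\,\overrightarrow{h}_{j-1}\Big)</tex-math></disp-formula>
      <p>
        where <inline-formula><tex-math>\Vert</tex-math></inline-formula> is the vector concatenation, <inline-formula><tex-math>E_k \in \mathbb{R}^{m_k \times K_k}</tex-math></inline-formula> are the feature-embedding
matrices, with <inline-formula><tex-math>\sum_{k=1}^{|F|} m_k = m</tex-math></inline-formula>, <inline-formula><tex-math>K_k</tex-math></inline-formula> is the vocabulary size of the <inline-formula><tex-math>k</tex-math></inline-formula>th feature, and <inline-formula><tex-math>|F|</tex-math></inline-formula> is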
the number of features in the feature set F [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
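      <p>The concatenation of feature embeddings can be illustrated with a small sketch. Each factor has its own embedding table, the embedding of the factor value at a position is looked up (which is equivalent to multiplying the embedding matrix by a one-hot vector), and the looked-up vectors are concatenated. The table sizes and values below are arbitrary, and the function name is hypothetical.</p>
      <preformat>
```python
def embed_factors(factor_ids, embedding_tables):
    """Concatenate one embedding per factor into a single encoder input vector."""
    vector = []
    for feature_id, table in zip(factor_ids, embedding_tables):
        vector.extend(table[feature_id])
    return vector


# Hypothetical tables: 3-dimensional embeddings for a word factor with two entries,
# and 1-dimensional embeddings for a second factor with three entries.
word_table = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
second_table = [[1.0], [2.0], [3.0]]

# Position j carries word id 1 and second-factor id 2; the result has 3 + 1 = 4 dimensions.
print(embed_factors([1, 2], [word_table, second_table]))
```
</preformat>
      <p>The dimensions of the individual embeddings add up to the total embedding size, so a new factor can be accommodated by shrinking the others rather than by enlarging the encoder input.</p>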
      <p>
        In factored NMT, the features can be any form of knowledge which might be useful
to NMT systems, such as POS tags, lemmas, morphological features and dependency
labels as used in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In our work, we use Pinyin or tones as the input factor to augment
NMT (cf. Section 4).
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 Pinyin for Chinese-Sourced Subword NMT</title>
      <p>Chinese Pinyin, as a romanization system for Standard Chinese, is often used for
teaching purposes or as a computer input method. Four tones, namely the first, second, third and
fourth tones, together with a neutral tone, are used to distinguish different characters/words with
different pronunciations.</p>
      <p>By transforming Chinese characters to Pinyin forms, the BPE method to encode
words into subwords can be directly applied. In order to investigate what role tones
play in Pinyin-based NMT systems, we set up four different configurations, namely:
– ChPy: the NMT system takes the character-level Pinyin without tones as input.
Character-level NMT indicates that we first segment a Chinese sentence into a
character sequence, and then convert each character into its Pinyin form without
the tone.
– ChPyT: the NMT system takes the character-level Pinyin with tones as input. In
some sense, the tone is helpful to disambiguate the character in a context.
– WdPy: the NMT system takes the word-level Pinyin without tones as input.
Word-level NMT indicates that we first segment a Chinese sentence into a word
sequence, and then convert each word into its Pinyin form without the tone.
– WdPyT: the NMT system takes the word-level Pinyin with tones as input.
An example is shown below in terms of these four settings.</p>
      <preformat>Chinese:     许多球队以纪律和组织战来降低风险
English:     many teams try to reduce risks through discipline and organization
Characters:  许 多 球 队 以 纪 律 和 组 织 战 来 降 低 风 险
ChPy:        xu duo qiu dui yi ji lv he zu zhi zhan lai jiang di feng xian
ChPyT:       xǔ duō qiú duì yǐ jì lǜ hé zǔ zhī zhàn lái jiàng dī fēng xiǎn
Words:       许多 球队 以 纪律 和 组织战 来 降低 风险
WdPy:        xuduo qiudui yi jilv he zuzhizhan lai jiangdi fengxian
WdPy(BPE):   xuduo qiudui yi jilv he zu@@ zhizhan lai jiangdi fengxian
WdPyT:       xǔduō qiúduì yǐ jìlǜ hé zǔzhīzhàn lái jiàngdī fēngxiǎn
WdPyT(BPE):  xǔduō qiúduì yǐ jìlǜ hé zǔzhī@@ zhàn lái jiàngdī fēngxiǎn</preformat>
      <p>In this example, we can see that:
– BPE can be easily applied to either character-level or word-level Pinyin sequences,
either with or without tones, which can reduce the OOVs in the data.
– The BPE algorithm encodes the infrequent word “zuzhizhan” (“组织战” in Chinese
and “organization” in English) in WdPy as the subwords “zu@@” and “zhizhan” in
WdPy(BPE), which represent “组” and “织战”, respectively. However, this
segmentation is not semantically correct because “织战” is not a
meaningful subword. We infer that this is caused by the homophone “zhizhan”, i.e.
the same Pinyin might correspond to different word forms. For example, “zhizhan”
could be the Chinese word “之战” (“fight” in English), “只占” (“has only” in English),
etc. Therefore, word-level Pinyin without tones brings more ambiguities for
translation.
– For WdPyT, we can see that the word “zǔzhīzhàn” (“组织战” in Chinese
and “organization” in English) is encoded as the subwords “zǔzhī@@” and “zhàn” in
WdPyT(BPE). The subword “zǔzhī@@” indicates “organization” in English and
“zhàn” represents “fight” in English. This segmentation is meaningful and correct,
so the tone is indeed helpful to disambiguate Pinyin forms.</p>
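      <p>The four configurations differ only in whether the Pinyin syllables of a word are joined and whether tone marks are kept. The sketch below illustrates this on the first two words of the example, using a tiny hand-written lookup table; the table PINYIN and the function to_pinyin are hypothetical stand-ins for a real character-to-Pinyin converter.</p>
      <preformat>
```python
# Hypothetical lookup table: character -> (Pinyin without tone, Pinyin with tone mark).
PINYIN = {
    "许": ("xu", "xǔ"),
    "多": ("duo", "duō"),
    "球": ("qiu", "qiú"),
    "队": ("dui", "duì"),
}


def to_pinyin(words, word_level, with_tones):
    """Convert a word-segmented sentence into one of the four Pinyin input forms."""
    column = 1 if with_tones else 0
    # Word-level forms glue the syllables of a word together; character-level forms do not.
    joiner = "" if word_level else " "
    return " ".join(joiner.join(PINYIN[ch][column] for ch in word) for word in words)


words = ["许多", "球队"]
print("ChPy: ", to_pinyin(words, word_level=False, with_tones=False))
print("ChPyT:", to_pinyin(words, word_level=False, with_tones=True))
print("WdPy: ", to_pinyin(words, word_level=True, with_tones=False))
print("WdPyT:", to_pinyin(words, word_level=True, with_tones=True))
```
</preformat>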
      <p>We also utilize Pinyin and tones as input factors for NMT systems, namely 1)
word-level Pinyin without tones as the input factor of Chinese words for a standard NMT
system; 2) tones as the input factor for the WdPy NMT system.</p>
      <p>An example to illustrate these two factored NMT systems is shown below.</p>
      <sec id="sec-4-1">
        <title>Chinese: 在本届世足赛大放异彩</title>
        <preformat>English:             dazzles at the world cup
factored NMT:        在|zai 本|ben 届|jie 世足赛|shizusai 大放异彩|dafangyicai
WdPy(BPE):           zai ben jie shizusai dafangyicai
factored WdPy:       zai|4 ben|3 jie|4 shizusai|4-2-4 dafangyicai|4-4-4-3
factored WdPy(BPE):  zai|O|4 ben|O|3 jie|O|4 shi@@|B|4-2-4 zusai|E|4-2-4
                     dafang@@|B|4-4-4-3 yicai|E|4-4-4-3</preformat>
        <p>In this example, “factored NMT” (fac. NMT) means that the encoder takes the
Chinese word and its Pinyin without tones as input. In the factored NMT row, the
factor on the left of the vertical bar “|” represents a Chinese word, and the factor on the
right of “|” indicates the corresponding Pinyin of the word. We expect that the Pinyin
can provide extra information to the word to further improve translation performance.
WdPy(BPE) indicates that the BPE algorithm is applied to the word-level Pinyin NMT
without tones. “factored WdPy” (fac. WdPy) means that the tone is integrated into
the encoder of WdPy NMT as an input factor. The left of the vertical bar “|” is the Pinyin
of a word, and the right is its corresponding tone. “factored WdPy(BPE)” means that
the BPE algorithm is applied to fac. WdPy. We can see that:
– by applying BPE, we have one more factor set {O, B, E}, where “O” indicates a
non-subword, “B” indicates the beginning of subwords, and “E” represents the end
of subwords.
– the infrequent words “shizusai” and “dafangyicai” are segmented into the subwords
“shi@@” (literally “world”) and “zusai” (literally “football game”), and “dafang@@”
(literally “demonstrate”) and “yicai” (literally “splendor”), respectively. From
the English translations of these subwords, we can see that rare words can potentially be
translated via their segments.
– in WdPy(BPE), the Pinyin words “shizusai” and “dafangyicai” are not segmented,
but they are segmented in factored WdPy(BPE), so we infer that tones are helpful
to segment rare words into subwords.</p>
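        <p>The construction of the factored WdPy(BPE) input can be sketched as follows. Given the BPE segmentation of each Pinyin word and the tone string of the whole word, every subword receives a subword tag and a copy of the word's tones. The function name is hypothetical, and tagging every non-final subword of a split word as “B” is an assumption made here for words with more than two subwords, a case the example does not cover.</p>
        <preformat>
```python
def factor_tokens(bpe_words, word_tones):
    """Attach an O/B/E tag and the word-level tone string to every BPE subword."""
    out = []
    for subwords, tones in zip(bpe_words, word_tones):
        last = len(subwords) - 1
        for pos, sub in enumerate(subwords):
            if last == 0:
                tag = "O"  # the word was not split
            elif pos == last:
                tag = "E"  # final subword of a split word
            else:
                tag = "B"  # assumption: every non-final subword is tagged B
            out.append(sub + "|" + tag + "|" + tones)
    return out


# BPE segmentation and tones of the example sentence, one entry per word.
bpe_words = [["zai"], ["ben"], ["jie"], ["shi@@", "zusai"], ["dafang@@", "yicai"]]
word_tones = ["4", "3", "4", "4-2-4", "4-4-4-3"]
print(" ".join(factor_tokens(bpe_words, word_tones)))
```
</preformat>
        <p>Each subword inherits the tones of the full word, so the tone factor still identifies the original word even after it has been split.</p>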
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Experiments</title>
      <p>5.1 Experimental Settings</p>
      <p>For the Chinese→English task,<sup>3</sup> we use 1.4M sentence pairs extracted from LDC ZH–EN
corpora as the training data, the NIST 2004 current set, which contains 1,597 sentences, as
the development/validation set, and the NIST 2005 current set, which contains
1,082 sentences, as the test set. There are four references for each Chinese sentence.</p>
      <p>Footnote 3: In the rest of the paper, we use ZH and EN to denote Chinese and English, respectively.</p>
      <p>
        The baseline NMT system takes the Chinese word sequence as input without any
Pinyin information, which is also defined as the standard NMT system. We use
Nematus [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as the NMT system, and set minibatches of size 80, a maximum sentence
length of 60, word embeddings of size 600, and hidden layers of size 1024. The
vocabulary size for input and output is set to 45K. The models are trained with the Adadelta
optimizer [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], reshuffling the training corpus between epochs. We validate the model
every 10,000 minibatches via BLEU [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] scores on the validation set and save the model
every 10,000 iterations.
      </p>
      <p>
        As in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], for factored NMT systems, in order to ensure that performance
improvements are not simply due to an increase in the number of model parameters, we keep
the total size of the embedding layer fixed to 600.
      </p>
      <p>
        We use a Python tool to convert Chinese characters/words into Pinyin.<sup>4</sup>
All results are reported as case-insensitive BLEU scores, and statistical significance
is calculated via a bootstrap resampling significance test [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>5.2 Statistics</p>
      <p>The source-side vocabulary sizes of different Pinyin-based NMT systems and factored
NMT systems are shown in Table 1.</p>
      <table-wrap>
        <label>Table 1</label>
        <caption><p>Vocabulary sizes of the source-side training data from different NMT systems</p></caption>
        <table>
          <thead>
            <tr><th>SYS</th><th>baseline</th><th>ChPy</th><th>ChPyT</th><th>WdPy</th><th>WdPyT</th><th>fac. NMT</th><th>fac. WdPy</th></tr>
          </thead>
          <tbody>
            <tr><td>V-size</td><td>185,029</td><td>15,872</td><td>19,697</td><td>97,918</td><td>114,067</td><td>185,029/144,584</td><td>50,589/9,700</td></tr>
            <tr><td>Ratio(%)</td><td>–</td><td>8.6</td><td>10.65</td><td>52.92</td><td>61.65</td><td>–/78.14</td><td>27.34/5.24</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In Table 1, the BPE algorithm is applied to all NMT systems except the baseline and
factored NMT (fac. NMT). “Ratio(%)” indicates the vocabulary size
of a Pinyin-based NMT system as a percentage of that of the baseline. We can see that the vocabulary
sizes of all Pinyin-based NMT systems, namely ChPy, ChPyT, WdPy, WdPyT and
fac. WdPy, are significantly reduced. The reduction of the vocabulary size also indicates
a decrease in rare words.</p>
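      <p>The “Ratio(%)” row can be reproduced directly from the vocabulary sizes in Table 1, as the short check below shows. The variable and function names are hypothetical; the numbers are those reported in the table.</p>
      <preformat>
```python
BASELINE_V_SIZE = 185029

# Source-side vocabulary sizes reported in Table 1.
V_SIZE = {"ChPy": 15872, "ChPyT": 19697, "WdPy": 97918, "WdPyT": 114067}


def ratio(v_size, baseline=BASELINE_V_SIZE):
    """Vocabulary size as a percentage of the baseline, rounded to two decimals."""
    return round(100.0 * v_size / baseline, 2)


for system, size in V_SIZE.items():
    print(system, ratio(size))
```
</preformat>
      <p>The value computed for ChPy is 8.58, which the table reports as 8.6; the other three systems match the table to two decimal places.</p>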
      <p>5.3 Experimental Results</p>
      <p>– ChPy and ChPyT are significantly worse than the baseline. We attribute this to
1) the significantly longer sequences caused by character-level units, and 2)
the smaller vocabulary, which introduces more ambiguities to the Pinyin characters.</p>
      <sec id="sec-5-1">
        <p>Footnote 4: https://github.com/mozillazg/python-pinyin</p>
        <p>– WdPy is comparable to the baseline in terms of BLEU. However, the vocabulary
size of WdPy is only 52.92% of that of the baseline. The results from ChPy, ChPyT
and WdPy suggest that if we add extra factors to disambiguate the Pinyin,
we might further improve the translation quality. Thus, we propose fac. WdPy
to verify this intuition.
– WdPyT significantly improves translation performance by 1.25 (35.49→36.74)
BLEU points on the validation set, and 0.75 (31.76→32.51) BLEU points on the
test set, compared to the baseline. Moreover, the vocabulary size of
WdPyT is only 61.65% of that of the baseline. The result shows that word-level
Pinyin with tones can not only reduce the vocabulary size and the number of rare words, but also
improve system performance.
– fac. WdPy significantly outperforms the baseline by 0.38 (31.76→32.14) BLEU
points on the test set, and significantly improves on WdPy by 0.51 (31.63→32.14) BLEU points
on the test set, which shows that tones can provide extra useful
information to disambiguate the Pinyin word and further improve translation quality.
– fac. NMT significantly improves on the baseline by 1.96 (35.49→37.45) BLEU points on the
validation set, and by 1.42 (31.76→33.18) BLEU points on the test set.
The results show that Pinyin as an input factor for the standard
NMT is indeed helpful.</p>
        <p>Besides reporting the BLEU scores, we also examine the influence of Pinyin on the UNK
issue in translations. Table 3 shows the change in UNK symbols across different systems.
In Table 3, “Ratio” indicates the reduction rate of the number of UNK symbols in a
Pinyin-based NMT system over that of the baseline. From Table 3, we can see that:
– ChPy produces more UNK symbols in the translation. The reason is that the serious
ambiguity issue caused by the smaller vocabulary size makes the NMT system
produce many continuous UNK sequences.
– ChPyT reduces the number of UNKs in translations because the tones constrain
character-level Pinyin and help to disambiguate the units.</p>
        <p>– WdPy and WdPyT significantly reduce the UNK symbols in translations.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>In this paper we propose a subword transformation solution for Chinese-sourced NMT,
i.e. using Chinese Pinyin to convert Chinese characters/words into subword units.
Subsequently, the BPE algorithm is directly applied to reduce the number of rare words.
Furthermore, we propose two factored NMT systems, one of which uses tones as the input
factor for word-level Pinyin NMT, and the other of which integrates word-level Pinyin
without tones as an input factor into a standard word sequence-based NMT system. We
observe from experiments on the Chinese→English NIST task that 1) Pinyin as a subword unit
can indeed significantly reduce rare words, although it can also introduce more
ambiguities; 2) Pinyin-based NMT keeps the vocabulary size at a reasonable scale
while achieving comparable (WdPy) or, with tones, better (WdPyT) translation
performance; 3) using Pinyin or tones as input factors can
improve translation quality compared to the baseline, which shows that they can provide
extra information to disambiguate the input units.</p>
      <p>As to future work, we plan to experiment with more effective factors to further
improve translation performance, and we will explore the feasibility of Pinyin in
Chinese-targeted NMT systems.</p>
      <p>Acknowledgement. We would like to thank the reviewers for their valuable and
constructive comments. This research is supported by the ADAPT Centre for Digital
Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106),
and by SFI Industry Fellowship Programme 2016 (Grant 16/IFB/4490).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>In: Proceedings of the 3rd International Conference on Learning Representations</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . San Diego, USA (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bentivogli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bisazza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cettolo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Federico</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Neural versus phrase-based machine translation quality: a case study</article-title>
          .
          <source>In: Proceedings of the EMNLP</source>
          . pp.
          <fpage>257</fpage>
          -
          <lpage>267</lpage>
          . Austin, Texas, USA (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Merrienboer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bougares</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>In: Proceedings of the EMNLP</source>
          . pp.
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          . Doha, Qatar (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Way</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Neural pre-translation for hybrid machine translation</article-title>
          .
          <source>In: Proceedings of MT Summit XVI</source>
          , vol.
          <volume>1</volume>
          : Research Track. pp.
          <fpage>27</fpage>
          -
          <lpage>40</lpage>
          . Nagoya, Japan (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Way</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Pre-reordering for neural machine translation: Helpful or harmful?</article-title>
          <source>The Prague Bulletin of Mathematical Linguistics</source>
          (
          <volume>108</volume>
          ),
          <fpage>171</fpage>
          -
          <lpage>182</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jean</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Memisevic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>On using very large target vocabulary for neural machine translation</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . Beijing, China (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Junczys-Dowmunt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwojak</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Is neural machine translation ready for deployment? A case study on 30 translation directions</article-title>
          .
          <source>In: Proceedings of the IWSLT</source>
          . Tokyo, Japan (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blunsom</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Recurrent continuous translation models</article-title>
          .
          <source>In: Proceedings of the EMNLP</source>
          . pp.
          <fpage>1700</fpage>
          -
          <lpage>1709</lpage>
          . Seattle, Washington, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Statistical significance tests for machine translation evaluation</article-title>
          .
          <source>In: Proceedings of the EMNLP</source>
          . pp.
          <fpage>388</fpage>
          -
          <lpage>395</lpage>
          . Barcelona, Spain (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>L'Hostis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grangier</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Vocabulary selection strategies for neural machine translation</article-title>
          .
          <source>In: arXiv:1610.00072</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zong</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards zero unknown word in neural machine translation</article-title>
          .
          <source>In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)</source>
          . pp.
          <fpage>2852</fpage>
          -
          <lpage>2858</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Achieving open vocabulary neural machine translation with hybrid word-character models</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>1054</fpage>
          -
          <lpage>1063</lpage>
          . Berlin, Germany (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          .
          <source>In: Proceedings of the EMNLP</source>
          . pp.
          <fpage>1412</fpage>
          -
          <lpage>1421</lpage>
          . Lisbon, Portugal (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Luong</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaremba</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Addressing the rare word problem in neural machine translation</article-title>
          .
          <source>In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing</source>
          . pp.
          <fpage>11</fpage>
          -
          <lpage>19</lpage>
          . Beijing, China (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ittycheriah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Vocabulary manipulation for neural machine translation</article-title>
          .
          <source>In: Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>
          . Berlin, Germany (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Papineni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Bleu: A method for automatic evaluation of machine translation</article-title>
          .
          <source>In: Proceedings of the ACL</source>
          . pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . Philadelphia, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firat</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitschler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Junczys-Dowmunt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Läubli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barone</surname>
            ,
            <given-names>A.V.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mokry</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nădejde</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Nematus: a toolkit for neural machine translation</article-title>
          .
          <source>In: arXiv:1703.04357</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Linguistic input features improve neural machine translation</article-title>
          .
          <source>In: Proceedings of the First Conference on Machine Translation</source>
          . pp.
          <fpage>83</fpage>
          -
          <lpage>91</lpage>
          . Berlin, Germany (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Neural machine translation of rare words with subword units</article-title>
          .
          <source>In: Proceedings of the ACL</source>
          . pp.
          <fpage>1715</fpage>
          -
          <lpage>1725</lpage>
          . Berlin, Germany (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In: Proceedings of the 2014 Neural Information Processing Systems</source>
          . pp.
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          . Montreal, Canada (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Toral</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sánchez-Cartagena</surname>
            ,
            <given-names>V.M.</given-names>
          </string-name>
          :
          <article-title>A multifaceted evaluation of neural versus phrase-based machine translation for 9 language directions</article-title>
          .
          <source>In: Proceedings of the EACL</source>
          . Valencia, Spain (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Norouzi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Google's neural machine translation system: Bridging the gap between human and machine translation</article-title>
          .
          <source>In: arXiv:1609.08144</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Zeiler</surname>
            ,
            <given-names>M.D.</given-names>
          </string-name>
          :
          <article-title>Adadelta: An adaptive learning rate method</article-title>
          . In: CoRR, abs/1212.5701 (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>