<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Incorporating Distinct Translation Statistical and Transformer Model System Outputs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mani Bansal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>D.K. Lobiyal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jawaharlal Nehru University</institution>
          ,
          <addr-line>Hauz khas South, New Delhi, 110067</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>394</fpage>
      <lpage>401</lpage>
      <abstract>
        <p>Finding the correct translation of an input sentence is a difficult task in Machine Translation, a core problem of Natural Language Processing (NLP). Hybridizing different translation models has proven to be an effective way of addressing this problem. This paper presents an approach that takes advantage of various translation models by combining their outputs with statistical machine translation (SMT) and a Transformer model. First, we obtain the outputs of Google Translator and Bing Microsoft Translator as external system outputs. Then, these outputs are fed into the SMT and Transformer systems. Finally, the combined output is generated by analyzing the Google Translator, Bing, SMT and Transformer outputs. Prior work has used system combination, but no approach so far has tried to combine statistical and Transformer systems with other translation systems. Experimental results on English-Hindi and Hindi-English translation show significant improvement.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Machine Translation (MT) is a major area of Natural Language Processing. There are various translation approaches, each with its pros and cons. One of the prominent recent approaches to MT is Statistical Machine Translation (SMT). The statistical system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is structured for adequacy and for handling out-of-vocabulary words. Neural Machine Translation (NMT) is a breakthrough that reduces post-editing effort [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and helps in dealing with the syntactic structure of a sentence. NMT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] outputs more fluent translations. Therefore, we build a hybrid system that combines Statistical and Transformer (NMT with a multi-head self-attention architecture) outputs to refine machine translation output.
      </p>
      <p>
        Combining these approaches into one is not an easy task. Using either SMT or the Transformer alone does not solve all issues. NMT over-translates and under-translates to some extent; long-distance dependencies, phrase repetitions, translation adequacy for rare words, and word alignment problems are also observed in neural systems. SMT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] handles long-term dependency issues but is unable to integrate additional information from the source text. Such additional information helps to disambiguate word sense and resolve named entity problems. The proposed architecture is evaluated on English-Hindi and Hindi-English datasets. The outputs of Bing Microsoft Translator and Google Translator are given as input to the Statistical and Transformer models to analyze the improvement in the combined target output. If the external translator outputs are obtained for the English to Hindi (Eng-Hi) direction, then the Statistical and Transformer systems use the reverse language pair, i.e. Hindi to English (Hi-Eng), as input. Therefore, the output of the external translators can be easily merged with the input of the other two systems, i.e. Statistical and Transformer.
      </p>
      <p>The paper is organized as follows: Section 2 gives a brief introduction to hybrid approaches proposed for Machine Translation. Section 3 elaborates our proposed approach. The experiments undertaken in this study and the results obtained are discussed in Section 4. Section 5 presents the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Many methods for machine translation have been presented in the literature. Researchers have combined different translation techniques [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to improve translation quality. We observe that most related studies take SMT as the baseline; very few studies in the literature consider combination with NMT.
      </p>
      <p>
        Example-based, knowledge-based and lexical transfer systems were combined using a chart manager in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and the best group of edges was selected with the help of a chart-walk algorithm (1994). The authors of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] computed a consensus translation by voting on a confusion network. They produced pairwise word alignments of the original machine translation hypotheses to build the confusion network.
      </p>
      <p>
        The Minimum Bayes' risk system combination (MBRSC) method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] gathers the benefits of two strategies: combination of sub-sequences and selection of sentences. It uses the best sub-sequences to generate the best translation.
      </p>
      <p>
        The lattice-based system combination model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] allows phrase alignments and uses a lattice to encode all candidate translations. The earlier confusion network processed word-level translations, whereas the lattice expresses n-to-n mappings for phrase-based translation. In the hybrid architecture of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], every target hypothesis was paraphrased using various approaches to obtain fused translations for each target, and the final selection was made among all fused translations.
      </p>
      <p>
        Multi-engine machine translation amalgamates the output of several MT systems into a single corrected translation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. It consists of a search space, a beam search decoder with its features, and many accessories. NMT decoding lacks a mechanism to guarantee that all source words are translated, and it favors short translations. Therefore, the authors of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] incorporate an SMT translation model and an n-gram language model under a log-linear framework.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. PROPOSED APPROACH</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. BASIC TRANSFORMER</title>
    </sec>
    <sec id="sec-5">
      <title>MACHINE TRANSLATION</title>
      <p>
        The Transformer model [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] accepts a source language sentence X = (x1, x2, ..., xN) as input and outputs a target language sentence Y = (y1, y2, ..., yM). NMT constructs a neural network that translates X into Y by learning the objective function p(Y|X) from a parallel corpus. The Transformer is an encoder-decoder model in which the encoder generates the intermediate representation ht (t = 1, 2, ..., N) from X (the source sentence) and the decoder generates Y (the target sentence) from ht:
ht = Transformer Encoder(X) (1)
Y = Transformer Decoder(ht) (2)
The encoder and decoder are each made up of a stack of six layers. Each encoder layer consists of multi-head self-attention and feed-forward sub-layers, whereas each decoder layer consists of three sub-layers: in addition to the two sub-layers of the encoder, the decoder embeds a cross-lingual multi-head attention layer over the encoder stack output.
      </p>
      <p>The attention mechanisms: both the self-attention mechanism and the cross-lingual attention are computed as follows:
Attention (Q, K, V) = softmax (QK^T / √dmodel) V (3)</p>
      <p>Here, Q represents the Query vector, and V and K represent the Value and Key vectors of the encoder and decoder, respectively; dmodel is the size of the Key vector. The product of Query and Key represents the similarity between each element of Q and K, and it is converted to a probability by the softmax function, which can be treated as the weight of the attention of Q on K.</p>
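      <p>For concreteness, the following minimal NumPy sketch implements Eq. (3); the dimensions and function names are illustrative assumptions, not the authors' code.</p>
      <preformat>
# Minimal NumPy sketch of scaled dot-product attention (Eq. 3).
# Shapes are assumptions: Q is (n_q, d), K and V are (n_k, d).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]                      # key/query dimension
    scores = Q @ K.T / np.sqrt(d)        # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # rows are probability distributions
    return weights @ V                   # weighted sum of value vectors

Q = np.random.randn(5, 64)   # e.g. 5 decoder positions
K = np.random.randn(7, 64)   # e.g. 7 encoder positions
V = np.random.randn(7, 64)
out = attention(Q, K, V)     # shape (5, 64)
      </preformat>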
      <p>The self-attention in the encoder captures the degree of association between words in the input sentence by using the Q, K and V of the encoder. Similarly, the self-attention in the decoder captures the degree of association between words of the output sentence by using the Query, Key and Value of the decoder. The cross-lingual attention mechanism computes the degree of association between a word in the source sentence and a word in the target sentence by using the Query of the decoder and the output of the last encoder layer as Key and Value. In multi-head self-attention with h heads, Q, K, and V are linearly projected to h subspaces, and the attention function is applied in parallel on each subspace. The concatenation of these heads is projected back to a space with the original dimension:
MultiHead (Q, K, V) = Concat (head1, ..., headh) WO (4)
headi = Attention (Q WiQ, K WiK, V WiV) (5)
where WiQ ∈ R^(dmodel × dk), WiK ∈ R^(dmodel × dk), WiV ∈ R^(dmodel × dv) and WO ∈ R^(h dv × dmodel) are weight matrices and Concat is a function that concatenates matrices. Multi-head attention learns information from different representation subspaces at different positions. Because the Transformer has no recurrent or convolutional structure, it uses positional encoding (PE) to encode the position-related information of each word in a sentence. PE is calculated as follows:
PE (pos, 2i) = sin (pos / 10000^(2i/dmodel)) (6)
PE (pos, 2i + 1) = cos (pos / 10000^(2i/dmodel)) (7)
where i is the dimension and pos is the absolute position of the word.</p>
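      <p>A short NumPy sketch of Eqs. (6)-(7), with max_len and d_model chosen arbitrarily for illustration:</p>
      <preformat>
# Sinusoidal positional encoding (Eqs. 6-7); sizes are illustrative.
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                # absolute word positions
    i = np.arange(d_model // 2)[None, :]             # dimension index
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even dimensions (Eq. 6)
    pe[:, 1::2] = np.cos(angle)                      # odd dimensions (Eq. 7)
    return pe

pe = positional_encoding(max_len=256, d_model=512)   # one row per position
      </preformat>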
      <p>Fig 1. Translation System Architecture</p>
    </sec>
    <sec id="sec-6">
      <title>3.2. COMBINING MACHINE</title>
    </sec>
    <sec id="sec-7">
      <title>TRANSLATION OUTPUT WITH</title>
    </sec>
    <sec id="sec-8">
      <title>TRANFORMER MODEL</title>
      <p>The inputs of the Transformer system are the translated outputs of different (external) translation methods and the same source sentence, as shown in Fig 1. We use three encoders: one for the Google output text, one for the Bing output, and one for the source language. The concatenation of the representations generated by the three encoders is fed into a conventional Transformer decoder.</p>
      <p>The Google output text encoder takes X1 = (x11, x12, ..., x1N) as input. The Bing output sentence is represented as X2 = (x21, x22, ..., x2N), and the third Transformer encoder accepts the source language sentence X3 = (x31, x32, ..., x3N) as input. The Transformer encoders then generate the intermediate representations hg, hb and ht from X1, X2 and X3:
hg = Transformer Encoder(X1) (8)
hb = Transformer Encoder(X2) (9)
ht = Transformer Encoder(X3) (10)</p>
      <p>
        After the intermediate representations of the inputs are computed, they are concatenated in the composition layer [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The concatenation h is the output of the proposed encoder and is fed to the decoder of our model.
      </p>
      <p>h = Concat (hg, hb, ht) (11)</p>
      <p>The decoder generates the combined target language output using the above expressions.</p>
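      <p>The following PyTorch sketch illustrates Eqs. (8)-(11): three independent Transformer encoders whose outputs are concatenated along the time axis and used as the memory of a single decoder. The module names, random placeholder tensors, and shapes are our own illustrative assumptions, not the authors' implementation.</p>
      <preformat>
# Sketch of the combination encoder (Eqs. 8-11) with PyTorch's built-in
# Transformer layers; tensors use the default (seq_len, batch, d_model) layout.
import torch
import torch.nn as nn

d_model = 512

def make_encoder():
    layer = nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=2048)
    return nn.TransformerEncoder(layer, num_layers=6)

enc_google, enc_bing, enc_source = make_encoder(), make_encoder(), make_encoder()
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

x1 = torch.randn(20, 1, d_model)  # embedded Google translation X1
x2 = torch.randn(22, 1, d_model)  # embedded Bing translation X2
x3 = torch.randn(18, 1, d_model)  # embedded source sentence X3
y = torch.randn(25, 1, d_model)   # embedded target prefix

hg, hb, ht = enc_google(x1), enc_bing(x2), enc_source(x3)  # Eqs. (8)-(10)
h = torch.cat([hg, hb, ht], dim=0)                         # Eq. (11)
out = decoder(y, h)  # the decoder attends over the concatenated memory
      </preformat>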
    </sec>
    <sec id="sec-9">
      <title>3.3. BASIC STATISTICAL</title>
    </sec>
    <sec id="sec-10">
      <title>TRANSLATION</title>
    </sec>
    <sec id="sec-11">
      <title>MACHINE</title>
      <p>
        Phrase translation based on Bayes' rule forms the basis of Statistical Machine Translation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The best translation output sentence ebest is formulated as follows:
      </p>
      <p>ebest = argmaxe P(e|f) = argmaxe [P(f|e) PLM(e)] (12)
where f is the source sentence and e is the target sentence. PLM(e) and P(f|e) are the language model (LM) and the translation model (TM), respectively. The input text f is partitioned uniformly into a sequence of I phrases f̄1, ..., f̄I. Each foreign phrase f̄i is translated into an English phrase ēi. The translation model P(f|e) is decomposed into:</p>
      <p>P(f̄ | ē) = ∏ i=1..I φ(f̄i | ēi) d(αi − βi−1) (13)</p>
      <p>The phrase translation is modeled by the probability distribution φ(f̄i | ēi). The relative distortion probability distribution d(αi − βi−1) models the reordering of the output phrases.</p>
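      <p>As a toy illustration of Eq. (12), the sketch below picks the candidate e maximizing P(f|e) · PLM(e); the candidate set and all probabilities are invented numbers, not outputs of a real system:</p>
      <preformat>
# Toy noisy-channel decision (Eq. 12): score = P(f|e) * P_LM(e).
# The candidates and probability values are made up for illustration.
candidates = {
    "the house is small": {"tm": 0.30, "lm": 0.20},
    "small the house is": {"tm": 0.30, "lm": 0.01},  # same TM score, poor LM
    "the home is small":  {"tm": 0.15, "lm": 0.18},
}

e_best = max(candidates, key=lambda e: candidates[e]["tm"] * candidates[e]["lm"])
print(e_best)  # "the house is small": the LM factor rewards fluent word order
      </preformat>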
    </sec>
    <sec id="sec-12">
      <title>3.4. COMBINING MACHINE</title>
    </sec>
    <sec id="sec-13">
      <title>TRANSLATION OUTPUT WITH</title>
    </sec>
    <sec id="sec-14">
      <title>STATISTICAL SYSTEM</title>
      <p>The statistical combination approach uses three modules: alignment, decoding and scoring. The alignment module produces string alignments of the hypotheses generated by the different machine translation systems. The decoding step builds hypotheses from the aligned strings of the previous step using a beam search algorithm. The final scoring step estimates the final hypotheses.</p>
    </sec>
    <sec id="sec-15">
      <title>3.4.1. ALIGNMENT</title>
      <p>
        The single best outputs d1, d2, ..., dm from each of the m participating systems are taken into consideration. We take sentence pairs di and dj and align strings between them. For m sentences, m(m-1)/2 possible sentence pairs need to be aligned. A string w1 in sentence di is aligned to a string w2 in sentence dj under two conditions: either w1 and w2 are the same, or w1 and w2 have a Wu and Palmer [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] similarity score &gt; δ. METEOR [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is used in its sentence alignment configuration.
      </p>
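      <p>A sketch of the two alignment conditions using NLTK's WordNet interface; the threshold δ = 0.8 is an assumed placeholder, not a value reported in the paper:</p>
      <preformat>
# Align word w1 to w2 if the strings match, or their best Wu-Palmer
# similarity over WordNet synsets exceeds delta (assumed delta = 0.8).
# Requires: nltk.download('wordnet')
from itertools import product
from nltk.corpus import wordnet as wn

def wup(w1, w2):
    """Best Wu-Palmer similarity over all synset pairs of w1 and w2."""
    pairs = product(wn.synsets(w1), wn.synsets(w2))
    return max((s1.wup_similarity(s2) or 0.0 for s1, s2 in pairs), default=0.0)

def aligned(w1, w2, delta=0.8):
    return w1 == w2 or wup(w1, w2) > delta

print(aligned("car", "car"))         # True: identical strings
print(aligned("car", "automobile"))  # True: synonyms, Wu-Palmer score 1.0
      </preformat>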
    </sec>
    <sec id="sec-16">
      <title>3.4.2. DECODING</title>
      <p>In decoding, the best outputs of the participating systems are combined to form a set of hypotheses. The first word of a sentence is used to start a hypothesis. At any moment, the search can shift to a different sentence, or the addition of new words can continue using words from the current sentence. A word w added to the hypothesis is either taken from the best output di, or the search shifts to a different output dj; on shifting, the first leftover word of that output sentence is added to the hypothesis. With this method, a hypothesis can be built from various system outputs. If a hypothesis consists of at most a single word from each set of aligned words, there is little possibility of duplication.</p>
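      <p>The following sketch conveys the idea of growing a hypothesis word by word, either continuing in the current system output or shifting to another; the example outputs and the simplified bookkeeping are illustrative assumptions:</p>
      <preformat>
# Grow a hypothesis by taking the next unused word from any system output;
# staying with the same index continues, changing index models a shift.
outputs = [
    "the house is small".split(),      # d1: best output of system 1
    "the home is very small".split(),  # d2: best output of system 2
]

def extensions(hyp, positions):
    """Yield every one-word extension of hyp and the updated positions."""
    for i, sent in enumerate(outputs):
        if positions[i] &lt; len(sent):
            nxt = list(positions)
            nxt[i] += 1
            yield hyp + [sent[positions[i]]], tuple(nxt)

for new_hyp, new_pos in extensions([], (0, 0)):
    print(new_hyp, new_pos)  # ['the'] starting from d1 or from d2
      </preformat>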
      <p>Since the search space switches easily between sentences, maintaining adequacy and fluency is difficult. Therefore, hypothesis length, language model probability, and the number of n-gram matches between each individual system's output and the hypothesis are used as features for a complete hypothesis.</p>
    </sec>
    <sec id="sec-17">
      <title>3.4.3. BEAM SEARCH</title>
      <p>In this search, a number of equal-length hypotheses is assigned to each beam. Hypotheses are recombined based on feature state, the part of the hypothesis history required by the features, and search space hashing. Pointers to the recombined hypotheses, which are packed into a single hypothesis, are maintained, enabling k-best extraction.</p>
    </sec>
    <sec id="sec-17a">
      <title>3.4.4. SCORING</title>
      <p>
        The decoding step outputs m-best lists. Language model probability [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and Word Mover's Distance [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] are used to score the m-best list. The m-best list is represented as h1, h2, h3, ..., hp and the score of each hi is calculated. Minimum error rate training (MERT) [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] is used to learn the feature weights.
      </p>
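      <p>A hedged sketch of the scoring step: each hypothesis receives a weighted sum of its feature scores. The feature values and weights below are invented for illustration; in practice the weights would come from MERT:</p>
      <preformat>
# Score each m-best hypothesis as a weighted feature sum; MERT would tune
# the weights, and both features and weights here are illustrative numbers.
hypotheses = {
    "the house is small": {"lm_logprob": -4.1, "neg_wmd": -0.7},
    "the home is small":  {"lm_logprob": -4.6, "neg_wmd": -0.5},
}
weights = {"lm_logprob": 0.6, "neg_wmd": 0.4}  # assumed MERT-tuned weights

def score(feats):
    return sum(weights[k] * v for k, v in feats.items())

best = max(hypotheses, key=lambda h: score(hypotheses[h]))
print(best)
      </preformat>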
    </sec>
    <sec id="sec-18">
      <title>4. EXPERIMENTATION AND</title>
    </sec>
    <sec id="sec-19">
      <title>RESULTS</title>
      <p>
        The proposed approach is tested on the HindEnCorp [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] dataset, which contains 273,880 sentences. For the training and development sets, we use 272,880 sentences (267,880 for training + 5k for tuning) for the statistical system and the Transformer. The test set contains 1k sentences. The outputs of Google Translate (https://translate.google.com/) and Bing Microsoft Translator (https://www.bing.com/translate) are combined with SMT and the Transformer. Our proposed architecture is trained along with the outputs of the various translated sentences.
      </p>
      <p>We also trained and tested our approach on a dataset from ILCC (Institute for Language, Cognition and Computation) for English to Hindi, which contains 43,396 sentences. For Hindi to English translation, we used the TDIL-DC (Indian Language Technology Proliferation and Deployment Centre) dataset, divided into tourism (25k sentences) and health (25k sentences) domains. The Hindi to English language pair is therefore trained and tested on 50k sentences.</p>
      <p>
        We train Statistical Machine Translation with Kneser-Ney smoothing [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] for the probability distribution of a 4-gram language model (LM) built with IRSTLM [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. The Moses decoder [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] finds the highest scoring sentence for the phrase-based system. The model learns word alignments using GIZA++ [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] with the grow-diag-final heuristic.
      </p>
      <p>
        The Transformer system (implemented with tensor2tensor, https://github.com/tensorflow/tensor2tensor) contains six encoder and six decoder layers, eight attention heads, and a feed-forward inner-layer size of 2048, with dropout = 0.1. The hidden state and word embedding dimension dmodel is 512. We limit the maximum sentence length to 256, and the input and output tokens per batch are limited to 2048. We use the Adam optimizer [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] with β1 = 0.90, β2 = 0.98 and ϵ = 10^-9. Further, we use a length penalty α = 0.6 and beam search with a beam size of 4.
      </p>
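      <p>For reference, the reported configuration can be written down as the following PyTorch sketch; the authors used tensor2tensor, so this is an analogous setup under our assumptions, not their exact code:</p>
      <preformat>
# The reported Transformer hyperparameters, expressed with PyTorch modules
# for concreteness; an analogous setup, not the authors' tensor2tensor code.
import torch
from torch import nn

model = nn.Transformer(
    d_model=512,            # hidden state / word embedding size
    nhead=8,                # attention heads
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,   # feed-forward inner-layer size
    dropout=0.1,
)
optimizer = torch.optim.Adam(model.parameters(),
                             betas=(0.9, 0.98), eps=1e-9)
# Decoding used beam size 4 with length penalty alpha = 0.6.
      </preformat>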
      <p>
        The Bilingual Evaluation Understudy (BLEU) [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] is selected as the primary evaluation metric. It evaluates the quality of machine-translated text against human translations. BLEU scores are calculated for translated sentences by comparing them with good quality reference sentences, and the scores are normalized over the complete dataset to estimate overall quality. BLEU always produces a score between 0 and 1, although it is usually reported as a percentage. The closer the BLEU score is to 1, the better the accuracy of the system.
      </p>
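      <p>A minimal example of computing corpus-level BLEU with NLTK; the sentences are toy data:</p>
      <preformat>
# Corpus BLEU with NLTK; each hypothesis has a list of reference sentences.
from nltk.translate.bleu_score import corpus_bleu

references = [[["the", "cat", "is", "on", "the", "mat"]]]
hypotheses = [["the", "cat", "is", "on", "the", "mat"]]

score = corpus_bleu(references, hypotheses)  # in [0, 1]; 1.0 for exact match
print(f"BLEU = {100 * score:.1f}")           # commonly reported as a percentage
      </preformat>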
      <p>The results on different test sets are obtained for the Hindi to English (Hi-Eng) and English-Hindi (Eng-Hi) language pairs. It is evident from Table 1 that translation system combination gives better results than the individual systems; e.g., SMT combined with Google and Bing improved by approximately 4 BLEU points over SMT alone for all language pairs. The output of an individual system contains some erroneous or untranslated words, but selecting the best phrase among the different translated outputs (Google and Bing) generated by the participating systems makes the target sentence more accurate.</p>
      <p>Table 1. BLEU scores, including the Eng-Hi (ILCC) test set.</p>
      <p>We also observe in Table 1 that increasing the number of MT systems does not necessarily improve accuracy; i.e., SMT, Transformer, Google and Bing together achieved a BLEU score 2 points lower than SMT, Google and Bing. The BLEU score achieved by using SMT, Bing Microsoft and Google Translator together is the highest. The scores of Transformer, Google and Bing are also better than using all translation models together, with BLEU improved by 1 point. The scores obtained on the TDIL-DC and ILCC datasets are lower than on HindEnCorp because those datasets are much smaller. The overall accuracy of our translation output using the reverse language pair is improved by combining the better parts of the outputs. However, the BLEU scores do not improve much in our approach. The main reason is that errors occurring in the external machine translation systems are also reflected in the combination approach. Hence, by removing these errors, we will try to achieve better results in the future.</p>
    </sec>
    <sec id="sec-20">
      <title>5. CONCLUSION</title>
      <p>In this paper, we investigated the approach of combining various translation outputs with statistical machine translation and the Transformer, which improves the final translation. The proposed method increases the complexity of the overall system. Experiments on Hindi to English and English to Hindi show that incorporating different system outputs achieves better results than any individual system. In the future, we will extend this approach to the translation of other language pairs and to tasks like text abstraction and sentence compression. We will also try to incorporate the BERT model into neural-based English to Hindi translation and explore graph-based encoder-decoder translation methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A Statistical</given-names>
            <surname>Translation</surname>
          </string-name>
          <article-title>Approach by Network Model, in: Recent Developments in Intelligent Computing, Communication</article-title>
          and Devices, Springer, Singapore (
          <year>2019</year>
          ), pp.
          <fpage>325</fpage>
          -
          <lpage>331</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Toral</surname>
          </string-name>
          , Martijn Wieling, and
          <string-name>
            <given-names>Andy</given-names>
            <surname>Way</surname>
          </string-name>
          .
          <article-title>Post-editing effort of a novel with statistical and neural machine translation</article-title>
          .
          <source>Frontiers in Digital Humanities</source>
          ,
          <volume>5</volume>
          .9(
          <year>2018</year>
          ). doi:
          <volume>10</volume>
          .3389/fdigh.
          <year>2018</year>
          .
          <volume>00009</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. Zong,
          <source>Neural System Combination for Machine Translation. arXiv preprint arXiv:1704.06393</source>
          (
          <year>2017</year>
          ).doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P17</fpage>
          -2060
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <source>Improved Neural Machine Translation with SMT Features, in: Thirtieth AAAI conference on artificial intelligence</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Attri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Prasad</surname>
          </string-name>
          , G. Ramakrishna,
          <string-name>
            <surname>G</surname>
          </string-name>
          , HiPHET
          <article-title>: A Hybrid Approach to Translate Code Mixed Language (Hinglish) to Pure Languages (Hindi and English)</article-title>
          .
          <source>Computer Science</source>
          ,
          <volume>21</volume>
          .3 (
          <year>2020</year>
          ). doi:
          <volume>10</volume>
          .7494/csci.
          <year>2020</year>
          .
          <volume>21</volume>
          .3.3624.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nirenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Frederking</surname>
          </string-name>
          ,
          <article-title>Toward Multi-Engine Machine Translation</article-title>
          .
          <source>In Human Language Technology: Proceedings of a Workshop</source>
          , Plainsboro, New Jersey (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Matusov</surname>
          </string-name>
          , U. Nicola,
          <article-title>Computing Consensus Translation for Multiple Machine Translation Systems using Enhanced Hypothesis Alignment</article-title>
          .
          <source>in: 11th Conference of the European Chapter of the Association for Computational Linguistics</source>
          , (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>González-Rubio</surname>
          </string-name>
          , C. Francisco, Minimum Bayes'
          <article-title>Risk Subsequence Combination for Machine Translation</article-title>
          .
          <source>Pattern Analysis and Applications</source>
          (
          <year>2015</year>
          )
          <fpage>523</fpage>
          -
          <lpage>533</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10044-014-0387-5.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          , Y. Liu,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Lattice-Based System Combination for Statistical Machine Translation</article-title>
          .
          <source>in: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>1105</fpage>
          -
          <lpage>1113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.</surname>
          </string-name>
          <article-title>McKeown, System Combination for Machine Translation through Paraphrasing</article-title>
          .
          <source>in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1053</fpage>
          -
          <lpage>1058</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D15</fpage>
          -1122.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Sentence-Level Paraphrasing for Machine Translation System Combination</article-title>
          . in: International Conference of Pioneering Computer Scientists, Engineers and Educators, Springer, Springer, Singapore,
          <year>2016</year>
          , pp.
          <fpage>612</fpage>
          -
          <lpage>620</lpage>
          . doi:
          <volume>10</volume>
          .1007/
          <fpage>978</fpage>
          -981-
          <fpage>10</fpage>
          -2053-
          <volume>7</volume>
          _
          <fpage>54</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Heafield</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Lavie,</surname>
          </string-name>
          <article-title>Combining Machine Translation Output with Open Source: The Carnegie Mellon MultiEngine Machine Translation Scheme</article-title>
          .
          <source>The Prague Bulletin of Mathematical Linguistics</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>36</lpage>
          . doi:
          <volume>10</volume>
          .2478/v10108-010-0008-4.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is All You Need</article-title>
          .
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Currey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Heafield</surname>
          </string-name>
          ,
          <article-title>Incorporating source syntax into transformer-based neural machine translation</article-title>
          .
          <source>in: Proceedings of the Fourth Conference on Machine Translation</source>
          , vol.
          <volume>1</volume>
          ,
          <issue>2019</issue>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>P. Koehn,</surname>
          </string-name>
          <article-title>Statistical machine translation</article-title>
          . Cambridge University Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Palmer Verb Semantics and Lexical Selection</article-title>
          .
          <source>arXiv preprint cmplg/9406033</source>
          (
          <year>1994</year>
          ).doi:
          <volume>10</volume>
          .3115/981732.981751
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <surname>METEOR:</surname>
          </string-name>
          <article-title>An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments</article-title>
          .
          <source>in: Proceedings of the Acl Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brants</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.</given-names>
            <surname>Popat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.J.</given-names>
            <surname>Och</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <source>Large Language Models in Machine Translation</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kusner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          , From Word Embeddings to Document Distances.
          <source>in: International Conference on Machine Learning</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>957</fpage>
          -
          <lpage>966</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zaidan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z-MERT</surname>
          </string-name>
          :
          <article-title>A Fully Configurable Open Source Tool for Minimum Error Rate Training of Machine Translation Systems</article-title>
          .
          <source>The Prague Bulletin of Mathematical Linguistics</source>
          ,
          <volume>91</volume>
          .1 (
          <year>2009</year>
          ):
          <fpage>79</fpage>
          -
          <lpage>88</lpage>
          . doi:
          <volume>10</volume>
          .2478/v10108-009-0018-2.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Diatka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rychlỳ</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stranák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Suchomel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamchyna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zeman</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hindencorp-</surname>
          </string-name>
          Hindi-English and
          <article-title>Hindi-only Corpus for Machine Translation</article-title>
          .
          <source>in: Proceedings of the 9th International Conference on Language Resources and Evaluation</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>3550</fpage>
          -
          <lpage>3555</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kneser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          ,
          <article-title>Improved Backing-off for m-gram Language Modeling</article-title>
          . In: 1995 International Conference on Acoustics, Speech, and Signal Processing, IEEE,
          <year>1995</year>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>184</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP.
          <year>1995</year>
          .479394
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Federico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bertoldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cettolo</surname>
          </string-name>
          ,
          <string-name>
            <surname>IRSTLM:</surname>
          </string-name>
          <article-title>An Open Source Toolkit for Handling Large Scale Language Models</article-title>
          .
          <source>in: Ninth Annual Conference of the International Speech Communication Association</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>Hoang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <article-title>Design of the Moses Decoder for Statistical Machine Translation</article-title>
          .
          <source>in: Proceedings of Software Engineering, Testing, and Quality Assurance for Natural Language Processing</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>58</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Och</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ney</surname>
          </string-name>
          ,
          <article-title>A Systematic Comparison of Various Statistical Alignment Models</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>29</volume>
          .1 (
          <issue>2003</issue>
          ),
          <fpage>19</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.J.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <article-title>A method for Stochastic Optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W. Zhu,
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          .
          <source>in: Proceedings of the 40th annual meeting on association for computational linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . doi:
          <volume>10</volume>
          .3115/1073083.1073135.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>