<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Debugging Neural Machine Translations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matīss Rikters</string-name>
          <email>matiss.rikters@tilde.lv</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tilde</institution>
          ,
          <addr-line>Vienības gatve 75A, Riga, Latvia, LV-1004</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe a tool for debugging the output and attention weights of neural machine translation (NMT) systems and for improved estimations of confidence about the output based on the attention. The purpose of the tool is to help researchers and developers find weak and faulty example translations that their NMT systems produce without the need for reference translations. Our tool also includes an option to directly compare translation outputs from two different NMT engines or experiments. In addition, we present a demo website of our tool with examples of good and bad translations: http://attention.lielakeda.lv.</p>
      </abstract>
      <kwd-group>
        <kwd>Neural machine translation</kwd>
        <kwd>Visualization tool</kwd>
        <kwd>Attention mechanism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Machine translation (MT), the automated translation of texts from one language
into another, is one of the primary use cases of the modern computer and has evolved
vastly since its early days in the 1950s. There have been several large paradigm
shifts that have greatly impacted the field of MT: rule-based MT (RBMT),
statistical MT (SMT) and neural network MT (NMT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. With each paradigm
shift, the understanding of how the system produces its final translation
has changed: from fully transparent in the case of RBMT, to less clear but often
still predictable in SMT, to often completely unpredictable in NMT. Many of
the existing tools for inspecting the results of statistical phrase-based approaches
are either incompatible with neural network generated output or serve little
purpose in dealing with it.
      </p>
      <p>In this paper, we propose a tool for browsing, inspecting and comparing
translations that is specifically designed for NMT output. The tool takes the attention
weights for specific token pairs, which are generated during the
decoding process, and turns them into one of several visual representations that
can help humans better understand how the output translations were produced.
Aside from visualizing attention alignments, the tool also uses them to
estimate the confidence in a translation, which makes it possible to distinguish acceptable
outputs from completely unreliable ones. No reference translations are
required for this.</p>
      <p>The structure of this paper is as follows: Section 1.1 summarizes related work
on tools for inspecting translation outputs and alignments; Section 2 introduces
some concepts of the baseline tool (how it scores translations and displays the
visualizations in different environments) and outlines the improvements
made to make it more useful for debugging machine translation output. In Section
3 we give an overview of how to make the most of our tool in finding odd
translations, what to look for when comparing them and possible causes of errors.
Finally, we conclude in Section 4 and introduce plans for directions of future work
and research in the area.</p>
      <p>1.1 Related Work</p>
      <p>
        Our tool builds on the work of Rikters et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], who
introduce visualization of NMT attention and use attention-based scoring of
NMT as described by Rikters and Fishel [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. While the baseline tool is generally useful
for quickly finding sentences with “scrambled” attention alignments, it has
several flaws, such as considering completely untranslated sentences as good. This
consistently misleads users when they sort data sets by confidence and look
for the highest-scoring examples. Another shortcoming is that it can
visualize a translation from only one system at a time, making it difficult to
directly compare how multiple systems handle the same inputs.
      </p>
      <p>
        In contrast, both iBLEU [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], a web-based tool for visualizing BLEU [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
scores, and MT-ComparEval [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which builds upon iBLEU by adding
supplementary visualizations, scores and metrics, can easily work with multiple
MT outputs and even a set of human references. A downside of these tools
is that a reference translation set is always mandatory.
While references are useful for verifying how the system performs in a controlled environment
(when the expected result, i.e. the reference translations, is known beforehand), more
often than not the strangest abnormalities appear when using arbitrary data.
      </p>
      <p>
        NMT frameworks like Nematus [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Neural Monkey [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or OpenNMT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
have some forms of visualization, but they mainly handle representation of the
translation process instead of the translation results. For instance, OpenNMT
has a separate repository for visualization tools<sup>1</sup> that can generate visualizations
of embeddings or beam search. Neural Monkey utilizes the built-in visualizations
of TensorFlow [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that can show the compute graph and multiple types of
histograms from the training progress.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Visualization Tool</title>
      <p>
        The basis of our visualization tool is described in full detail in the baseline paper
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. It requires source and translated sentences along with the corresponding
attention alignments from NMT systems as input files and can provide a visual
overview in a command line environment (Linux Terminal or Windows
PowerShell) or in a web browser on any modern device. It is published in a GitHub
repository<sup>2</sup> and open-sourced under the MIT License.
      </p>
      <sec id="sec-2-1">
        <title>1 VisTools - https://github.com/OpenNMT/VisTools</title>
        <p>In the following subsections
of the paper, we outline only the core components and focus on highlighting
improvements and differences.</p>
        <p>
          In addition to Nematus, Neural Monkey and Marian<sup>3</sup> [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we have also added
out-of-the-box support for working with attention alignments from the OpenNMT
and Sockeye<sup>4</sup> [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] frameworks.
        </p>
        <p>2.1 Confidence Scores</p>
        <p>This section outlines how the confidence scores are calculated and
how the final score differs from the baseline.</p>
        <p>The four main metrics that we use for scoring translations are:</p>
        <p>– Coverage Deviation Penalty (CDP) penalizes attention deficiency and
excessive attention per input token.</p>
        <p>$$\mathrm{CDP} = -\frac{1}{L_s} \sum_{j} \log\left(1 + \left(1 - \sum_{i} \alpha_{ji}\right)^{2}\right) \qquad (1)$$</p>
        <p>– Absentmindedness Penalties (APout, APin) penalize output tokens that pay
attention to too many input tokens, or input tokens that produce too many
output tokens.</p>
        <p>$$\mathrm{AP}_{out} = -\frac{1}{L_s} \sum_{i} \sum_{j} \alpha_{ji} \log \alpha_{ji} \qquad (2)$$</p>
        <p>$$\mathrm{AP}_{in} = -\frac{1}{L_s} \sum_{j} \sum_{i} \alpha_{ij} \log \alpha_{ij} \qquad (3)$$</p>
        <p>– Overlap Penalty (OP) penalizes translations that copy large fractions from
source sentences. A stronger penalty is allocated to longer sentences that
copy large amounts from the source, while shorter ones get more tolerance
(e.g., the three-word English sentence “Thanks Barack Obama.” can be
perfectly translated into “Paldies Barack Obama.” although 2/3 of the words in the
translation are the same as in the source).</p>
        <p>$$\mathrm{OP} = (0.8 + (L_t \cdot 0.01)) \cdot (3 - ((1 - S) \cdot 5)) \cdot (0.7 + S) \cdot \tan(S) \qquad (4)$$</p>
        <p>– Confidence is the sum of the three main metrics (CDP, APin and APout)
and, when the similarity between input and output
sentences is high (similarity &gt; 0.3), the similarity penalty.</p>
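        <p>The metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the function names, the matrix layout (att[t][s] is the weight that output token t places on source token s), the sign convention (penalties are non-positive, with zero as the best value) and the column renormalization used for APin are all assumptions of this sketch. How the overlap penalty is signed and scaled when it is added to the final confidence is not specified in this section, so the sketch stops at the individual components.</p>
        <preformat><![CDATA[
```python
import math


def coverage_deviation_penalty(att):
    """CDP: penalize source tokens whose total received attention deviates from 1.

    att[t][s] is the attention weight from output token t to source token s
    (this layout is an assumption of the sketch).
    """
    src_len = len(att[0])
    total = 0.0
    for s in range(src_len):
        coverage = sum(row[s] for row in att)  # attention received by source token s
        total += math.log(1.0 + (1.0 - coverage) ** 2)
    return -total / src_len


def _entropy_sum(rows):
    """Sum of -w * log(w) over all non-zero weights."""
    return -sum(w * math.log(w) for row in rows for w in row if w > 0.0)


def absentmindedness_out(att):
    """APout: dispersion of each output token's attention over the source tokens."""
    return -_entropy_sum(att) / len(att[0])


def absentmindedness_in(att):
    """APin: dispersion of each source token's attention over the output tokens.

    Columns are renormalized to sum to one first (an assumption of this sketch).
    """
    normalized = []
    for col in zip(*att):
        total = sum(col)
        normalized.append([w / total for w in col] if total > 0.0 else list(col))
    return -_entropy_sum(normalized) / len(att[0])


def overlap_penalty(target_len, similarity, threshold=0.3):
    """OP as in the formula above; zero unless similarity exceeds the threshold."""
    if similarity <= threshold:
        return 0.0
    return ((0.8 + target_len * 0.01)
            * (3.0 - (1.0 - similarity) * 5.0)
            * (0.7 + similarity)
            * math.tan(similarity))
```
]]></preformat>
        <p>With a perfectly one-to-one (identity) attention matrix, all three attention-based penalties in this sketch evaluate to zero, which is why a purely attention-based score cannot tell an untranslated copy of the source from a good translation.</p>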
      </sec>
      <sec id="sec-2-2">
        <title>2 NMT Attention Alignment</title>
        <p>Visualizations: https://github.com/M4t1ss/SoftAlignments
3 Marian: https://github.com/marian-nmt/marian
4 Sockeye: https://github.com/awslabs/sockeye</p>
        <p>Source: Kepler measures spin rates of stars in Pleiades cluster
Hypothesis: Kepler measures spin rates of stars in Pleiades cluster
Reference: Keplers izmēra zvaigžņu griešanās ātrumu Plejādes zvaigznājā.</p>
        <p>In all of the metrics, $L_s$ is the length of the source sentence; $L_t$ is the length of the
target sentence; $S$ is the similarity between the source sentence and the translation
on a scale of 0 to 1; and $\alpha_{ji}$ is the attention weight between source token $i$ and
translation token $j$.</p>
        <p>The final confidence score has been changed: we first
calculate the similarity ratio between the input and output sentences and then add a
further penalty only if the similarity is high enough. The similarity is calculated
by finding the longest contiguous matching subsequence.</p>
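        <p>Python's standard difflib module provides a matcher built around the same idea of repeatedly finding the longest contiguous matching subsequence, so a similarity ratio of this kind can be sketched as follows. Whether the tool computes its ratio exactly this way, and on characters rather than on tokens, is an assumption of this sketch; the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
from difflib import SequenceMatcher


def similarity_ratio(source, hypothesis):
    """Return a similarity value between 0 and 1 for two strings.

    The comparison is done character by character (an assumption of this sketch).
    """
    return SequenceMatcher(None, source, hypothesis).ratio()


# An untranslated hypothesis is identical to its source.
copied = "Kepler measures spin rates of stars in Pleiades cluster"
print(similarity_ratio(copied, copied))

# A short sentence that legitimately shares most of its words with the source.
print(similarity_ratio("Thanks Barack Obama.", "Paldies Barack Obama."))
```
]]></preformat>
        <p>Both examples exceed the 0.3 threshold mentioned above, which is why sentence length also has to be taken into account before a short, correct translation is penalized.</p>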
        <p>Since the baseline confidence score considered only the attention alignments
when calculating the final value, examples like the one shown in Figure 1 received
particularly high values due to their consistent one-to-one attention alignments. The
updated score addresses this problem by penalizing hypothesis sentences that are
overly similar to the input source.</p>
        <p>The web interface is the primary point of interaction with the tool. Aside from
browsing visualizations, ordering data sets by confidence scores and exporting
visualizations as images, which are all explained in the baseline paper, we introduce
several significant changes to the system. The first is a technical update to
how data is served: loading is performed asynchronously in the background,
which eliminates long wait times when viewing subsequent sentences in a
large data set. The three major additions are:
– the addition of a source-translation overlap percentage alongside the four base
scores (Section 2.3);
– the ability to provide reference translations, if available, to display next to
the hypothesis and calculate BLEU scores (Section 2.4);
– the ability to directly compare translations and alignments from two different
NMT systems (Section 2.5).</p>
        <p>2.3 Overlap</p>
        <p>As mentioned in Section 2.1, the updated confidence score treats long hypothesis
translations that overlap significantly with the source
sentence as worse translations, while tolerating considerable overlap for shorter
sentences. In addition to contributing to the final confidence score, the overlap
ratio has been added as an individual score for sorting, navigating and comparing
sentences from a data set, as shown in Figure 2. The system also underlines the
longest matching substring between the source and translation in cases where
the overlap is high enough (over 10%). An example is shown in Figure 2, where
the overlap ratio is 20.19%.</p>
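        <p>Finding the substring to underline can be sketched with the standard difflib module. This is an illustration under assumptions, not the tool's code: the function name is hypothetical, the comparison is character-based, and the example strings are shortened versions of the sentence pair discussed in this section. How the overlap percentage itself is computed is not specified here, so the sketch only locates the match.</p>
        <preformat><![CDATA[
```python
from difflib import SequenceMatcher


def longest_match(source, hypothesis):
    """Return the longest contiguous substring shared by both strings."""
    matcher = SequenceMatcher(None, source, hypothesis, autojunk=False)
    match = matcher.find_longest_match(0, len(source), 0, len(hypothesis))
    return source[match.a:match.a + match.size]


source = "see 0,2 mg/ml kuni 0,8 mg/ml ( 0,9 mg/ml Küprosel )"
hypothesis = "на 0,2 mg/ml до 0,8 mg/ml ( 0,9 mg/ml на Кипре )"

# Surrounding spaces are part of the raw match, so they are stripped for display.
print(longest_match(source, hypothesis).strip())
```
]]></preformat>
        <p>For this pair the printed substring is the same span that is reported as the match in the example above.</p>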
        <p>Source: see 0,2 mg/ml kuni 0,8 mg/ml ( 0,9 mg/ml Küprosel ) ning mõnedes
riikides ei tohi sõiduki juhtimise ajal veres üldse alkoholi olla.</p>
        <p>Hypothesis: на 0,2 mg/ml до 0,8 mg/ml ( 0,9 mg/ml на Кипре ) , и в некоторых
странах в крови не может быть алкоголя.</p>
        <p>Match: 0,8 mg/ml ( 0,9 mg/ml</p>
        <p>We believe that simply displaying the reference next to the hypothesis is helpful
more often than not. Provided references also make it possible to calculate BLEU
scores for the translations, adding yet another dimension for sorting (Figure
2). Unlike overlap, the BLEU scores do not influence the overall confidence scores.</p>
        <p>2.5 Comparing Translations</p>
        <p>The final major addition to the tool is the option to directly compare two
translations of the same source sentence. To perform the comparison, all source
sentences for both input data sets must match, but the target sentences may differ
in output token order as well as count. Comparisons may be performed
between translations obtained from any two of the five currently supported NMT
frameworks (Nematus, Neural Monkey, OpenNMT, Marian and Sockeye) or even
an arbitrary input file, as long as it is formatted according to the specification
provided in the readme<sup>5</sup>.</p>
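        <p>The precondition that the source sentences of both data sets match can be checked before a comparison is attempted. The following is a minimal sketch with hypothetical function names; treating sentences as equal after trimming surrounding whitespace is an assumption, not documented behaviour of the tool.</p>
        <preformat><![CDATA[
```python
def sources_match(sources_a, sources_b):
    """True if both data sets contain the same source sentences in the same order."""
    if len(sources_a) != len(sources_b):
        return False
    return all(a.strip() == b.strip() for a, b in zip(sources_a, sources_b))


def first_mismatch(sources_a, sources_b):
    """Index of the first differing source sentence, or None if there is none."""
    for index, (a, b) in enumerate(zip(sources_a, sources_b)):
        if a.strip() != b.strip():
            return index
    if len(sources_a) != len(sources_b):
        # One list is a prefix of the other; the mismatch starts where it ends.
        return min(len(sources_a), len(sources_b))
    return None
```
]]></preformat>
        <p>Reporting the index of the first mismatch makes it easier to spot a data set that has been shuffled or truncated by one of the systems.</p>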
        <p>Figure 3 shows an example comparison of a sentence translated by two
different NMT systems. The top row shows the source text, and the bottom rows
show the output of each NMT system, color-coded to match the
colors of the alignment lines. The second hypothesis (in green) exhibits stronger
and more reliable alignments to the content words, while the first shows
strong alignments coming from the sentence-final full stop. In this example neither hypothesis
matches the reference. However, since the reference is only two words long for a source sentence
of triple the length, this hints at an oversimplified translation by the translator
(assuming English was the original) and does not mean that both hypotheses are
completely wrong. In fact, the second hypothesis is a fairly decent representation
of the source sentence.</p>
        <p>Figure 4 illustrates another example with strong attention alignments and a
high overlap ratio (94.03%) between source and translated sentences from one
system compared to a weak, but at least better translation from another system.
The final confidence score for the second translation is strongly influenced by
the high overlap, even though the sentence is not particularly long. In similar
conditions, the confidence score of the second hypothesis calculated by the
baseline system would be very close to 100% due to its complete disregard for the
actual words of the source and hypothesis sentences.
</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Recipes for Debugging</title>
      <p>In this section we summarise several tips and tricks that may come in handy
when using the tool to look for faulty translations of various kinds. Here we
also list common causes associated with the problems. Some peculiarities to pay
attention to may include:
– Short sentences with a low confidence, CDP, APin or APout</p>
      <p>Not all of the metrics need to be low, but translations
for which at least one of them is under 30% are often worth looking into.</p>
      <p>5 Using other input formats - https://github.com/M4t1ss/SoftAlignments#how-to-get-alignment-files-from-nmt-systems</p>
      <p>Source: the loss was by the team.</p>
      <p>Hypothesis 1: zaudējums bija komandas biedrs.</p>
      <p>Hypothesis 2: šis zaudējums bija komandai.</p>
      <p>Reference: zaudē komanda.</p>
      <p>As stated before, for short sentences of only a few words it may be completely
normal to have an overlap of 50% or more, but if this occurs in sentences that
are 10 or more words long, it may indicate that the system has only partially
translated the source or has not translated anything at all. When completely
untranslated sentences are found, it is worth checking the training data for
any source-target sentence pairs that are identical. Removing them from the
training data should help.</p>
      <p>– Sentences with a low BLEU score, but normal or even high confidence, CDP,
APin and APout
The BLEU metric has its flaws, and one of them is comparing each hypothesis
to only one reference, although it is often possible to translate the same sentence
in several different ways. When the only low-scoring metric output
by the tool is the BLEU score, the translation is often perfectly
good, but just different from the reference. Such sentences are useful
examples to show that lower BLEU scores of neural MT systems do not
necessarily indicate lower-quality translations, and they are cheaper to find than
performing full manual human evaluations.</p>
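      <p>The thresholds mentioned in these recipes can be turned into simple filters. The sketch below is only an illustration: the function names, the dictionary of per-sentence scores expressed as percentages, and whitespace tokenization are assumptions, not part of the tool.</p>
      <preformat><![CDATA[
```python
def low_metrics(scores, threshold=30.0):
    """Names of the metrics (in percent) that fall below the threshold."""
    return [name for name, value in scores.items() if value < threshold]


def suspicious_overlap(sentence, overlap_percent, min_words=10, min_overlap=50.0):
    """True for long sentences whose overlap with the source is high."""
    return len(sentence.split()) >= min_words and overlap_percent >= min_overlap


def drop_identical_pairs(pairs):
    """Remove training pairs whose source and target sides are equal."""
    return [(src, tgt) for src, tgt in pairs if src.strip() != tgt.strip()]
```
]]></preformat>
      <p>For example, drop_identical_pairs could be applied to a parallel training corpus once completely untranslated outputs have been spotted, before the system is retrained.</p>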
      <p>A separate recommendation specifically for comparing two translations is to
look at the attention alignment lines and try to find source tokens
that are strongly aligned to different hypothesis tokens in the two translations while the
confidence scores remain relatively similar. Such differing tokens are often synonyms.</p>
      <p>Source: they did so just in time as Hindes emerged.</p>
      <p>Hypothesis 1: viņi to darīja tikai toreiz , kad parādījās hinduisti.</p>
      <p>Hypothesis 2: it did so just in time as Hindes emerged.</p>
      <p>Reference: viņiem tas izdevās pēdējā brīdī.</p>
      <p>In this paper, we described our conversion of a visualization tool into an
instrument for debugging output from neural machine translation systems by
improving the attention alignment scoring and confidence estimation of the baseline.
The tool is intended to help researchers better understand how their systems
perform by enabling them to quickly locate better and worse translations in arbitrary
test sets. Compared to other similar tools, ours relies on the confidence scores
and does not require reference translations to facilitate this easier navigation,
although additional features are enabled when references
are provided. This makes it possible to integrate the tool, for example, into an NMT system with
a web interface, providing users with an explanation for the result of a specific
translation.</p>
      <p>
        In a future version of the system we may include other reference-based MT
scoring metrics for more variety of scoring and sorting. Some examples of metrics
may include chrF [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or TER [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Another idea for future work would be to
list and order specific best, worst or interesting examples of translations. This
could be done by considering the recipes from Section 3.
      </p>
      <p>
        In addition to the reference-based metrics, some reference-less
approaches are yet to be utilised, for instance, ideas borrowed from parallel
corpora filtering [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] such as 1) source-hypothesis sentence length difference; 2)
language identification for the hypothesis; 3) digit mismatch between the source
and hypothesis; 4) foreign or corrupt symbol checking for the hypothesis.
      </p>
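      <p>Some of the reference-less checks listed above are simple enough to sketch directly. The functions below are hypothetical illustrations, not part of the tool or of the cited filtering pipeline; whitespace tokenization, the order-insensitive digit comparison and the choice of what counts as a corrupt symbol are assumptions. Language identification is left out because it needs an external model.</p>
      <preformat><![CDATA[
```python
import re
import unicodedata


def length_ratio(source, hypothesis):
    """Ratio of the shorter to the longer sentence, counted in whitespace tokens."""
    src_len, hyp_len = len(source.split()), len(hypothesis.split())
    longest = max(src_len, hyp_len)
    return min(src_len, hyp_len) / longest if longest else 1.0


def digits_mismatch(source, hypothesis):
    """True if the digit sequences of the two sentences differ (order is ignored)."""
    return sorted(re.findall(r"\d+", source)) != sorted(re.findall(r"\d+", hypothesis))


def has_corrupt_symbols(text):
    """True if the text has a replacement character or an unexpected control character."""
    return any(
        ch == "\ufffd" or (unicodedata.category(ch) == "Cc" and ch not in "\t\n\r")
        for ch in text
    )
```
]]></preformat>
      <p>Each check returns a value that could be used as an additional column for sorting, in the same way as the confidence and overlap scores.</p>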
      <p>
        Another ongoing challenge is to find a better way of representing attention
alignments generated by multi-layer neural networks. While in recurrent neural
network NMT systems this is rarely a problem, more modern approaches like
convolutional neural networks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and transformer neural networks [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] require
training of deeper models to achieve competitive quality translation results. This,
however, results in each layer paying attention only to a subset of the input
sentence. Even when all attentions are summed up, the result looks like every
source token is connected to every hypothesis token as can be seen in Figure 5.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The research has been supported by the European Regional Development Fund
within the research project “Neural Network Modelling for Inflected Natural
Languages” No. 1.1.1.1/16/A/215.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brevdo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Citro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Devin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>TensorFlow: Large-scale machine learning on heterogeneous distributed systems</article-title>
          .
          <source>arXiv preprint arXiv:1603.04467</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>CoRR abs/1409.0473</source>
          (
          <year>2014</year>
          ), http://arxiv.org/abs/1409.0473
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gehring</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grangier</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yarats</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dauphin</surname>
            ,
            <given-names>Y.N.</given-names>
          </string-name>
          :
          <article-title>Convolutional sequence to sequence learning</article-title>
          .
          <source>arXiv preprint arXiv:1705.03122</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Helcl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Libovický</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Neural Monkey: An open-source tool for sequence learning</article-title>
          .
          <source>The Prague Bulletin of Mathematical Linguistics</source>
          pp.
          <fpage>5</fpage>
          -
          <lpage>17</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1515/pralin-2017-0001
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hieber</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domhan</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denkowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sokolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clifton</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Post</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Sockeye: A Toolkit for Neural Machine Translation</article-title>
          . ArXiv e-prints (
          <year>Dec 2017</year>
          ), https://arxiv.org/abs/1712.05690
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Junczys-Dowmunt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grundkiewicz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dwojak</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heafield</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neckermann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seide</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Germann</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aji</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogoychev</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martins</surname>
            ,
            <given-names>A.F.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Marian: Fast neural machine translation in C++</article-title>
          .
          <source>arXiv preprint arXiv:1804.00344</source>
          (
          <year>2018</year>
          ), https://arxiv.org/abs/1804.00344
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senellart</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rush</surname>
            ,
            <given-names>A.M.:</given-names>
          </string-name>
          <article-title>OpenNMT: Open-Source Toolkit for Neural Machine Translation</article-title>
          . ArXiv e-prints (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Klejch</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Avramidis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burchardt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>MT-ComparEval: Graphical evaluation interface for machine translation development</article-title>
          .
          <source>The Prague Bulletin of Mathematical Linguistics</source>
          <volume>104</volume>
          (
          <issue>1</issue>
          ),
          <fpage>63</fpage>
          -
          <lpage>74</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Madnani</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>iBLEU: Interactively debugging and scoring statistical machine translation systems</article-title>
          .
          <source>In: Semantic Computing (ICSC), 2011 Fifth IEEE International Conference on</source>
          . pp.
          <fpage>213</fpage>
          -
          <lpage>214</lpage>
          . IEEE (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Papineni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          . . . .
          <source>of the 40th Annual Meeting on . .</source>
          . pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          (
          <year>2002</year>
          ). https://doi.org/10.3115/1073083.1073135, http://dl.acm.org/citation.cfm?id=1073135
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Pinnis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krišlauks</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miks</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deksne</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Šics</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Tilde's machine translation systems for WMT 2017</article-title>
          .
          <source>In: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers</source>
          . pp.
          <fpage>374</fpage>
          -
          <lpage>381</lpage>
          . Association for Computational Linguistics, Copenhagen, Denmark (September
          <year>2017</year>
          ), http://www.aclweb.org/anthology/W17-4737
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Popović</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>chrF: character n-gram F-score for automatic MT evaluation</article-title>
          .
          <source>In: Proceedings of the Tenth Workshop on Statistical Machine Translation</source>
          . pp.
          <fpage>392</fpage>
          -
          <lpage>395</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Rikters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fishel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bojar</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Visualizing neural machine translation attention and confidence</article-title>
          .
          <source>The Prague Bulletin of Mathematical Linguistics</source>
          <volume>109</volume>
          (
          <issue>1</issue>
          ),
          <fpage>39</fpage>
          -
          <lpage>50</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rikters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fishel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Confidence through attention</article-title>
          .
          <source>In: Proceedings of The 16th Machine Translation Summit</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Sennrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firat</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haddow</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitschler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Junczys-Dowmunt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Läubli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barone</surname>
            ,
            <given-names>A.V.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mokry</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , et al.:
          <article-title>Nematus: a toolkit for neural machine translation</article-title>
          .
          <source>EACL 2017</source>
          p.
          <fpage>65</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Snover</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dorr</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwartz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Micciulla</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Makhoul</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A study of translation edit rate with targeted human annotation</article-title>
          .
          <source>In: Proceedings of Association for Machine Translation in the Americas</source>
          . vol.
          <volume>200</volume>
          . Citeseer
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention is all you need</article-title>
          .
          <source>CoRR</source>
          abs/1706.03762 (
          <year>2017</year>
          ), http://arxiv.org/abs/1706.03762
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>