<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Series</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
<article-title>Czechizator - Čechizátor</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rudolf Rosa</string-name>
          <email>rosa@ufal.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics</institution>
          ,
<addr-line>Malostranské náměstí 25, 118 00 Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1649</volume>
      <fpage>74</fpage>
      <lpage>79</lpage>
      <abstract>
<p>We present a lexicon-less rule-based machine translation system from English to Czech, based on a very limited number of transformation rules. Its core is a novel translation module, implemented as a component of the TectoMT translation system, which depends massively on the extensive pipeline of linguistic preprocessing and postprocessing within TectoMT. Its scope is naturally limited, but for specific texts, e.g. from the scientific or marketing domain, it occasionally produces sensible results. Prezentujeme lexikon-lesový rule-bazovaný systém machín translace od Engliše Čecha, který bazoval na verově limitované amountu rulů transformace. Jeho kor je novelový modul translace, implementovalo jako komponent systému translace tektomtu a dependuje masivně na extensivní pipelínu lingvistické preprocesování a postprocesovat v Tektomtu. Jeho skop je naturálně limitovaná, ale pro specifické texty z například scientifické nebo marketování doménu okasionálně producuje sensibilní resulty.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>In this work, we present Czechizator, a lexicon-less rule-based machine translation system from English to Czech.</p>
      <p>
        A lexicon-less approach to machine translation has already been successfully applied to closely related languages – e.g. the Czech-Slovak machine translation system Česílko [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] featured a rule-based lexicon-less transformation component for handling OOV (out-of-vocabulary) words. For transliteration, which can be thought of as a low-level translation, rule-based systems are also common. However, in this work, we decided to tackle a harder problem: to use a similar approach for a full translation between a pair of only weakly related languages, namely English and Czech.
      </p>
      <p>
        While we believe that it is impossible to achieve high-quality or even reasonable-quality general-domain translation without a large lexicon, we attempt to investigate to what degree this is possible if the domain is somewhat special. Specifically, we target the domain of scientific texts (or, more precisely, abstracts of scientific papers), which contain a large number of terms that tend to be rather similar even across more distant languages. In this way, we operate on a pair of languages which are typologically different but lexically close. Moreover, we crucially rely on the strong linguistic abstractions provided by the TectoMT machine translation system [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which is designed to operate on a deep layer of language representation where typological differences between languages become quite transparent, as the meaning itself, rather than the form, is captured. Abstracting away from both lexical and typological differences in this way, a smallish set of rules and heuristics should be sufficient to obtain a competitive machine translation system.
      </p>
      <p>While the main focus of our work is to test the degree
to which the aforementioned hypothesis is valid, our work
has practical implications as well. The number of terms
used in scientific texts is enormous, many of them being
rare in parallel corpora or even newly created and thus
bound to constitute OOV items for machine translation
systems. However, as there seems to be some
regularity in the way that English terms are adapted in Czech, it
should be possible to use a lexicon-less system as an
additional component in a standard machine translation system
to handle OOVs. It may also be beneficial in scenarios where a low-quality but light-weight translation system is preferred over a full-fledged but resource-heavy system.1</p>
<p>Another use-case is machine-aided translation of scientific paper abstracts, as the Czechizator output should often
be a good starting point for creating the final translation by
post-editing.</p>
      <p>Before explaining the approach we used to implement
the translation model, we present a set of three sample
outputs of Czechizator, applied to abstracts of two scientific
papers (Table 1, Table 2), and one marketing text,2
(Table 3). Also, as an additional example, the abstract of this
paper is provided both in English and in its Czechization.
</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <sec id="sec-2-1">
        <title>TectoMT</title>
        <p>
          TectoMT [
          <xref ref-type="bibr" rid="ref1 ref15">15, 1</xref>
          ] is a highly modular, linguistically oriented machine translation system, featuring a deep-linguistic three-step processing pipeline of analysis, transfer, and synthesis.
        </p>
        <p>1However, TectoMT itself is rather resource-heavy even when the lexical models are omitted, so even though the component that we implemented is very light-weight, the complete system that it relies on is not – using the Czechizator model instead of the base models in TectoMT only brings a 15% speedup and a 40% RAM cut, which is probably not worth the quality drop in any realistic scenario.</p>
        <p>2The text was obtained from https://www.accenture.com/
cz-en/strategy-index</p>
        <sec id="sec-2-1-1">
          <title>Source</title>
<p>Chimera is a machine translation system that combines the TectoMT deep-linguistic core with Moses phrase-based MT system. For English–Czech pair it also uses the Depfix post-correction system. All the components run on Unix/Linux platform and are open source (available from CPAN Perl repository and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu).</p>
          <p>Czechization: Chimera je systém machín translace, který kombinuje díp-lingvistické kor tektomtu z fraze-bazovaného MT systému mozesu. Pro Engliše – čechová pér také uzuje systém post-korekce Depfix. Všechny komponenty runují v Unix / platformu Linuxu a jsou openová sourc (avélabilní z CPAN Perla repositorie a LINDAT / CLARIN repositorie). Hlavní webová stránka je https://ufal.mff.cuni.cz/tectomt. Development kurentně je suport FP projektem 7th qtlípu (http://qtleap.eu).</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Reference translation</title>
<p>Chimera systém strojového překladu, který kombinuje hluboce lingvistické jádro TectoMT s frázovým strojovým překladačem Moses. Pro anglicko-český překlad také používá post-editovací systém Depfix. Všechny komponenty běží na platformě Unix/Linux a jsou open-source (dostupné z Perlového repozitáře CPAN a repozitáře LINDAT/CLARIN). Hlavní webová stránka je https://ufal.mff.cuni.cz/tectomt. Vývoj je momentálně podporován projektem QTLeap ze 7th FP (http://qtleap.eu).</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Source</title>
          <p>We propose two novel model architectures for computing
continuous vector representations of words from very large data
sets. The quality of these representations is measured in a word
similarity task, and the results are compared to the previously
best performing techniques based on different types of neural
networks. We observe large improvements in accuracy at much
lower computational cost, i.e. it takes less than a day to learn
high quality word vectors from a 1.6 billion words data set.
Furthermore, we show that these vectors provide state-of-the-art
performance on our test set for measuring syntactic and
semantic word similarities.</p>
        </sec>
        <sec id="sec-2-1-4">
          <title>Czechization</title>
<p>Propozujeme 2 novelová architektury modelů, že komputují kontinuální reprezentace vektorů vordů od verově largových setů dat. Kvalita těchto reprezentací je mísur ve vord similarita tasku a resulty jsou kompar s previálně nejgůdovšími, performují, techniky, kteří bazovali na diferentových typech neurálních netvorků. Observujeme largové improvementy akurace v muchově lovovší komputacionální kosti, tj. takuje méně než Daie, aby se lírnovalo hajové vektory vordu kvality z dat vordů 1.6 bilionu, která setovala. Furtermorově šovujeme, že tyto vektory providují state-of-te-artovou performance na našem testu, který setoval, že mísurují syntaktické a semantické vord similarity.</p>
        </sec>
        <sec id="sec-2-1-5">
          <title>Source</title>
          <p>Accenture Operations combines technology that digitizes and
automates business processes, unlocks actionable insights, and
delivers everything-as-a-service with our team’s deep industry,
functional and technical expertise. So you can confidently chart
your course to consuming your core business services on
demand, accelerate innovation and speed to market. Welcome to
the "as-a-service" business revolution.</p>
          <p>Accenture Strategy shapes our clients’ future, combining deep
business insight with the understanding of how technology will
impact industry and business models. Our focus on issues
related to digital disruption, redefining competitiveness, operating
and business models as well as the workforce of the future helps
our clients find future value and growth in a digital world.
Whether focused on strategies for business, technology or
operations, Accenture Strategy has the people, skills and
experience to effectively shape client value. We offer highly objective
points of view on C-suite themes, with an emphasis on
business and technology, leveraging our deep industry experience.
That’s high performance, delivered.</p>
        </sec>
        <sec id="sec-2-1-6">
          <title>Czechization</title>
<p>Operacions acenturu kombinuje technologii, která digitizuje a automuje procesy businosti, unlokuje akcionabilní insajty a deliveruje everyting-as-a-servicová s funkcionální a technickou expertizou dípové industrie našeho tímu. Tak konfidentně můžete chartovat svůj kours, konsumuje vaše service businosti kor na demandu, aceleratové inovaci a spídu marketu. Velkomujte „as-a-service“ revoluce businosti.</p>
          <p>
            Strategie acenturu šapuje futur našich klientů, kombinuje dípovou insajt businosti s understandováním, jak technologie impaktuje a industrie businosti modely. Náš fokus na isu, kteří relovali s digitálním disrupcí, kteří redefinují kompetitivnost, operatování a businost modely, i vorkforc futur helpuje, naši klienti findují futurovou valu a grovt v digitální vorldu. Vhetr fokusoval na strategie pro businost, technologie nebo operací strategii acenturu, má peoply, skily a experience, aby efektivně šapovali valu klienta. Oferujeme hajně objektivní pointy vievu na k-suitových temech s emfasí na businost a technologii, leveraguje naši dípovou experience industrie. Které je hajová performanc, který deliveroval.
          </p>
          <p>
            TectoMT is implemented in Treex [
            <xref ref-type="bibr" rid="ref13 ref8">8, 13</xref>
            ], using a representation of language based on the Functional Generative Description [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
<p>The first step in the translation pipeline is to perform a linguistic analysis of each source (input) sentence up to the t-layer, obtaining a deep-syntactic representation of the sentence (t-tree). On the t-layer, each full (autosemantic) word is represented by a t-node with a t-lemma and a set of linguistic t-attributes (such as functor, formeme, number, gender, deep tense) that capture the function of the word. Inflections and auxiliary words are not explicitly represented, but their functions are captured by the attributes of the t-nodes.</p>
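<p>As an illustration of the t-layer idea described above, the following sketch models a t-node carrying a t-lemma plus function-capturing attributes. This is our toy Python rendering, not part of TectoMT (which is implemented in Perl), and the attribute names only loosely follow the paper:</p>

```python
# Toy rendering of a t-layer node: auxiliaries and inflection become
# attribute values, not nodes of their own.
from dataclasses import dataclass, field

@dataclass
class TNode:
    t_lemma: str
    functor: str            # semantic role, e.g. "PRED", "PAT"
    formeme: str            # morphosyntactic form, e.g. "v:fin", "n:obj"
    grammatemes: dict = field(default_factory=dict)  # number, gender, tense...
    children: list = field(default_factory=list)

# "He did not see the dogs": tense, negation and plural are attributes;
# "did", "not" and "the" get no nodes of their own.
tree = TNode("see", "PRED", "v:fin",
             {"tense": "past", "negation": "neg1"},
             [TNode("#PersPron", "ACT", "n:subj", {"person": "3"}),
              TNode("dog", "PAT", "n:obj", {"number": "pl"})])
```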
          <p>Each source t-tree is then isomorphically transferred
to a target t-tree. In the standard TectoMT setup, the t-lemma of each t-node is translated by models that have
been trained on large parallel data. The other t-attributes
are then transferred by a pipeline featuring both rule-based
and machine-learned steps.</p>
          <p>
Finally, the target sentence is synthesized from the t-tree. This step relies heavily on a morphological generator
[
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], which is able to generate a word form based on the
word lemma and a set of morphological feature values. For
the highly inflective Czech language, this is a challenging
task; even though we employ a state-of-the-art generator, it
is sometimes unable to generate the requested word form,
especially when the lemma is unknown to the generator.
          </p>
          <p>
            TectoMT can (and does by default) use a weighted
interpolation of multiple translation models to generate
translation candidates [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. This makes it easy to replace or
complement the existing models with new models, such
as our Czechizator model.
          </p>
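<p>The weighted interpolation of candidate distributions can be sketched as follows. This is a hypothetical minimal rendering (the model contents and weights below are invented for illustration), not the actual TectoMT implementation:</p>

```python
# Sketch of interpolating several translation models: each model maps a
# source t-lemma to a probability distribution over target t-lemmas, and
# the interpolated model combines them with fixed weights.

def interpolate(models, weights, src_lemma):
    """Combine per-model candidate distributions into one ranked list."""
    combined = {}
    for model, weight in zip(models, weights):
        for target, prob in model.get(src_lemma, {}).items():
            combined[target] = combined.get(target, 0.0) + weight * prob
    # Highest combined score first.
    return sorted(combined.items(), key=lambda kv: -kv[1])

# Invented example models: a base lexical model and a rule-based one.
base_model = {"system": {"systém": 0.9, "soustava": 0.1}}
czechizator = {"system": {"systém": 1.0}, "czechizator": {"čechizátor": 1.0}}

candidates = interpolate([base_model, czechizator], [0.7, 0.3], "system")
# "systém" scores 0.7*0.9 + 0.3*1.0 = 0.93 and ranks first.
```

A new model such as Czechizator can thus be plugged in simply by appending it (with a weight) to the model list.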
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Czechizator translation model</title>
<p>The Czechizator translation model attempts to Czechize each English t-lemma, unless it is marked as a named entity. To Czechize the lemma, it applies the following resources, which we manually constructed:
• a shortlist of 36 lemma translations, focusing on words that we believe to be auxiliaries rather than full words (and thus presumably should be dropped by the t-analysis and represented by t-attributes, but in fact constitute t-lemmas),3 and on cardinal numbers (which presumably should be converted to a language-independent representation by TectoMT analysis, but are not),
• a set of 43 transformation rules based on the semantic part of speech of the t-node and the ending of its t-lemma (noun rules are provided as an example in Table 4), and
• a transliteration table, consisting of 33 transliteration rules.4</p>
        <p>Table 4 (English ending → Czechized ending): -sion → -se, -tion → -ce, -ison → -ace, -ness → -nost, -ise → -iza, -ize → -iza, -em → -ém, -er → -r, -ty → -ta, -is → -e, -in → -ín, -ine → -ín, -ing → -ování, -cy → -ce, -y → -ie.</p>
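<p>To illustrate how the ending rules and the transliteration table combine, here is a minimal sketch using a few of the listed rules. The real Czechizator is a Perl Treex module, so this Python rendering is ours and simplified (it applies at most one ending rule, then a greedy first-match transliteration of the remaining stem):</p>

```python
# A few of the noun-ending rules from Table 4, longest endings first.
ENDING_RULES = [
    ("sion", "se"), ("tion", "ce"), ("ness", "nost"),
    ("ine", "ín"), ("ize", "iza"), ("ing", "ování"), ("ty", "ta"),
]
# A subset of the 33 transliteration rules; "ch" precedes "c" so that
# Czech-friendly "ch" is kept intact.
TRANSLIT_RULES = [
    ("th", "t"), ("sh", "š"), ("ch", "ch"), ("ph", "f"),
    ("qu", "kv"), ("c", "k"), ("w", "v"),
]

def transliterate(s: str) -> str:
    out, i = [], 0
    while i < len(s):
        for eng, cz in TRANSLIT_RULES:  # first matching rule wins
            if s.startswith(eng, i):
                out.append(cz)
                i += len(eng)
                break
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

def czechize(lemma: str) -> str:
    suffix = ""
    for eng, cz in ENDING_RULES:        # apply at most one ending rule
        if lemma.endswith(eng):
            lemma, suffix = lemma[: -len(eng)], cz
            break
    return transliterate(lemma) + suffix

print(czechize("translation"))  # → translace
print(czechize("machine"))      # → machín
```

On the paper's own sample output ("systém machín translace"), this sketch reproduces "translace" for "translation" and "machín" for "machine".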
        <p>
          The transformations are generally applied sequentially,
but forking is possible at some places, and so multiple
alternative Czechizations may be generated; TectoMT uses
a Hidden Markov Tree Model [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] (instead of a
language model) to eventually select the best combination
of t-lemmas (and other t-attributes). However, as the
Czechizations are usually OOVs for the HMTM, typically
the first candidate gets selected. The target semantic part-of-speech identifier is also generated, based on the source semantic part-of-speech and the t-lemma ending; this is
important for the subsequent synthesis steps.
        </p>
        <p>
          It should be noted that the current implementation of
Czechizator is a proof of concept rather than an attempt at a professional translation model. If one were to follow this research path in the future, it would presumably be more
appropriate to learn the regular transformations from
parallel (or comparable) corpora, extracting pairs of similar
words that are translations of each other and generalizing
the transformation necessary to convert one into the other,
as well as learning to identify the cases in which a
transformation should be applied. Similar methods could be used
as were applied e.g. in the semi-supervised morphological
generator Flect [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>Czechizator uses the standard TectoMT translation
model interface, and can thus be easily and seamlessly
plugged into the standard TectoMT pipeline, either
replacing or complementing the base lexical translation models.</p>
        <p>3be, have, do, and, or, but, therefore, that, who, which, what, why, how, each, other, then, also, so, as, all, this, these, many, only, main, mainly</p>
<p>4As an example, we list several of the transliteration rules here: th→t, ti→ci, ck→k, ph→f, sh→š, ch→ch, cz→č, qu→kv, igh→aj, gh→ch, gu→gv, dg→dž, w→v, c→k.</p>
        <p>Table 5 (ending → surrogate lemma): -ovat → kupovat, -ání → plavání, -í → jarní, -ý → mladý, -o → město, -e → růže, -a → žena, -ost → kost, -ě → mladě, -h/-k/-r/-d/-t/-n/-b/-f/-l/-m/-p/-s/-v/-z → svrab, -ž/-š/-ř/-č/-c/-j/-ď/-ť/-ň → muž.</p>
        <p>For this reason, we enriched the word form generation component of TectoMT5 with a last-resort inflection step.6 If the morphogenerator is unable to generate the inflection, we use a set of simple ending-based rules to find a surrogate lemma, as listed in Table 5,7 inflect the surrogate lemma, strip its ending, and apply the resulting inflection ending to the target lemma. We focus on endings generated by the Czechizator translation module, but we aimed for high coverage, and successfully managed to employ the last-resort inflector even in the base TectoMT translation.</p>
        <p>For example, if one is to inflect the pseudo-adjective
“largový” (Czechization of “large”) for the feminine
accusative, we replace it with the surrogate lemma (“mladý”)
that corresponds to its ending (“-ý”), obtain its
feminine accusative inflection from the morphogenerator
(“mladou”), strip the matched ending from both of the
lemmas, obtaining pseudo-stems (“largov”, “mlad”), strip
the surrogate pseudo-stem (“mlad”) from the surrogate
inflection (“mladou”) to obtain the inflection ending (“-ou”),
and join the ending with the target pseudo-stem (“largov”)
to obtain the target inflection (“largovou”).
</p>
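<p>The worked example above can be sketched in code. This is a toy rendering: the morphological generator is replaced by a hard-coded lookup (the real component queries the Czech morphogenerator), and the feature string "fem.acc.sg" is our invented notation:</p>

```python
# Last-resort inflection: pick a surrogate lemma by ending, inflect it,
# and transplant the inflection ending onto the unknown lemma.

SURROGATES = {"ý": "mladý", "a": "žena", "o": "město"}  # ending → surrogate

# Toy stand-in for the morphogenerator: (lemma, features) → word form.
MORPHO = {("mladý", "fem.acc.sg"): "mladou", ("žena", "fem.acc.sg"): "ženu"}

def inflect_oov(lemma: str, features: str) -> str:
    for ending, surrogate in SURROGATES.items():
        if lemma.endswith(ending):
            form = MORPHO[(surrogate, features)]        # e.g. "mladou"
            stem = lemma[: -len(ending)]                # "largov"
            surrogate_stem = surrogate[: -len(ending)]  # "mlad"
            infl_ending = form[len(surrogate_stem):]    # "-ou"
            return stem + infl_ending                   # "largovou"
    return lemma  # no rule matched; leave uninflected

print(inflect_oov("largový", "fem.acc.sg"))  # → largovou
```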
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <sec id="sec-3-1">
        <title>Dataset</title>
<p>To automatically evaluate the translation quality by standard methods, we collected a small dataset, consisting of Czech and English abstracts of scientific papers. Specifically, we collected the abstracts of papers of authors from the Institute of Formal and Applied Linguistics at Charles University in Prague, who are obliged to provide both a Czech and an English abstract for each of their publications. These are then stored in the institute’s database of publications, Biblio,8 and can be accessed through a regularly generated XML dump.9</p>
        <p>5https://github.com/ufal/treex/blob/master/lib/Treex/Block/T2A/CS/GenerateWordforms.pm</p>
        <p>6https://github.com/ufal/treex/commit/363d1b18f7140e0cb687ed8deebc4ac4a1051080</p>
        <p>7Although there exists a set of commonly used lemmas to represent the basic Czech paradigms, we sometimes use a different lemma – to avoid unnecessary ambiguity, and to simplify the application of the ending to the target lemma (we avoid surrogate lemmas that exhibit changes on the root during inflection).</p>
        <sec id="sec-3-1-1">
          <title>Setup</title>
          <p>Evaluated setups: untranslated source, no model, Czechizator, base TectoMT, and base + Czechizator.</p>
          <p>
The collected parallel corpus, aligned on the document level, i.e. on individual abstracts, contains 1,556 pairs of
abstracts, totalling 121,386 words on the English side and
76,812 words on the Czech side.10 We did not perform
any filtering of the data, apart from filtering out
incomplete entries (missing the Czech or the English abstract)
and replacing newlines and tabulators by spaces (solely for
technical reasons). The dataset is publicly available [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ].
          </p>
          <p>
            Automatic evaluation with BLEU and NIST was performed with the MTrics tool [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. We evaluated several
candidate translations: the untranslated English source
texts, TectoMT with no lexical model, TectoMT with the
Czechizator model, TectoMT with an interpolation of its
base lexical models (the default setup of TectoMT), and
TectoMT with an interpolation of Czechizator and the base
lexical models.
          </p>
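<p>As a rough illustration of what the automatic scores measure, here is a minimal single-reference, corpus-level BLEU sketch (uniform 1–4-gram weights, standard brevity penalty). The paper itself used the MTrics tool, not this code:</p>

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    matched = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matched[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            total[n - 1] += sum(h.values())
    if min(total) == 0 or min(matched) == 0:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_prec)
```

An identical hypothesis and reference score 1.0; any mismatched words lower the clipped n-gram precisions and hence the score.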
          <p>While translation quality of the Czechizator outputs is
clearly well below the base TectoMT system, the results
show that Czechizator does manage to produce some
useful output – its scores are significantly higher than that of
TectoMT with no lexical translation model. This shows
that lexicon-less translation is somewhat possible in our
setting, although on average it is far from competitive – at
least with the current version of Czechizator, which is a
rather basic proof-of-concept implementation, lacking
numerous simple and obvious improvements that could
easily be performed and would presumably lead to further
significant increases of translation quality. However, as with
many rule-based systems for natural language processing,
the code complexity and especially the amount of manual
tuning necessary to push the performance further and
further is likely to grow very quickly.</p>
          <p>8http://ufal.mff.cuni.cz/biblio/
9https://svn.ms.mff.cuni.cz/trac/biblio/browser/
trunk/xmldump</p>
          <p>10The difference in the sizes is partially caused by the fact that
usually, the English abstract is the full original, and its Czech translation is
often shortened considerably by the authors.</p>
          <p>Manual inspection of the outputs (see also the examples
in the beginning of this paper) showed that the chosen
domain is quite suitable for lexicon-less translation, but the
proportion of autosemantic words that cannot be simply
transformed from English to Czech without a lexicon is
still rather high – high enough to make many of the
sentences barely comprehensible. We therefore acknowledge
that at least a small lexicon would be necessary to obtain
reasonable translations for most sentences. On the other
hand, we observed many phrases, and occasionally even
whole sentences, whose Czechizations were of a rather
high quality and understandable to Czech speakers with
minor or no difficulties. We thus find our approach
interesting and potentially promising, although we believe
that the amount of work needed to bring the system to a
competitive level of translation quality would be by
several orders of magnitude larger than that spent on
creating the current system (which took less than one
person-week). Still, we expect that for the given domain,
developing such a rule-based system would constitute many times
less work than building an open-domain system.</p>
          <p>Thanks to the deep analysis and generation provided by
TectoMT, the Czechizations tend to be rather
grammatical, with words correctly inflected, even if nonsensical.
Unfortunately, even grammatical errors occur rather
frequently – some words are not inflected at all, some violate
morphological agreement (e.g. in gender, case or number),
etc. This can be explained by realizing that the complex
TectoMT pipeline consists of many subcomponents, each
operating with a certain precision, occasionally producing
erroneous analyses. The most crucial stage seems to be
syntactic parsing, which has been reported to have only
approximately 85% accuracy, i.e. roughly 15% of
dependency relations are assigned incorrectly; these typically
manifest themselves as agreement errors in the
Czechization output.</p>
          <p>
            Evaluation of the main potential use case of
Czechizator, i.e. complementing base TectoMT translation
models for OOVs (Base + Czechizator setup), brought mixed
results. There is a small deterioration in the automatic
scores, and subsequent manual inspection showed that
Czechizator can target OOVs only semi-successfully. It
can offer a Czechization of any OOV term, which is
often correct (e.g. “anafora” for English “anaphora”,
“interlingvální” for “interlingual”, “hypotaktický” for
“hypotactical”, or “cirkumfixální” for “circumfixal”), but
sometimes the Czechization is not correct (e.g. “businost” for
“business”, “hands-onový” for “hands-on”, or “kolokaty”
for “collocations”). In many cases, a Czechization of the
term is simply not used in practice, and is less
understandable to the reader than the original English form (e.g.
“kejnotový” for “keynote”, “veb-pagová” for “web-page”,
“part-of-spích” for “part-of-speech”, or “kros-langvaž” for
“cross-language”). Czechizator also often generates a
form that is plausible but rarely or never used, although
one may think that the Czechized form may become the
standard Czech translation in future, and is mostly
understandable to readers (e.g. “tríbank” for “treebank”, “tvít”
for “tweet”, or “kros-lingvální” for “cross-lingual” – here
the base models generated a rather nonsensical “lingual
kříže”). Unfortunately, it also often Czechizes named
entities, even though we explicitly avoid them if they are
marked by the analysis; this seems to be primarily a
shortcoming (or unsuitability for this task) of the named
entity recognizer used [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ], which seems to favour
precision over recall. Still, Czechizator can sometimes provide
a better translation than the base models, even in cases
where the term is not an OOV – such as the word
“postediting”, which the base models translate into a
confusing “poúprava”, while Czechizator provides an acceptable
translation “post-editování”.11
          </p>
          <p>In general, we believe that, if appropriate attention is
paid to the identified issues, such as the avoidance of named entities, Czechizator has the potential to usefully complement the base TectoMT translation models, especially in handling OOV terms.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
<p>We implemented Czechizator, a rule-based lexicon-less English-Czech translation model, as a component of TectoMT. The
model is based on a set of simple rules, mainly
following regularities in adoption of English terms into Czech.
Czechizator has been especially designed for and applied
to the domain of abstracts of scientific papers, but also
provides interesting results for texts from the marketing
domain.</p>
      <p>We automatically evaluated Czechizator on a collection
of abstracts of computational linguistics papers, showing
inferior but promising results in comparison with the base
TectoMT models; the highest observed potential is in
employing Czechizator as an additional TectoMT translation
model for out-of-vocabulary items.</p>
      <p>Czechizator is released as an open-source Treex module
in the main Treex repository on GitHub,12 and is also made
available as an online demo.13</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research was supported by the grants GAUK
1572314, and SVV 260 333. This work has been using
language resources and tools developed, stored and
distributed by the LINDAT/CLARIN project of the Ministry
of Education, Youth and Sports of the Czech Republic
(project LM2015071).</p>
<p>11Other such examples include the Czechization “reimplementace” for “reimplementation” instead of “znovuprovádění”, or “postnominální” for “post-nominal” instead of “pojmenovitý”.</p>
      <p>12https://github.com/ufal/treex/blob/master/lib/
Treex/Tool/TranslationModel/Rulebased/Model.pm
13http://ufallab.ms.mff.cuni.cz/~rosa/czechizator/
input.php</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
<given-names>Ondřej</given-names>
            <surname>Dušek</surname>
          </string-name>
          , Luís Gomes, Michal Novák,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Popel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Rosa</surname>
          </string-name>
          .
<article-title>New language pairs in TectoMT</article-title>
          .
          <source>In Proceedings of the 10th Workshop on Machine Translation</source>
          , pages
          <fpage>98</fpage>
          -
          <lpage>104</lpage>
          , Stroudsburg, PA, USA,
          <year>2015</year>
          .
          Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ondřej</given-names>
            <surname>Dušek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Filip</given-names>
            <surname>Jurčíček</surname>
          </string-name>
          .
          <article-title>Training a natural language generator from unaligned data</article-title>
          .
          <source>In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , pages
          <fpage>451</fpage>
          -
          <lpage>461</lpage>
          , Stroudsburg, PA, USA,
          <year>2015</year>
          .
          Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Hajič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Vladislav</given-names>
            <surname>Kuboň</surname>
          </string-name>
          , and Jan Hric.
<article-title>Česílko - an MT system for closely related languages</article-title>
          .
          <source>In ACL2000, Tutorial Abstracts and Demonstration Notes</source>
          , pages
          <fpage>7</fpage>
          -
          <lpage>8</lpage>
          . ACL,
          <source>ISBN 1-55860-730-7</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Petr</given-names>
            <surname>Homola</surname>
          </string-name>
          and
          <string-name>
            <given-names>Vladislav</given-names>
            <surname>Kuboň</surname>
          </string-name>
          .
          <source>Česílko 2.0</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kamil</given-names>
            <surname>Kos</surname>
          </string-name>
          .
          <article-title>Adaptation of new machine translation metrics for Czech</article-title>
          .
          <source>Bachelor's thesis</source>
          , Charles University in Prague,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Popel</surname>
          </string-name>
          , Roman Sudarikov, Ondřej Bojar, Rudolf Rosa, and Jan Hajič.
          <article-title>TectoMT - a deep-linguistic core of the combined chimera MT system</article-title>
          .
          <source>Baltic Journal of Modern Computing</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <fpage>377</fpage>
          -
          <lpage>377</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Popel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Zdeněk</given-names>
            <surname>Žabokrtský</surname>
          </string-name>
          .
          <article-title>TectoMT: Modular NLP framework</article-title>
          . In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors,
          <source>Proceedings of the 7th International Conference on Advances in Natural Language Processing (IceTAL 2010)</source>
          , volume
          <volume>6233</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>293</fpage>
          -
          <lpage>304</lpage>
          , Berlin / Heidelberg,
          <year>2010</year>
          .
          Iceland Centre for Language Technology (ICLT),
          <publisher-name>Springer</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Rosa</surname>
          </string-name>
          .
          <source>Czech and English abstracts of ÚFAL papers</source>
          ,
          <year>2016</year>
          . LINDAT/CLARIN digital library at Institute of Formal and Applied Linguistics, Charles University in Prague.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Rosa</surname>
          </string-name>
          , Ondřej Dušek, Michal Novák, and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Popel</surname>
          </string-name>
          .
          <article-title>Translation model interpolation for domain adaptation in TectoMT</article-title>
          . In Jan Hajič and António Branco, editors,
          <source>Proceedings of the 1st Deep Machine Translation Workshop</source>
          , pages
          <fpage>89</fpage>
          -
          <lpage>96</lpage>
          , Praha, Czechia,
          <year>2015</year>
          .
          <publisher-name>ÚFAL MFF UK</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sgall</surname>
          </string-name>
          , Eva Hajičová, and
          <string-name>
            <given-names>Jarmila</given-names>
            <surname>Panevová</surname>
          </string-name>
          .
          <article-title>The meaning of the sentence in its semantic and pragmatic aspects</article-title>
          .
          <publisher-name>Springer</publisher-name>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Jana</given-names>
            <surname>Straková</surname>
          </string-name>
          , Milan Straka, and Jan Hajič.
          <article-title>Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition</article-title>
          .
          In
          <source>Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</source>
          , pages
          <fpage>13</fpage>
          -
          <lpage>18</lpage>
          , Baltimore, Maryland,
          June
          <year>2014</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Zdeněk</given-names>
            <surname>Žabokrtský</surname>
          </string-name>
          .
          <article-title>Treex - an open-source framework for natural language processing</article-title>
          . In Markéta Lopatková, editor,
          <source>ITAT</source>
          , volume
          <volume>788</volume>
          , pages
          <fpage>7</fpage>
          -
          <lpage>14</lpage>
          , Košice, Slovakia,
          <year>2011</year>
          .
          <publisher-name>Univerzita Pavla Jozefa Šafárika v Košiciach</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Zdeněk</given-names>
            <surname>Žabokrtský</surname>
          </string-name>
          and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Popel</surname>
          </string-name>
          .
          <article-title>Hidden Markov tree model in dependency-based machine translation</article-title>
          .
          In
          <source>Proceedings of the ACL-IJCNLP 2009 Conference Short Papers</source>
          , pages
          <fpage>145</fpage>
          -
          <lpage>148</lpage>
          , Suntec, Singapore,
          <year>2009</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Zdeněk</given-names>
            <surname>Žabokrtský</surname>
          </string-name>
          , Jan Ptáček, and Petr Pajas.
          <article-title>TectoMT: Highly modular MT system with tectogrammatics used as transfer layer</article-title>
          .
          In
          <source>ACL 2008 WMT: Proceedings of the Third Workshop on Statistical Machine Translation</source>
          , pages
          <fpage>167</fpage>
          -
          <lpage>170</lpage>
          , Columbus, OH, USA,
          <year>2008</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>