ITAT 2016 Proceedings, CEUR Workshop Proceedings Vol. 1649, pp. 74–79
http://ceur-ws.org/Vol-1649, Series ISSN 1613-0073, © 2016 R. Rosa



                                                 Czechizator – Čechizátor

                                                               Rudolf Rosa

                                     Charles University in Prague, Faculty of Mathematics and Physics,
                                               Institute of Formal and Applied Linguistics,
                                         Malostranské náměstí 25, 118 00 Prague, Czech Republic
                                                        rosa@ufal.mff.cuni.cz

Abstract: We present a lexicon-less rule-based machine translation system from English to Czech, based on a very limited number of transformation rules. Its core is a novel translation module, implemented as a component of the TectoMT translation system, which depends massively on the extensive pipeline of linguistic preprocessing and postprocessing within TectoMT. Its scope is naturally limited, but for specific texts, e.g. from the scientific or marketing domain, it occasionally produces sensible results.

Prezentujeme lexikon-lesový rule-bazovaný systém machín translace od Engliše Čecha, který bazoval na verově limitované amountu rulů transformace. Jeho kor je novelový modul translace, implementovalo jako komponent systému translace tektomtu a dependuje masivně na extensivní pipelínu lingvistické preprocesování a postprocesovat v Tektomtu. Jeho skop je naturálně limitovaná, ale pro specifické texty z například scientifické nebo marketování doménu okasionálně producuje sensibilní resulty.


1    Introduction and Motivation

In this work, we present Czechizator, a lexicon-less rule-based machine translation system from English to Czech.

A lexicon-less approach to machine translation has already been successfully applied to closely related languages – e.g. the Czech-Slovak machine translation system Česílko [3, 4] featured a rule-based lexicon-less transformation component for handling OOV (out-of-vocabulary) words. For transliteration, which can be thought of as a low-level translation, rule-based systems are also common. However, in this work, we decided to tackle a harder problem: to use a similar approach for a full translation between a pair of only weakly related languages, namely English and Czech.

While we believe that it is impossible to achieve high-quality or even reasonable-quality general-domain translation without a large lexicon, we attempt to investigate to what degree this is possible if the domain is somewhat special. Specifically, we target the domain of scientific texts (or, more precisely, abstracts of scientific papers), which contain a large number of terms that tend to be rather similar even across more distant languages. In this way, we operate on a pair of languages which are typologically different but lexically close. Moreover, we crucially rely on the strong linguistic abstractions provided by the TectoMT machine translation system [15], which is designed to operate on a deep layer of language representation where typological differences of languages become quite transparent, as the meaning itself, rather than the form, is captured. Abstracting away from both lexical and typological differences in this way, a smallish set of rules and heuristics should be sufficient to obtain a competitive machine translation system.

While the main focus of our work is to test the degree to which the aforementioned hypothesis is valid, our work has practical implications as well. The number of terms used in scientific texts is enormous, many of them being rare in parallel corpora or even newly created and thus bound to constitute OOV items for machine translation systems. However, as there seems to be some regularity in the way that English terms are adapted in Czech, it should be possible to use a lexicon-less system as an additional component in a standard machine translation system to handle OOVs. It may also be beneficial in scenarios where a low-quality but light-weight translation system is preferred over a full-fledged but resource-heavy system.1

Another use case is machine-aided translation of scientific paper abstracts, as the Czechizator output should often be a good starting point for creating the final translation by post-editing.

Before explaining the approach we used to implement the translation model, we present a set of three sample outputs of Czechizator, applied to abstracts of two scientific papers (Table 1, Table 2) and one marketing text2 (Table 3). Also, as an additional example, the abstract of this paper is provided both in English and in its Czechization.

1 However, TectoMT itself is rather resource-heavy even when the lexical models are omitted, so even though the component that we implemented is very light-weight, the complete system that it relies on is not – using the Czechizator model instead of the base models in TectoMT only brings a 15% speedup and a 40% RAM cut, which is probably not worth the quality drop in any realistic scenario.
2 The text was obtained from https://www.accenture.com/cz-en/strategy-index



Source: Chimera is a machine translation system that combines the TectoMT deep-linguistic core with Moses phrase-based MT system. For English–Czech pair it also uses the Depfix post-correction system. All the components run on Unix/Linux platform and are open source (available from CPAN Perl repository and the LINDAT/CLARIN repository). The main website is https://ufal.mff.cuni.cz/tectomt. The development is currently supported by the QTLeap 7th FP project (http://qtleap.eu).

Czechization: Chimera je systém machín translace, který kombinuje díp-lingvistické kor tektomtu z fraze-bazovaného MT systému mozesu. Pro Engliše – čechová pér také uzuje systém post-korekce Depfix. Všechny komponenty runují v Unix / platformu Linuxu a jsou open-ová sourc (avélabilní z CPAN Perla repositorie a LINDAT / CLARIN repositorie). Hlavní webová stránka je https://ufal.mff.cuni.cz/tectomt. Development kurentně je suport FP projektem 7th qtlípu (http://qtleap.eu).

Reference translation: Chimera systém strojového překladu, který kombinuje hluboce lingvistické jádro TectoMT s frázovým strojovým překladačem Moses. Pro anglicko-český překlad také používá post-editovací systém Depfix. Všechny komponenty běží na platformě Unix/Linux a jsou open-source (dostupné z Perlového repozitáře CPAN a repozitáře LINDAT/CLARIN). Hlavní webová stránka je https://ufal.mff.cuni.cz/tectomt. Vývoj je momentálně podporován projektem QTLeap ze 7th FP (http://qtleap.eu).

Table 1: Abstract of a scientific paper [7], its Czechization, and a reference translation by its author.



Source: We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

Czechization: Propozujeme 2 novelová architektury modelů, že komputují kontinuální reprezentace vektorů vordů od verově largových setů dat. Kvalita těchto reprezentací je mísur ve vord similarita tasku a resulty jsou kompar s previálně nejgůdovšími, performují, techniky, kteří bazovali na diferentových typech neurálních netvorků. Observujeme largové improvementy akurace v muchově lovovší komputacionální kosti, tj. takuje méně než Daie, aby se lírnovalo hajové vektory vordu kvality z dat vordů 1.6 bilionu, která setovala. Furtermorově šovujeme, že tyto vektory providují state-of-te-artovou performance na našem testu, který setoval, že mísurují syntaktické a semantické vord similarity.

Table 2: Abstract of a scientific paper [6] and its Czechization.



Source: Accenture Operations combines technology that digitizes and automates business processes, unlocks actionable insights, and delivers everything-as-a-service with our team’s deep industry, functional and technical expertise. So you can confidently chart your course to consuming your core business services on demand, accelerate innovation and speed to market. Welcome to the "as-a-service" business revolution.
Czechization: Operacions acenturu kombinuje technologii, která digitizuje a automuje procesy businosti, unlokuje akcionabilní insajty a deliveruje everyting-as-a-servicová s funkcionální a technickou expertizou dípové industrie našeho tímu. Tak konfidentně můžete chartovat svůj kours, konsumuje vaše service businosti kor na demandu, aceleratové inovaci a spídu marketu. Velkomujte „as-a-service“ revoluce businosti.

Source: Accenture Strategy shapes our clients’ future, combining deep business insight with the understanding of how technology will impact industry and business models. Our focus on issues related to digital disruption, redefining competitiveness, operating and business models as well as the workforce of the future helps our clients find future value and growth in a digital world.
Czechization: Strategie acenturu šapuje futur našich klientů, kombinuje dípovou insajt businosti s understandováním, jak technologie impaktuje a industrie businosti modely. Náš fokus na isu, kteří relovali s digitálním disrupcí, kteří redefinují kompetitivnost, operatování a businost modely, i vorkforc futur helpuje, naši klienti findují futurovou valu a grovt v digitální vorldu.

Source: Whether focused on strategies for business, technology or operations, Accenture Strategy has the people, skills and experience to effectively shape client value. We offer highly objective points of view on C-suite themes, with an emphasis on business and technology, leveraging our deep industry experience. That’s high performance, delivered.
Czechization: Vhetr fokusoval na strategie pro businost, technologie nebo operací strategii acenturu, má peoply, skily a experience, aby efektivně šapovali valu klienta. Oferujeme hajně objektivní pointy vievu na k-suitových temech s emfasí na businost a technologii, leveraguje naši dípovou experience industrie. Které je hajová performanc, který deliveroval.

Table 3: A marketing text from Accenture.com and its Czechization.

2    Approach

2.1    TectoMT

TectoMT [15, 1] is a highly modular linguistically oriented machine translation system, featuring a deep-linguistic three-step processing pipeline of analysis, transfer, and synthesis. TectoMT is implemented in Treex [8, 13], using a representation of language based on the Functional Generative Description [11].

The first step in the translation pipeline is to perform a linguistic analysis of each source (input) sentence up to the t-layer, obtaining a deep-syntactic representation of the sentence (t-tree). On the t-layer, each full (autosemantic) word is represented by a t-node with a t-lemma and a set of linguistic t-attributes (such as functor, formeme, number, gender, deep tense) that capture the function of the word. Inflections and auxiliary words are not explicitly represented, but their functions are captured by the attributes of the t-nodes.

Each source t-tree is then isomorphically transferred to a target t-tree. In the standard TectoMT setup, the t-lemma of each t-node is translated by models that have been trained on large parallel data. The other t-attributes are then transferred by a pipeline featuring both rule-based and machine-learned steps.

Finally, the target sentence is synthesized from the t-tree. This step relies heavily on a morphological generator [12], which is able to generate a word form based on the word lemma and a set of morphological feature values. For the highly flective Czech language, this is a challenging task; even though we employ a state-of-the-art generator, it is sometimes unable to generate the requested word form, especially when the lemma is unknown to the generator.

TectoMT can (and does by default) use a weighted interpolation of multiple translation models to generate translation candidates [10]. This makes it easy to replace or complement the existing models with new models, such as our Czechizator model.
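The interpolation mechanism can be illustrated with a small sketch (in Python rather than the actual Treex/Perl code; the model contents, weights, and the example lemma are invented for illustration): each model contributes its candidate distribution for a source t-lemma, and the weighted sum of the distributions determines the ranking of the translation candidates.

from collections import defaultdict

def interpolate(models, weights, src_lemma):
    """Mix candidate distributions of several translation models.

    Each model maps a source t-lemma to a dict {target t-lemma: probability};
    the result is a list of candidates sorted by the interpolated score.
    """
    scores = defaultdict(float)
    for model, weight in zip(models, weights):
        for tgt_lemma, prob in model.get(src_lemma, {}).items():
            scores[tgt_lemma] += weight * prob
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy models: a statistical model trained on parallel data and a
# rule-based (Czechizator-like) model proposing a single candidate.
statistical_model = {"translation": {"překlad": 0.7, "translace": 0.3}}
rulebased_model = {"translation": {"translace": 1.0}}

print(interpolate([statistical_model, rulebased_model], [0.8, 0.2], "translation"))
# [('překlad', 0.56), ('translace', 0.44)]

In this toy setting, adding the low-weight rule-based model boosts its candidate, but the top-ranked candidate of the statistical model stays in place.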
2.2    Czechizator translation model

The Czechizator translation model attempts to Czechize each English t-lemma, unless it is marked as a named entity. To Czechize the lemma, it applies the following resources, which we manually constructed:

• a shortlist of 36 lemma translations, focusing on words that we believe to be auxiliaries rather than full words (and thus presumably should be dropped by the t-analysis and represented by t-attributes, but in fact constitute t-lemmas),3 and on cardinal numbers (which presumably should be converted to a language-independent representation by TectoMT analysis, but are not),

• a set of 43 transformation rules based on the semantic part of speech of the t-node and the ending of its t-lemma (noun rules are provided as an example in Table 4), and

• a transliteration table, consisting of 33 transliteration rules.4

3 be, have, do, and, or, but, therefore, that, who, which, what, why, how, each, other, then, also, so, as, all, this, these, many, only, main, mainly
4 As an example, we list several of the transliteration rules here: th→t, ti→ci, ck→k, ph→f, sh→š, ch→ch, cz→č, qu→kv, igh→aj, gh→ch, gu→gv, dg→dž, w→v, c→k.

English ending    Czechized ending
-sion             -se
-tion             -ce
-ison             -ace
-ness             -nost
-ise              -iza
-ize              -iza
-em               -ém
-er               -r
-ty               -ta
-is               -e
-in               -ín
-ine              -ín
-ing              -ování
-cy               -ce
-y                -ie

Table 4: A list of ending-based transformations of noun lemmas.

The transformations are generally applied sequentially, but forking is possible at some places, and so multiple alternative Czechizations may be generated; TectoMT uses a Hidden Markov Tree Model [14] (instead of a language model) to eventually select the best combination of t-lemmas (and other t-attributes). However, as the Czechizations are usually OOVs for the HMTM, typically the first candidate gets selected. The target semantic part-of-speech identifier is also generated, based on the source semantic part of speech and the t-lemma ending; this is important for the subsequent synthesis steps.

It should be noted that the current implementation of Czechizator is rather a proof-of-concept than an attempt at a professional translation model. If one were to follow this research path in the future, it would presumably be more appropriate to learn the regular transformations from parallel (or comparable) corpora, extracting pairs of similar words that are translations of each other and generalizing the transformation necessary to convert one into the other, as well as learning to identify the cases in which a transformation should be applied. Similar methods could be used as were applied e.g. in the semi-supervised morphological generator Flect [2].

Czechizator uses the standard TectoMT translation model interface, and can thus be easily and seamlessly plugged into the standard TectoMT pipeline, either replacing or complementing the base lexical translation models.
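To make the rule types above concrete, the following sketch re-implements the noun ending rules of Table 4 and the sample transliteration rules of footnote 4 (in Python; the actual Czechizator is a Treex/Perl module, and the lemma shortlist, the other part-of-speech rule sets, forking, and the exact rule ordering are omitted or simplified here):

NOUN_ENDINGS = [  # (English ending, Czechized ending) pairs from Table 4
    ("sion", "se"), ("tion", "ce"), ("ison", "ace"), ("ness", "nost"),
    ("ise", "iza"), ("ize", "iza"), ("ine", "ín"), ("ing", "ování"),
    ("em", "ém"), ("er", "r"), ("ty", "ta"), ("is", "e"), ("in", "ín"),
    ("cy", "ce"), ("y", "ie"),
]

TRANSLIT = [  # sample rules from footnote 4; multi-character rules come first
    ("igh", "aj"), ("th", "t"), ("ti", "ci"), ("ck", "k"), ("ph", "f"),
    ("sh", "š"), ("ch", "ch"), ("cz", "č"), ("qu", "kv"), ("gh", "ch"),
    ("gu", "gv"), ("dg", "dž"), ("w", "v"), ("c", "k"),
]

def transliterate(text):
    out, i = [], 0
    while i < len(text):
        for src, tgt in TRANSLIT:
            if text.startswith(src, i):  # first matching rule wins
                out.append(tgt)
                i += len(src)
                break
        else:  # no rule matched: keep the character as it is
            out.append(text[i])
            i += 1
    return "".join(out)

def czechize_noun(lemma):
    # The longest matching English ending wins; the remaining stem is
    # transliterated and the Czech ending is attached untouched.
    for eng, cze in sorted(NOUN_ENDINGS, key=lambda rule: -len(rule[0])):
        if lemma.endswith(eng):
            return transliterate(lemma[:-len(eng)]) + cze
    return transliterate(lemma)

for word in ("translation", "similarity", "anaphora"):
    print(word, "->", czechize_noun(word))
# translation -> translace, similarity -> similarita, anaphora -> anafora

The three demo words reproduce Czechizations that also appear elsewhere in this paper (“translace”, “similarita”, “anafora”); for many other words the simplified rule application above diverges from the actual system.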

2.3    Surrogate lemma inflection

As Czechizator generates many weird and/or non-existent lemmas, it is an expected consequence that the morphological generator is often unable to inflect these lemmas. For this reason, we enriched the word form generation component of TectoMT5 with a last-resort inflection step.6 If the morphogenerator is unable to generate the inflection, we use a set of simple ending-based rules to find a surrogate lemma, as listed in Table 5,7 inflect the surrogate lemma, strip its ending, and apply it to the target lemma. We focus on endings generated by the Czechizator translation module, but we aimed for high coverage, and successfully managed to employ the last-resort inflector even in the base TectoMT translation.

5 https://github.com/ufal/treex/blob/master/lib/Treex/Block/T2A/CS/GenerateWordforms.pm
6 https://github.com/ufal/treex/commit/363d1b18f7140e0cb687ed8deebc4ac4a1051080
7 Although there exists a set of commonly used lemmas to represent the basic Czech paradigms, we sometimes use a different lemma – to avoid unnecessary ambiguity, and to simplify the application of the ending to the target lemma (we avoid surrogate lemmas that exhibit changes on the root during inflection).

Ending                                       Surrogate lemma
-ovat                                        kupovat
-ání                                         plavání
-í                                           jarní
-ý                                           mladý
-o                                           město
-e                                           růže
-a                                           žena
-ost                                         kost
-ě                                           mladě
-h, k, r, d, t, n, b, f, l, m, p, s, v, z    svrab
-ž, š, ř, č, c, j, ď, ť, ň                   muž

Table 5: List of surrogate lemmas for given endings. The matched ending gets deleted from the target lemma, obtaining the target pseudo-stem, except for the last two cases (matching hard or soft final consonants), where even the final consonant is part of the stem.

For example, if one is to inflect the pseudo-adjective “largový” (the Czechization of “large”) for the feminine accusative, we replace it with the surrogate lemma (“mladý”) that corresponds to its ending (“-ý”), obtain its feminine accusative inflection from the morphogenerator (“mladou”), strip the matched ending from both of the lemmas, obtaining pseudo-stems (“largov”, “mlad”), strip the surrogate pseudo-stem (“mlad”) from the surrogate inflection (“mladou”) to obtain the inflection ending (“-ou”), and join the ending with the target pseudo-stem (“largov”) to obtain the target inflection (“largovou”).
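The last-resort procedure from the previous paragraph can be sketched as follows (a Python illustration only; the actual implementation lives in the Treex block T2A::CS::GenerateWordforms referenced in footnote 5, the morphological generator is mocked here by a tiny dictionary, and only a few surrogate endings from Table 5 are shown):

SURROGATES = [("ovat", "kupovat"), ("ání", "plavání"), ("ý", "mladý"),
              ("ost", "kost"), ("a", "žena"), ("o", "město")]

# Stand-in for the morphological generator: it can inflect the common
# surrogate lemmas, but not the weird Czechized ones.
MOCK_MORPHO = {("mladý", "feminine accusative singular"): "mladou"}

def inflect(lemma, features):
    form = MOCK_MORPHO.get((lemma, features))
    if form is not None:  # the generator succeeded, nothing else to do
        return form
    # Last-resort step: find a surrogate lemma by the lemma's ending.
    for ending, surrogate in sorted(SURROGATES, key=lambda r: -len(r[0])):
        if lemma.endswith(ending):
            surrogate_form = MOCK_MORPHO[(surrogate, features)]
            target_stem = lemma[:-len(ending)]                        # "largov"
            surrogate_stem = surrogate[:-len(ending)]                 # "mlad"
            inflection_ending = surrogate_form[len(surrogate_stem):]  # "ou"
            return target_stem + inflection_ending                    # "largovou"
    return lemma  # give up and return the uninflected lemma

print(inflect("largový", "feminine accusative singular"))  # largovou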
3    Evaluation

3.1    Dataset

To automatically evaluate the translation quality by standard methods, we collected a small dataset, consisting of Czech and English abstracts of scientific papers. Specifically, we collected the abstracts of papers of authors from the Institute of Formal and Applied Linguistics at Charles University in Prague, who are obliged to provide both a Czech and an English abstract for each of their publications. These are then stored in the institute’s database of publications, Biblio,8 and can be accessed through a regularly generated XML dump.9

The collected parallel corpus, aligned on the document level, i.e. on individual abstracts, contains 1,556 pairs of abstracts, totalling 121,386 words on the English side and 76,812 words on the Czech side.10 We did not perform any filtering of the data, apart from filtering out incomplete entries (missing the Czech or the English abstract) and replacing newlines and tabulators by spaces (solely for technical reasons). The dataset is publicly available [9].

8 http://ufal.mff.cuni.cz/biblio/
9 https://svn.ms.mff.cuni.cz/trac/biblio/browser/trunk/xmldump
10 The difference in the sizes is partially caused by the fact that usually, the English abstract is the full original, and its Czech translation is often shortened considerably by the authors.
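For illustration, the dataset preprocessing described in Section 3.1 amounts to roughly the following (a hedged sketch: the element names of the Biblio XML dump are hypothetical stand-ins, as the real schema is not reproduced here):

import re
import xml.etree.ElementTree as ET

def load_abstract_pairs(xml_path):
    pairs = []
    for record in ET.parse(xml_path).getroot().iter("record"):  # assumed tag
        en = record.findtext("abstract_en")                     # assumed field
        cs = record.findtext("abstract_cs")                     # assumed field
        if not en or not cs:  # drop incomplete entries
            continue
        # Replace newlines and tabulators by spaces (technical reasons only).
        en = re.sub(r"[\n\t]+", " ", en).strip()
        cs = re.sub(r"[\n\t]+", " ", cs).strip()
        pairs.append((en, cs))
    return pairs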

3.2    Evaluation and discussion

Automatic evaluation with BLEU and NIST was performed with the MTrics tool [5]. We evaluated several candidate translations: the untranslated English source texts, TectoMT with no lexical model, TectoMT with the Czechizator model, TectoMT with an interpolation of its base lexical models (the default setup of TectoMT), and TectoMT with an interpolation of Czechizator and the base lexical models.

Setup                      BLEU    NIST
Untranslated source        3.41    1.13
No model                   2.85    1.62
Czechizator                3.01    2.08
Base TectoMT               8.75    3.62
Base + Czechizator         8.33    3.57

Table 6: Automatic evaluation scores on the ÚFAL abstracts dataset [9].
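The scores in Table 6 were produced with the MTrics tool; as a rough illustration of how comparable corpus-level scores can be obtained, the following sketch uses the off-the-shelf sacrebleu library instead (the file names are hypothetical, and NIST would be computed analogously, e.g. with NLTK's nist_score module):

import sacrebleu

with open("czechizator.out", encoding="utf-8") as hyp_file, \
     open("reference.cs", encoding="utf-8") as ref_file:
    hypotheses = [line.strip() for line in hyp_file]
    references = [line.strip() for line in ref_file]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # one reference stream
print(f"BLEU = {bleu.score:.2f}")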
While the translation quality of the Czechizator outputs is clearly well below that of the base TectoMT system, the results show that Czechizator does manage to produce some useful output – its scores are significantly higher than those of TectoMT with no lexical translation model. This shows that lexicon-less translation is somewhat possible in our setting, although on average it is far from competitive – at least with the current version of Czechizator, which is a rather basic proof-of-concept implementation, lacking numerous simple and obvious improvements that could easily be performed and would presumably lead to further significant increases in translation quality. However, as with many rule-based systems for natural language processing, the code complexity and especially the amount of manual tuning necessary to push the performance further and further is likely to grow very quickly.

Manual inspection of the outputs (see also the examples in the beginning of this paper) showed that the chosen domain is quite suitable for lexicon-less translation, but the proportion of autosemantic words that cannot be simply transformed from English to Czech without a lexicon is still rather high – high enough to make many of the sentences barely comprehensible. We therefore acknowledge that at least a small lexicon would be necessary to obtain reasonable translations for most sentences. On the other hand, we observed many phrases, and occasionally even whole sentences, whose Czechizations were of a rather high quality and understandable to Czech speakers with minor or no difficulties. We thus find our approach interesting and potentially promising, although we believe that the amount of work needed to bring the system to a competitive level of translation quality would be by several orders of magnitude larger than that spent on creating the current system (which took less than one person-week). Still, we expect that for the given domain, developing such a rule-based system would constitute many times less work than building an open-domain system.

Thanks to the deep analysis and generation provided by TectoMT, the Czechizations tend to be rather grammatical, with words correctly inflected, even if nonsensical. Unfortunately, even grammatical errors occur rather frequently – some words are not inflected at all, some violate morphological agreement (e.g. in gender, case or number), etc. This can be explained by realizing that the complex TectoMT pipeline consists of many subcomponents, each operating with a certain precision, occasionally producing erroneous analyses. The most crucial stage seems to be syntactic parsing, which has been reported to have only approximately 85% accuracy, i.e. roughly 15% of dependency relations are assigned incorrectly; these typically manifest themselves as agreement errors in the Czechization output.

Evaluation of the main potential use case of Czechizator, i.e. complementing the base TectoMT translation models for OOVs (the Base + Czechizator setup), brought mixed results. There is a small deterioration in the automatic scores, and subsequent manual inspection showed that Czechizator can target OOVs only semi-successfully. It can offer a Czechization of any OOV term, which is often correct (e.g. “anafora” for English “anaphora”, “interlingvální” for “interlingual”, “hypotaktický” for “hypotactical”, or “cirkumfixální” for “circumfixal”), but sometimes the Czechization is not correct (e.g. “businost” for “business”, “hands-onový” for “hands-on”, or “kolokaty” for “collocations”). In many cases, a Czechization of the term is simply not used in practice, and is less understandable to the reader than the original English form (e.g. “kejnotový” for “keynote”, “veb-pagová” for “web-page”, “part-of-spích” for “part-of-speech”, or “kros-langvaž” for “cross-language”). Czechizator also often generates a form that is plausible but rarely or never used, although one may think that the Czechized form may become the standard Czech translation in future, and is mostly understandable to readers (e.g. “tríbank” for “treebank”, “tvít” for “tweet”, or “kros-lingvální” for “cross-lingual” – here the base models generated a rather nonsensical “lingual kříže”). Unfortunately, it also often Czechizes named entities, even though we explicitly avoid them if they are marked by the analysis; this seems to be primarily a shortcoming (or unsuitability for this task) of the named entity recognizer used [12], which seems to favour precision over recall. Still, Czechizator can sometimes provide a better translation than the base models, even in cases where the term is not an OOV – such as the word “post-editing”, which the base models translate into a confusing “poúprava”, while Czechizator provides an acceptable translation “post-editování”.11

In general, we believe that, if appropriate attention is paid to the identified issues, such as named entity avoidance, Czechizator has the potential to usefully complement the base TectoMT translation models, especially in handling OOV terms.


4    Conclusion

We implemented a rule-based lexicon-less English-Czech translation model into TectoMT, called Czechizator. The model is based on a set of simple rules, mainly following regularities in the adoption of English terms into Czech. Czechizator has been especially designed for and applied to the domain of abstracts of scientific papers, but also provides interesting results for texts from the marketing domain.

We automatically evaluated Czechizator on a collection of abstracts of computational linguistics papers, showing inferior but promising results in comparison with the base TectoMT models; the highest observed potential is in employing Czechizator as an additional TectoMT translation model for out-of-vocabulary items.

Czechizator is released as an open-source Treex module in the main Treex repository on GitHub,12 and is also made available as an online demo.13


Acknowledgments

This research was supported by the grants GAUK 1572314 and SVV 260 333. This work has been using language resources and tools developed, stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).

11 Other such examples include the Czechization “reimplementace” for “reimplementation” instead of “znovuprovádění”, or “post-nominální” for “post-nominal” instead of “pojmenovitý”.
12 https://github.com/ufal/treex/blob/master/lib/Treex/Tool/TranslationModel/Rulebased/Model.pm
13 http://ufallab.ms.mff.cuni.cz/~rosa/czechizator/input.php

References

[1] Ondřej Dušek, Luís Gomes, Michal Novák, Martin Popel, and Rudolf Rosa. New language pairs in TectoMT. In Proceedings of the 10th Workshop on Machine Translation, pages 98–104, Stroudsburg, PA, USA, 2015. Association for Computational Linguistics.

[2] Ondřej Dušek and Filip Jurčíček. Training a natural language generator from unaligned data. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 451–461, Stroudsburg, PA, USA, 2015. Association for Computational Linguistics.

[3] Jan Hajič, Vladislav Kuboň, and Jan Hric. Česílko – an MT system for closely related languages. In ACL2000, Tutorial Abstracts and Demonstration Notes, pages 7–8. ACL, ISBN 1-55860-730-7, 2000.

[4] Petr Homola and Vladislav Kuboň. Česílko 2.0, 2008.

[5] Kamil Kos. Adaptation of new machine translation metrics for Czech. Bachelor’s thesis, Charles University in Prague, 2008.

[6] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[7] Martin Popel, Roman Sudarikov, Ondřej Bojar, Rudolf Rosa, and Jan Hajič. TectoMT – a deep-linguistic core of the combined Chimera MT system. Baltic Journal of Modern Computing, 4(2):377–377, 2016.

[8] Martin Popel and Zdeněk Žabokrtský. TectoMT: Modular NLP framework. In Hrafn Loftsson, Eirikur Rögnvaldsson, and Sigrun Helgadottir, editors, Lecture Notes in Artificial Intelligence, Proceedings of the 7th International Conference on Advances in Natural Language Processing (IceTAL 2010), volume 6233 of Lecture Notes in Computer Science, pages 293–304, Berlin / Heidelberg, 2010. Iceland Centre for Language Technology (ICLT), Springer.

[9] Rudolf Rosa. Czech and English abstracts of ÚFAL papers, 2016. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University in Prague.

[10] Rudolf Rosa, Ondřej Dušek, Michal Novák, and Martin Popel. Translation model interpolation for domain adaptation in TectoMT. In Jan Hajič and António Branco, editors, Proceedings of the 1st Deep Machine Translation Workshop, pages 89–96, Praha, Czechia, 2015. ÚFAL MFF UK.

[11] Petr Sgall, Eva Hajičová, and Jarmila Panevová. The meaning of the sentence in its semantic and pragmatic aspects. Springer, 1986.

[12] Jana Straková, Milan Straka, and Jan Hajič. Open-source tools for morphology, lemmatization, POS tagging and named entity recognition. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[13] Zdeněk Žabokrtský. Treex – an open-source framework for natural language processing. In Markéta Lopatková, editor, ITAT, volume 788, pages 7–14, Košice, Slovakia, 2011. Univerzita Pavla Jozefa Šafárika v Košiciach.

[14] Zdeněk Žabokrtský and Martin Popel. Hidden Markov tree model in dependency-based machine translation. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 145–148, Suntec, Singapore, 2009. Association for Computational Linguistics.

[15] Zdeněk Žabokrtský, Jan Ptáček, and Petr Pajas. TectoMT: Highly modular MT system with tectogrammatics used as transfer layer. In ACL 2008 WMT: Proceedings of the Third Workshop on Statistical Machine Translation, pages 167–170, Columbus, OH, USA, 2008. Association for Computational Linguistics.