Treex – an open-source framework for natural language processing*

Zdeněk Žabokrtský

Charles University in Prague, Institute of Formal and Applied Linguistics
Malostranské náměstí 25, 118 00 Prague, Czech Republic
zabokrtsky@ufal.mff.cuni.cz
WWW home page: http://ufal.mff.cuni.cz/~zabokrtsky

Abstract. The present paper describes Treex (formerly TectoMT), a multi-purpose open-source framework for developing Natural Language Processing applications. It facilitates development by exploiting a wide range of software modules already integrated in Treex, such as tools for sentence segmentation, tokenization, morphological analysis, part-of-speech tagging, shallow and deep syntactic parsing, named entity recognition, anaphora resolution, sentence synthesis, word-level alignment of parallel corpora, and other tasks. The most elaborate application of Treex is an English-Czech machine translation system with transfer on the deep syntactic (tectogrammatical) layer. Besides research, Treex is used for teaching purposes and helps students to implement morphological and syntactic analyzers of foreign languages in a very short time.

* The presented research is supported by the grants MSM0021620838 and by the European Commission's 7FP grant agreement n° 231720 (EuroMatrix Plus). We would like to thank Martin Popel for useful comments on the paper.

1 Introduction

Natural Language Processing (NLP) is a multidisciplinary field combining computer science, mathematics and linguistics, whose main aim is to allow computers to work with information expressed in human (natural) language.

The history of NLP goes back to the 1950s. Early NLP systems were based on hand-written rules founded on linguistic intuitions. However, roughly two decades ago the growing availability of language data (especially textual corpora) and the increasing capabilities of computer systems led to a revolution in NLP: the field became dominated by data-driven approaches, often based on probabilistic modeling and machine learning. In such a data-driven scenario, the role of human experts shifted from designing rules to (i) preparing training data enriched with linguistically relevant information (usually by manual annotation), (ii) choosing an adequate probabilistic model and proposing features (various indicators potentially useful for making the desired predictions), and (iii) specifying an objective (evaluation) function. Optimization of the decision process (such as searching for optimal feature weights and other model parameters) is then entirely left to the learning algorithm.

Recent developments in NLP show that another paradigm shift might be approaching with unsupervised and semi-supervised algorithms, which are able to learn from data without hand-made annotations. However, such algorithms require considerably more complex models, and for most NLP tasks they have not yet outperformed supervised solutions based on hand-annotated data.

Nowadays, researched NLP tasks range from relatively simple ones (like sentence segmentation and language identification), through tasks which already need a higher level of abstraction (such as morphological analysis, part-of-speech tagging, parsing, named entity recognition, coreference resolution, word sense disambiguation, sentiment analysis, natural language generation), to highly complex systems (machine translation, automatic summarization, or question answering). The importance of (and demand for) such tasks increases along with the rapidly growing amount of textual information available on the Internet.

Many NLP applications exploit several NLP modules chained in a pipeline (such as a sentence segmenter and a part-of-speech tagger prior to a parser). However, if state-of-the-art solutions created by different authors – often written in different programming languages, with different interfaces, using different data formats and encodings – are to be used, significant effort must be invested into integrating the tools. Even if these issues are only of a technical nature, in real research they constitute one of the limiting factors for building more complex NLP applications.

We try to eliminate such problems by introducing a common NLP framework that integrates a number of NLP tools and provides them with unified object-oriented interfaces, which hide the technical issues from the developer of a larger application. The framework's architecture seems viable – tens of researchers and students have already contributed to the system, and the framework has been used for a number of research tasks carried out at the Institute of Formal and Applied Linguistics as well as at some other research institutions. The most complex application implemented within the framework is English-Czech machine translation. The framework is called Treex.1

The remainder of the paper is structured as follows. Section 2 overviews related work that had to be taken into account when developing such a framework. Section 3 presents the main design decisions Treex is built on. English-Czech machine translation implemented in Treex is described in Section 4, while other Treex applications are mentioned in Section 5, which also concludes.

1 The framework was originally called TectoMT when its development started in autumn 2005 [23], because one of the sources of motivation for building it was the development of a machine translation (MT) system using the tectogrammatical (deep-syntactic) sentence representation as the transfer medium. However, MT is by far not the only application of the framework. As the name seemed rather discouraging to those NLP developers whose research interests overlapped with neither tectogrammatics nor MT, TectoMT was rebranded to Treex in spring 2011. To avoid confusion, the name Treex is used throughout the whole text, even when it refers to the more distant history.

2 Related work

2.1 Theoretical background

Natural language is an immensely complicated phenomenon. Modeling the language in its entirety would be extremely complex; therefore its description is often decomposed into several layers (levels). There is no broadly accepted consensus on details concerning the individual levels; however, the layers typically correspond roughly to the following scale: phonetics, phonology, morphology, syntax, semantics, and pragmatics.

One such stratificational hypothesis is Functional Generative Description (FGD), developed by Petr Sgall and his colleagues in Prague since the 1960s [18]. FGD was used, with certain modifications, as the theoretical framework underlying the Prague Dependency Treebank [6], a manually annotated corpus of Czech newspaper texts from the 1990s. PDT in version 2.0 (PDT 2.0) adds three layers of linguistic annotation to the original texts:

1. morphological layer (m-layer)
   Each sentence is tokenized, and each token is annotated with a lemma (basic word form, such as nominative singular for nouns) and a morphological tag (describing morphological categories such as part of speech, number, and tense).

2. analytical layer (a-layer)
   Each sentence is represented as a shallow-syntax dependency tree (a-tree). There is a one-to-one correspondence between m-layer tokens and a-layer nodes (a-nodes). Each a-node is annotated with the so-called analytical function, which represents the type of dependency relation to its parent (i.e. its governing node).

3. tectogrammatical layer (t-layer)
   Each sentence is represented as a deep-syntax dependency tree (t-tree). Autosemantic (meaningful) words are represented as t-layer nodes (t-nodes). Information conveyed by functional words (such as auxiliary verbs, prepositions and subordinating conjunctions) is represented by attributes of t-nodes. The most important attributes of t-nodes are: the tectogrammatical lemma, the functor (which represents the semantic value of a syntactic dependency relation) and a set of grammatemes (e.g. tense, number, verb modality, deontic modality, negation).
   Edges in t-trees represent linguistic dependencies, except for several special cases, the most notable of which are paratactic structures (coordinations).

All three layers of annotation are described in annotation manuals distributed with PDT 2.0.

This annotation scheme has been adopted and further modified in Treex. One of the modifications consists in merging the m-layer and a-layer sentence representations into a single data structure.2

Treex also profits from the technology developed during the PDT project, especially from the existence of the highly customizable tree editor TrEd, which is used as the main visualization tool in Treex, and from the XML-based file format PML (Prague Markup Language, [14]), which is used as the main data format in Treex.

2 As mentioned above, their units are in a one-to-one relation anyway; merging the two structures has led to a significant reduction of time and memory requirements when processing large data, as well as to a lower burden for the eyes when browsing the structures.
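To make the merged m-/a-layer representation concrete, a single a-node can be pictured as an ordinary tree-node record carrying both morphological attributes and the analytical function. The following sketch is illustrative Python only (Treex itself is written in Perl; the class and attribute names here are invented for the example, not the real Treex API):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ANode:
    """An a-layer node: m-layer attributes (form, lemma, tag) merged with
    the analytical function (afun), mirroring Treex's merged m/a design."""
    form: str
    lemma: str
    tag: str
    afun: str = "NR"                        # "not recognized" until a parser fills it
    parent: Optional["ANode"] = None
    children: List["ANode"] = field(default_factory=list)

    def attach_to(self, parent: "ANode") -> None:
        """Make this node a dependent of `parent`."""
        self.parent = parent
        parent.children.append(self)

# A fragment of the a-tree for "The prince is shocked":
is_     = ANode("is", "be", "VBZ", afun="Pred")        # governing predicate
prince  = ANode("prince", "prince", "NN", afun="Sb")   # subject
the     = ANode("The", "the", "DT", afun="AuxA")
shocked = ANode("shocked", "shock", "VBN")             # afun left as "NR"
prince.attach_to(is_)
the.attach_to(prince)
shocked.attach_to(is_)

print([child.form for child in is_.children])          # ['prince', 'shocked']
```

The one-to-one m-/a-correspondence is what makes it possible to keep the lemma and tag directly on the dependency node instead of in a separate token list.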
2.2 Other NLP frameworks

Treex is not the only existing general NLP framework. We are aware of the following other frameworks (a more detailed comparison can be found in [15]):

– ETAP-3 [1] is a C/C++ closed-source NLP framework for English-Russian and Russian-English translation, developed at the Russian Academy of Sciences.
– GATE (Java, LGPL) is one of the most widely used NLP frameworks, with an integrated graphical user interface. It is being developed at the University of Sheffield [4].
– Apache OpenNLP (Java, LGPL)3 is an organizational center for open-source NLP projects.
– WebLicht4 is a Service Oriented Architecture for building annotated German text corpora.
– Apertium [20] is a free/open-source machine translation platform with shallow transfer.

3 http://opennlp.sourceforge.net
4 http://weblicht.sfs.uni-tuebingen.de/englisch/index.shtml

In our opinion, none of these frameworks seems feasible (or mature enough) for experiments on MT based on deep-syntactic dependency transfer. The only exception is ETAP-3, whose theoretical assumptions are similar to those of Treex (its dependency-based stratificational background theory, called Meaning-Text Theory [13], bears several resemblances to FGD); however, it is not an open-source project.

2.3 Contemporary machine translation

MT is a notoriously hard problem, and it is studied by a broad research field nowadays: every year there are several conferences, workshops and tutorials dedicated to it (or even to its subfields). It goes beyond the scope of this work even to mention all the contemporary approaches to MT, but several elaborate surveys of current approaches are already available to the reader elsewhere, e.g. in [10].

A distinction is usually made between two MT paradigms: rule-based MT (RBMT) and statistical MT (SMT). Rule-based MT systems depend on the availability of linguistic knowledge (such as grammar rules and dictionaries), whereas statistical MT systems require human-translated parallel text, from which they extract the translation knowledge automatically. One of the representatives of the first group is the already mentioned system ETAP-3. Nowadays, the most popular representatives of the second group are phrase-based systems (in which the term 'phrase' stands simply for a sequence of words, not necessarily corresponding to phrases in constituent syntax), e.g. [8], derived from the IBM models [3].

Even if phrase-based systems have more or less dominated the field in recent years, their translation quality is still far from perfect. Therefore we believe it makes sense to investigate alternative approaches as well.

MT implemented in Treex lies somewhere between the two main paradigms. As in RBMT, the sentence representations used in Treex are linguistically interpretable. However, the most important decisions during the translation process are made by statistical models, as in SMT, not by rules.

3 Treex architecture overview

3.1 Basic design decisions

The architecture of Treex is based on the following decisions:

– Treex is primarily developed on Linux. However, platform-independent solutions are sought wherever possible.
– The main programming language of Treex is Perl. However, a number of tools written in other languages have been integrated into Treex (after providing them with a Perl wrapper).
– Linguistic interpretability – data structures representing natural language sentences in Treex must be understandable by a human (so that e.g. translation errors can be traced back to their source). Comfortable visualization of the data structures is supported.
– Modularity – NLP tools in Treex are designed so that they are easily reusable for various tasks (not only for MT).
– Rules-vs-statistics neutrality – the Treex architecture is neutral with respect to the rules vs. statistics opposition (rule-based as well as statistical solutions are combined).
– Massive data – Treex must be capable of processing large data (such as millions of sentence pairs in parallel corpora), which implies that distributed processing must be supported.
– Language universality – ideally, Treex should be easily extendable to any natural language.
– Data interchange support – XML is used as the main storage format in Treex, but Treex must be able to work with a number of other data formats used in NLP.

3.2 Data structure units

In Treex, the representation of a text in a natural language is structured as follows:

– Document. A Treex document is the smallest independently storable unit. A document represents a piece of text (or several parallel pieces of text in the case of multilingual data) and its linguistic representations. A document contains an ordered sequence of bundles.
– Bundle. A bundle corresponds to a sentence (or a tuple of sentences in the case of parallel data) and its linguistic representations. A bundle contains a set of zones.
– Zone. Each language (languages are distinguished using ISO 639-2 codes in Treex) can have one or more zones in a bundle.5 Each zone corresponds to one particular sentence and at most one tree for each layer of linguistic description.

5 Having more zones per language is useful e.g. for comparing machine translation with a reference translation, or translation outputs from several systems. Moreover, it highly simplifies the processing of parallel corpora, or comparisons of alternative implementations of a certain task (such as different dependency parsers).
– Tree. All sentence representations in Treex have the shape of an oriented tree.6 At this moment there are four types of trees: (1) a-trees – morphological and surface-dependency (analytical) trees, (2) t-trees – tectogrammatical trees, (3) p-trees – phrase-structure (constituency) trees, and (4) n-trees – trees of named entities.
– Node. Each node contains (is labeled by) a set of attributes (name-value pairs).
– Attribute. Some node attributes are universal (such as the identifier), but most of them are specific to a certain layer. The set of attribute names and their values for a node on a particular layer is declared using the Treex PML schema.7 Attribute values can be further structured.

6 However, tree-crossing edges such as anaphora links in a dependency tree can be represented too (as node attributes).
7 There are also "wild" attributes allowed, which can store any Perl data structure without its prior declaration by PML. However, such undeclared attributes should serve only for tentative or rapid-development purposes, as they cannot be validated.

Of course, there are also many other types of data structures used by the individual integrated modules (such as dictionary lists, weight vectors and other trained parameters, etc.), but they are usually hidden behind module interfaces and no uniform structure is required for them.

3.3 Processing units

There are two basic levels of processing units in Treex:

– Block. Blocks are the smallest processing units independently applicable to a document.
– Scenario. Scenarios are sequences of blocks. When a scenario is applied to a document, the blocks from the sequence are applied to the document one after another.

(a) Simple Treex scenario:

Util::SetGlobal language=en   # do everything in the English zone
Read::Text                    # read a text from STDIN
W2A::Segment                  # segment it into sentences
W2A::Tokenize                 # divide sentences into words
W2A::EN::TagMorce             # morphological tagging
W2A::EN::Lemmatize            # lemmatization (basic word forms)
W2A::EN::ParseMST             # dependency parsing
W2A::EN::SetAfunAuxCPCoord    # fill analytical functions
W2A::EN::SetAfun              # fill analytical functions
Write::CoNLLX                 # print trees in the CoNLL-X format
Write::Treex                  # store trees into an XML file

(b) Input text example:

When the prince mentions the rose, the geographer explains that he does not record roses, calling them "ephemeral". The prince is shocked and hurt by this revelation. The geographer recommends that he visit the Earth.

(c) Fragment of the printed output (simplified; columns: ID, form, lemma, tag, head):

1   The         the         DT   2
2   prince      prince      NN   3
3   is          be          VBZ  0
4   shocked     shock       VBN  5
5   and         and         CC   3
6   hurt        hurt        VBN  5
7   by          by          IN   5
8   this        this        DT   9
9   revelation  revelation  NN   7
10  .           .           .    3

(d) A-tree visualization in TrEd: the a-tree of the sentence "The prince is shocked and hurt by this revelation.", with each a-node labeled by its word form, analytical function (Pred, Sb, AuxP, ...) and POS tag.

Fig. 1. A simple scenario for the morphological and surface-syntactic analysis of English texts. Generated trees are printed in the CoNLL-X format, which is a simple line-oriented format for representing dependency trees.
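The data units of Section 3.2 and the block/scenario mechanism fit together naturally: a scenario is function composition over a shared document that is modified in place. The following Python sketch is illustrative only (the real Treex::Core is a Perl API; all class and method names here are invented for the example):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Zone:                       # one language's view of one sentence
    language: str                 # ISO 639-2 code
    sentence: str = ""
    tokens: List[str] = field(default_factory=list)

@dataclass
class Bundle:                     # a sentence (tuple) with its per-language zones
    zones: Dict[str, Zone] = field(default_factory=dict)

@dataclass
class Document:                   # smallest independently storable unit
    bundles: List[Bundle] = field(default_factory=list)

class Block:
    """Smallest processing unit; transforms a document in place."""
    def process_document(self, doc: Document) -> None:
        raise NotImplementedError

class Tokenize(Block):
    def process_document(self, doc: Document) -> None:
        for bundle in doc.bundles:
            for zone in bundle.zones.values():
                zone.tokens = zone.sentence.split()   # naive whitespace tokenizer

class Scenario:
    """An ordered sequence of blocks, applied one after another."""
    def __init__(self, blocks: List[Block]):
        self.blocks = blocks
    def apply_to(self, doc: Document) -> None:
        for block in self.blocks:
            block.process_document(doc)

doc = Document()
bundle = Bundle()
bundle.zones["en"] = Zone("en", "The prince is shocked")
doc.bundles.append(bundle)
Scenario([Tokenize()]).apply_to(doc)
print(bundle.zones["en"].tokens)    # ['The', 'prince', 'is', 'shocked']
```

Keeping the document in memory between blocks is the design choice the paper motivates below: Unix-style pipes would force serializing deeply structured linguistic data at every block boundary.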
A block can change a document's content "in place"8 via a predefined object-oriented interface. One can distinguish several broad categories of blocks:

– blocks for sentence analysis – blocks for tokenization, morphological tagging, parsing, anaphora resolution, etc.;
– blocks for sentence synthesis – blocks for propagating agreement categories, ordering words, inflecting word forms, adding punctuation, etc.;
– blocks for transfer – blocks for translating a component of a linguistic representation from one language to another, etc.;
– blocks for parallel texts – blocks for word alignment, etc.;
– writer and reader blocks – blocks for storing/loading Treex documents into/from files or other streams (in the PML or another format);9
– auxiliary blocks – blocks for testing, printing, etc.

8 Pipeline processing (as with Unix text-processing commands) is not feasible here, since linguistic data are deeply structured and the price of serializing the data at each boundary would be high.
9 In former versions, format converters were considered tools separate from scenarios. However, providing the converters with the uniform block interface allows data to be read/written directly within a scenario, which is not only more elegant but also more efficient (intermediate serialization and storage can be skipped).

If possible, we try to implement blocks in a language-independent way. However, many blocks will remain language-specific (for instance, a block for moving clitics in Czech clauses can hardly be reused for any other language).

There are large differences in the complexity of blocks. Some blocks contain just a few simple rules (such as regular expressions for sentence segmentation), while other blocks are Perl wrappers for quite complex probabilistic models resulting from several years of research (such as blocks for parsing).

As for block granularity, there are no widely agreed conventions for decomposing large NLP applications.10 We only follow general recommendations for system modularization. A piece of functionality should be performed by a separate block if it has well-defined input and output states of the Treex data structures, if it can be reused in more applications, and/or if it can be (at least potentially) replaced by some other solution.

10 For instance, some taggers provide both the morphological tag and the lemma for each word form, while other taggers must be followed by a subsequent lemmatizer in order to achieve the same functionality.

4 English-Czech machine translation in Treex

The translation scenario implemented in Treex consists of three steps, described in the following sections: (1) analysis of the input sentences up to the tectogrammatical layer of abstraction, (2) transfer of the abstract representation to the target language, and (3) synthesis (generation) of sentences in the target language. See the example in Figure 2.

4.1 Analysis

The analysis step can be decomposed into three phases corresponding to morphological, analytical and tectogrammatical analysis.

In the morphological phase, the text to be translated is segmented into sentences, and each sentence is tokenized (segmented into words and punctuation marks). Tokens are tagged with part of speech and other morphological categories by the Morce tagger [19], and lemmatized.

In the analytical phase, each sentence is parsed using the dependency parser [12] based on the Maximum Spanning Tree algorithm, which results in an analytical tree for each sentence. Tree nodes are labeled with analytical functions (such as Sb for subject, Pred for predicate, and Adv for adverbial).

Then the analytical trees are converted to tectogrammatical trees. Each autosemantic word with its associated functional words is collapsed into a single tectogrammatical node, labeled with a lemma, a functor (semantic role), a formeme,11 and the semantically indispensable morphological categories (such as tense with verbs and number with nouns, but not number with verbs, as there it is only imposed by subject-predicate agreement). Coreference of pronouns is also resolved, and tectogrammatical nodes are enriched with information on named entities (such as the distinction between location, person and organization) resulting from the Stanford Named Entity Recognizer [5].

11 Formemes specify how tectogrammatical nodes are realized in the surface sentence shape. For instance, n:subj stands for a semantic noun in the subject position, n:for+X for a semantic noun with the preposition for, v:because+fin for a semantic verb in a subordinating clause introduced by the conjunction because, and adj:attr for a semantic adjective in attributive position. Formemes do not constitute a genuine tectogrammatical component, as they are not oriented semantically (but rather morphologically and syntactically). However, they have been added to t-trees in Treex because they facilitate the transfer.
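The formeme strings just illustrated have a regular make-up – a syntactic part of speech, an optional function word, and a surface form – so they are easy to decompose programmatically. The decomposition and function below are my own illustrative assumption in Python, not an official Treex utility:

```python
def parse_formeme(formeme: str):
    """Split a formeme string into (syntactic POS, function word, form).

    Examples from the paper: "n:subj", "n:for+X", "v:because+fin", "adj:attr".
    The three-way decomposition is an assumption made for this sketch.
    """
    pos, _, rest = formeme.partition(":")      # "n:for+X" -> "n", "for+X"
    if "+" in rest:
        function_word, _, form = rest.partition("+")
    else:
        function_word, form = None, rest       # no function word involved
    return pos, function_word, form

print(parse_formeme("n:for+X"))        # ('n', 'for', 'X')
print(parse_formeme("v:because+fin"))  # ('v', 'because', 'fin')
print(parse_formeme("adj:attr"))       # ('adj', None, 'attr')
```

Such a decomposition makes it easy, for example, to back off from a full formeme to its syntactic part of speech when a translation model has no counts for the full string.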
Fig. 2. The analysis-transfer-synthesis translation scenario in Treex, applied to the English sentence "However, this very week, he tried to find refuge in Brazil." and leading to the Czech translation "Přesto se tento právě týden snažil najít útočiště v Brazílii.". Thick edges indicate functional and autosemantic a-nodes to be merged.

4.2 Transfer

The transfer phase follows; its most difficult part consists in labeling the tree with target-language lemmas and formemes. Changes of the tree topology and of other attributes12 are required relatively infrequently.

Our model for choosing the right target-language lemmas and formemes is inspired by the noisy channel model, which is the standard approach in contemporary SMT and which combines a translation model with a language model of the target language. In other words, one should not rely only on information about how faithfully the meaning is transferred by some translation equivalent; an additional model can also be used which estimates how well that translation equivalent fits the surrounding context.13

Unlike mainstream SMT, tectogrammatical transfer applies this idea not to linear structures but to trees. The translation model estimates the probability of a source-target lemma pair, while the tree language model estimates the probability of a lemma given its parent. The globally optimal tree labelling is then revealed by the tree-modified Viterbi algorithm [22].

Originally, we estimated the translation model simply by using pair frequencies extracted from English-Czech parallel data. A significant improvement was reached after replacing this model by a maximum entropy model, in which we employed a wide range of features resulting from the source-side analysis. The weights were optimized using training data extracted from the CzEng parallel treebank [2], which contains roughly 6 million English-Czech pairs of analyzed and aligned sentences.

12 For instance, the number of a noun must be changed to plural if the selected target Czech lemma is a plurale tantum. Similarly, the verb tense must be predicted if an English infinitive or gerund verb form is translated to a finite verb form.
13 This corresponds to the intuition that translating into one's native language is easier for a human than translating into a foreign language.
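The combination of the two models can be illustrated on a single tree edge: for each target-lemma candidate, multiply the translation-model score by the tree-model score and take the argmax. (The real system jointly optimizes the labelling of the whole tree with the tree-modified Viterbi algorithm [22]; this Python sketch is not the Treex implementation, and all probabilities are invented for illustration.)

```python
# Toy models; every probability below is made up for the example.
p_trans = {  # translation model: P(target lemma | source lemma)
    ("refuge", "útočiště"): 0.6,
    ("refuge", "azyl"): 0.4,
}
p_tree = {   # tree language model: P(child lemma | parent lemma) in the target tree
    ("najít", "útočiště"): 0.05,
    ("najít", "azyl"): 0.01,
}

def best_lemma(src_lemma: str, target_parent: str, candidates):
    """Noisy-channel-style choice for one node given its (already chosen) parent."""
    return max(
        candidates,
        key=lambda cand: p_trans.get((src_lemma, cand), 0.0)
                       * p_tree.get((target_parent, cand), 0.0),
    )

# 0.6 * 0.05 beats 0.4 * 0.01, so the contextually better lemma wins:
print(best_lemma("refuge", "najít", ["útočiště", "azyl"]))  # útočiště
```

Scoring lemma pairs along tree edges rather than word n-grams along the surface string is what distinguishes this setup from phrase-based SMT.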
4.3 Synthesis

Finally, the surface sentence shape is synthesized from the tectogrammatical tree, which is basically the reverse of tectogrammatical analysis: adding punctuation and functional words, spreading morphological categories according to grammatical agreement, performing inflection (using the Czech morphology database [7]), arranging word order, etc.

4.4 Evaluating translation quality

There are two general methods for evaluating the translation quality of MT system outputs: (1) the quality can be judged by humans (either using a set of criteria such as grammaticality and intelligibility, or relatively, by comparing the outputs of different MT systems), or (2) the quality can be estimated by automatic metrics, which usually measure some form of string-wise overlap of an MT system's output with one or more reference (human-made) translations.

Fig. 3. Tectogrammatical transfer implemented as a Hidden Markov Tree Model.

Both types of evaluation are used regularly during the development of our MT system. Automatic metrics are used after any change of the translation scenario, as they are cheap and fast to perform. Large-scale evaluations by volunteer judges are organized annually as a shared task of the Workshop on Statistical Machine Translation.14 The performance of the tectogrammatical translation increases every year in both measures, and it already outperforms some commercial as well as academic systems. Actually, it is the participation in this shared task (a competition, in other words) that provides the strongest motivational momentum for the Treex developers.

14 http://www.statmt.org/wmt11/

5 Final remarks and conclusions

Even if tectogrammatical translation is considered the main application of Treex, Treex has been used for a number of other research purposes as well:

– other MT-related tasks – Treex has been used for developing alternative MT quality measures in [9], and for improving the outputs of other MT systems by grammatical post-processing in [11];
– building linguistic data resources – Treex has been employed in the development of resources such as the Prague Czech-English Dependency Treebank [21], the Czech-English parallel corpus CzEng [2], and the Tamil Dependency Treebank [16];
– linguistic data processing services for research carried out in other institutions, such as data analyses for prosody prediction for the University of West Bohemia [17].

Treex significantly simplifies code sharing across individual research projects in our institute. There are around 15 programmers (postgraduate students and researchers) who have significantly contributed to the development of Treex in recent years; four of them are responsible for developing the central components of the framework infrastructure, called Treex Core.

Last but not least, Treex is used for teaching purposes in our institute. Undergraduate students are supposed to develop their own modules for the morphological and syntactic analysis of foreign languages of their choice. Not only does the existence of Treex enable the students to make very fast progress, but their contributions are also accumulated in the Treex Subversion repository, which enlarges the repertory of languages treatable by Treex.15

15 There are modules for more than 20 languages available in Treex now.

There are two main challenges for the Treex developers now. The first challenge is to continue improving the tectogrammatical translation quality by better exploitation of the training data. The second challenge is to widen the community of Treex users and developers by distributing the majority of Treex modules via CPAN (the Comprehensive Perl Archive Network), which is a broadly respected repository of Perl modules.

When thinking about the more distant future of MT and NLP in general, an exciting question arises about the future relationship of linguistically interpretable approaches (like that of Treex) and purely statistical phrase-based approaches. The promising results of [11], which uses Treex for improving the output of a phrase-based system and thus reaches state-of-the-art MT quality in English-Czech MT, show that combinations of both approaches might be viable.
References

1. I. Boguslavsky, L. Iomdin, and V. Sizov: Multilinguality in ETAP-3: reuse of lexical resources. In G. Sérasset (ed.), COLING 2004 Multilingual Linguistic Resources, pp. 1–8, Geneva, Switzerland, August 2004. COLING.
2. O. Bojar, M. Janíček, Z. Žabokrtský, P. Češka, and P. Beňa: CzEng 0.7: parallel corpus with community-supplied translations. In Proceedings of the Sixth International Language Resources and Evaluation, Marrakech, Morocco, 2008. ELRA.
3. P. E. Brown, V. J. Della Pietra, S. A. Della Pietra, and R. L. Mercer: The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993.
4. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan: GATE: an architecture for development of robust HLT applications. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 2002.
5. J. R. Finkel, T. Grenager, and C. Manning: Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL '05: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pp. 363–370, Morristown, NJ, USA, 2005. Association for Computational Linguistics.
6. J. Hajič, E. Hajičová, J. Panevová, P. Sgall, P. Pajas, J. Štěpánek, J. Havelka, and M. Mikulová: Prague Dependency Treebank 2.0. Linguistic Data Consortium, LDC Catalog No. LDC2006T01, Philadelphia, 2006.
7. J. Hajič: Disambiguation of rich inflection – computational morphology of Czech. Charles University – The Karolinum Press, Prague, 2004.
8. P. Koehn et al.: Moses: open source toolkit for statistical machine translation. In Proceedings of the Demo and Poster Sessions, 45th Annual Meeting of the ACL, pp. 177–180, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
9. K. Kos and O. Bojar: Evaluation of machine translation metrics for Czech as the target language. Prague Bulletin of Mathematical Linguistics, 92, 2009.
10. A. Lopez: A survey of statistical machine translation. Technical Report, Institute for Advanced Computer Studies, University of Maryland, 2007.
11. D. Mareček, R. Rosa, P. Galuščáková, and O. Bojar: Two-step translation with grammatical post-processing. In Proceedings of the 6th Workshop on Statistical Machine Translation, pp. 426–432, Edinburgh, Scotland, 2011. Association for Computational Linguistics.
12. R. McDonald, F. Pereira, K. Ribarov, and J. Hajič: Non-projective dependency parsing using spanning tree algorithms. In Proceedings of the Human Language Technology Conference and the Conference on Empirical Methods in Natural Language Processing, pp. 523–530, Vancouver, BC, Canada, 2005.
13. I. A. Mel'čuk: Dependency syntax: theory and practice. State University of New York Press, 1988.
14. P. Pajas and J. Štěpánek: Recent advances in a feature-rich framework for treebank annotation. In Proceedings of the 22nd International Conference on Computational Linguistics, volume 2, pp. 673–680, Manchester, UK, 2008.
15. M. Popel and Z. Žabokrtský: TectoMT: modular NLP framework. In Proceedings of the 7th International Conference on Advances in Natural Language Processing (IceTAL 2010), volume 6233 of LNCS, pp. 293–304, Berlin / Heidelberg, 2010. Springer.
16. L. Ramasamy and Z. Žabokrtský: Tamil dependency parsing: results using rule-based and corpus-based approaches. In Proceedings of the 12th International Conference CICLing 2011, volume 6608 of Lecture Notes in Computer Science, pp. 82–95, Berlin / Heidelberg, 2011. Springer.
17. J. Romportl: Zvyšování přirozenosti strojově vytvářené řeči v oblasti suprasegmentálních zvukových jevů [Increasing the naturalness of machine-generated speech in the domain of suprasegmental sound phenomena]. PhD Thesis, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic, 2008.
18. P. Sgall, E. Hajičová, and J. Panevová: The meaning of the sentence in its semantic and pragmatic aspects. D. Reidel Publishing Company, Dordrecht, 1986.
19. D. Spoustová, J. Hajič, J. Votrubec, P. Krbec, and P. Květoň: The best of two worlds: cooperation of statistical and rule-based taggers for Czech. In Proceedings of the Workshop on Balto-Slavonic Natural Language Processing, ACL 2007, pp. 67–74, Praha, 2007.
20. F. M. Tyers, F. Sánchez-Martínez, S. Ortiz-Rojas, and M. L. Forcada: Free/open-source resources in the Apertium platform for machine translation research and development. Prague Bulletin of Mathematical Linguistics, 93, 2010, 67–76.
21. J. Šindlerová, L. Mladová, J. Toman, and S. Cinková: An application of the PDT scheme to a parallel treebank. In Proceedings of the 6th International Workshop on Treebanks and Linguistic Theories (TLT 2007), pp. 163–174, Bergen, Norway, 2007.
22. Z. Žabokrtský and M. Popel: Hidden Markov tree model in dependency-based machine translation. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics, 2009.
23. Z. Žabokrtský, J. Ptáček, and P. Pajas: TectoMT: highly modular MT system with tectogrammatics used as transfer layer. In Proceedings of the 3rd Workshop on Statistical Machine Translation, ACL, 2008.