                              Treex – an open-source framework
                               for natural language processing*

                                                   Zdeněk Žabokrtský

                       Charles University in Prague, Institute of Formal and Applied Linguistics
                               Malostranské náměstí 25, 118 00 Prague, Czech Republic
                                            zabokrtsky@ufal.mff.cuni.cz
                             WWW home page: http://ufal.mff.cuni.cz/~zabokrtsky

Abstract. The present paper describes Treex (formerly TectoMT), a multi-purpose
open-source framework for developing Natural Language Processing applications.
It facilitates development by exploiting a wide range of software modules
already integrated in Treex, such as tools for sentence segmentation,
tokenization, morphological analysis, part-of-speech tagging, shallow and deep
syntactic parsing, named entity recognition, anaphora resolution, sentence
synthesis, word-level alignment of parallel corpora, and other tasks. The most
elaborate application of Treex is an English-Czech machine translation system
with transfer on the deep syntactic (tectogrammatical) layer. Besides research,
Treex is used for teaching purposes and helps students to implement
morphological and syntactic analyzers of foreign languages in a very short time.


1   Introduction

Natural Language Processing (NLP) is a multidisciplinary field combining
computer science, mathematics and linguistics, whose main aim is to allow
computers to work with information expressed in human (natural) language.

The history of NLP goes back to the 1950s. Early NLP systems were based on
hand-written rules grounded in linguistic intuitions. However, roughly two
decades ago the growing availability of language data (especially textual
corpora) and the increasing capabilities of computer systems led to a
revolution in NLP: the field became dominated by data-driven approaches, often
based on probabilistic modeling and machine learning.

In such a data-driven scenario, the role of human experts shifted from
designing rules to (i) preparing training data enriched with linguistically
relevant information (usually by manual annotation), (ii) choosing an adequate
probabilistic model and proposing features (various indicators potentially
useful for making the desired predictions), and (iii) specifying an objective
(evaluation) function. Optimization of the decision process (such as searching
for optimal feature weights and other model parameters) is then entirely left
to the learning algorithm.

Recent developments in NLP show that another paradigm shift might be
approaching with unsupervised and semi-supervised algorithms, which are able
to learn from data without hand-made annotations. However, such algorithms
require considerably more complex models, and for most NLP tasks they have not
yet outperformed supervised solutions based on hand-annotated data.

Nowadays, researched NLP tasks range from relatively simple ones (like sentence
segmentation or language identification), through tasks which already need a
higher level of abstraction (such as morphological analysis, part-of-speech
tagging, parsing, named entity recognition, coreference resolution, word sense
disambiguation, sentiment analysis, and natural language generation), to highly
complex systems (machine translation, automatic summarization, or question
answering). The importance of (and demand for) such tasks increases along with
the rapidly growing amount of textual information available on the Internet.

Many NLP applications exploit several NLP modules chained in a pipeline (such
as a sentence segmenter and a part-of-speech tagger prior to a parser).
However, if state-of-the-art solutions created by different authors – often
written in different programming languages, with different interfaces, using
different data formats and encodings – are to be used, a significant effort
must be invested into integrating the tools. Even if these issues are only of
a technical nature, in real research they constitute one of the limiting
factors for building more complex NLP applications.

We try to eliminate such problems by introducing a common NLP framework that
integrates a number of NLP tools and provides them with unified
object-oriented interfaces, which hide the technical issues from the developer
of a larger application. The framework's architecture seems viable – tens of
researchers and students have already contributed to the system, and the
framework has already been used for a number of research tasks carried out at
the Institute of Formal and Applied Linguistics as well as at some other
research institutions. The most complex application implemented within the
framework is English-Czech machine translation. The framework is called
Treex.1

* The presented research is supported by the grant MSM0021620838 and by the
  European Commission's 7FP grant agreement no. 231720 (EuroMatrix Plus). We
  would like to thank Martin Popel for useful comments on the paper.
1 The framework was originally called TectoMT from the start of its
  development in autumn 2005 [23], because one of the sources of motivation
  for building the framework was the development of a Machine Translation (MT)
  system using the tectogrammatical (deep-syntactic) sentence representation
  as the transfer medium. However, MT is by far not the only application of
  the framework. As the name seemed to be rather discouraging for those NLP
  developers whose research interests overlapped with neither tectogrammatics
  nor MT, TectoMT was rebranded to Treex in spring 2011. To avoid confusion,
  the name Treex is used throughout the whole text even when it refers to a
  more distant history.
The remainder of the paper is structured as follows. Section 2 overviews
related work that had to be taken into account when developing such a
framework. Section 3 presents the main design decisions Treex is built on.
English-Czech machine translation implemented in Treex is described in
Section 4, while other Treex applications are mentioned in Section 5, which
also concludes.


2   Related work

2.1   Theoretical background

Natural language is an immensely complicated phenomenon. Modeling the language
in its entirety would be extremely complex, therefore its description is often
decomposed into several subsequent layers (levels). There is no broadly
accepted consensus on details concerning the individual levels; however, the
layers typically roughly correspond to the following scale: phonetics,
phonology, morphology, syntax, semantics, and pragmatics.

One such stratificational hypothesis is Functional Generative Description
(FGD), developed by Petr Sgall and his colleagues in Prague since the
1960s [18]. FGD was used, with certain modifications, as the theoretical
framework underlying the Prague Dependency Treebank [6], which is a manually
annotated corpus of Czech newspaper texts from the 1990s. PDT in version 2.0
(PDT 2.0) adds three layers of linguistic annotation to the original texts:

1. morphological layer (m-layer)
   Each sentence is tokenized and each token is annotated with a lemma (basic
   word form, such as nominative singular for nouns) and a morphological tag
   (describing morphological categories such as part of speech, number, and
   tense).
2. analytical layer (a-layer)
   Each sentence is represented as a shallow-syntax dependency tree (a-tree).
   There is a one-to-one correspondence between m-layer tokens and a-layer
   nodes (a-nodes). Each a-node is annotated with the so-called analytical
   function, which represents the type of dependency relation to its parent
   (i.e. its governing node).
3. tectogrammatical layer (t-layer)
   Each sentence is represented as a deep-syntax dependency tree (t-tree).
   Autosemantic (meaningful) words are represented as t-layer nodes (t-nodes).
   Information conveyed by functional words (such as auxiliary verbs,
   prepositions and subordinating conjunctions) is represented by attributes
   of t-nodes. The most important attributes of t-nodes are: tectogrammatical
   lemma, functor (which represents the semantic value of a syntactic
   dependency relation) and a set of grammatemes (e.g. tense, number, verb
   modality, deontic modality, negation).
   Edges in t-trees represent linguistic dependencies except for several
   special cases, the most notable of which are paratactic structures
   (coordinations).
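To make the division of labor between the layers more tangible, the following
fragment encodes the clause "he does not record roses" on all three layers as
plain Perl data structures. It is only an illustrative sketch: the attribute
names and values are simplified and do not reproduce the actual PDT 2.0 or
Treex schemas. The point is that the auxiliary "does" and the particle "not"
have m-layer tokens and a-nodes of their own, but survive only as attributes
of the verb's t-node.

    #!/usr/bin/env perl
    # Illustrative only: a hand-built, simplified encoding of the three layers
    # for the clause "he does not record roses". Attribute names and values are
    # invented for the example and do not follow the real PML schemas.
    use strict;
    use warnings;

    # m-layer: tokens with lemmas and morphological tags
    my @m_layer = (
        { form => 'he',     lemma => 'he',     tag => 'PRP' },
        { form => 'does',   lemma => 'do',     tag => 'VBZ' },
        { form => 'not',    lemma => 'not',    tag => 'RB'  },
        { form => 'record', lemma => 'record', tag => 'VB'  },
        { form => 'roses',  lemma => 'rose',   tag => 'NNS' },
    );

    # a-layer: one a-node per token, each with an analytical function and a
    # parent index (0 stands for the technical root, indices are 1-based).
    my @a_layer = (
        { token => 1, afun => 'Sb',   parent => 4 },
        { token => 2, afun => 'AuxV', parent => 4 },
        { token => 3, afun => 'Neg',  parent => 4 },
        { token => 4, afun => 'Pred', parent => 0 },
        { token => 5, afun => 'Obj',  parent => 4 },
    );

    # t-layer: only autosemantic words become t-nodes; the auxiliary "does" and
    # the particle "not" are expressed as attributes (grammatemes) of the verb.
    my @t_layer = (
        { t_lemma => '#PersPron', functor => 'ACT',  parent => 2 },
        { t_lemma => 'record',    functor => 'PRED', parent => 0,
          grammatemes => { tense => 'sim', negation => 'neg1' } },
        { t_lemma => 'rose',      functor => 'PAT',  parent => 2,
          grammatemes => { number => 'pl' } },
    );

    printf "m-layer tokens: %d, a-nodes: %d, t-nodes: %d\n",
        scalar @m_layer, scalar @a_layer, scalar @t_layer;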
All three layers of annotation are described in annotation manuals distributed
with PDT 2.0.

This annotation scheme has been adopted and further modified in Treex. One of
the modifications consists in merging the m-layer and a-layer sentence
representations into a single data structure.2

2 As mentioned above, their units are in a one-to-one relation anyway; merging
  the two structures together has led to a significant reduction of time and
  memory requirements when processing large data, as well as to a lower burden
  for the eyes when browsing the structures.

Treex also profits from the technology developed during the PDT project,
especially from the existence of the highly customizable tree editor TrEd,
which is used as the main visualization tool in Treex, and from the XML-based
file format PML (Prague Markup Language, [14]), which is used as the main data
format in Treex.

2.2   Other NLP frameworks

Treex is not the only existing general NLP framework. We are aware of the
following other frameworks (a more detailed comparison can be found in [15]):

 – ETAP-3 [1] is a C/C++ closed-source NLP framework for English-Russian and
   Russian-English translation, developed in the Russian Academy of Sciences.
 – GATE (Java, LGPL) is one of the most widely used NLP frameworks, with an
   integrated graphical user interface. It is being developed at the
   University of Sheffield [4].
 – Apache OpenNLP (Java, LGPL)3 is an organizational center for open source
   NLP projects.
 – WebLicht4 is a Service Oriented Architecture for building annotated German
   text corpora.
 – Apertium [20] is a free/open-source machine translation platform with
   shallow transfer.

3 http://opennlp.sourceforge.net
4 http://weblicht.sfs.uni-tuebingen.de/englisch/index.shtml

In our opinion, none of these frameworks seems feasible (or mature enough) for
experiments on MT based on deep-syntactic dependency transfer. The only
exception is ETAP-3, whose theoretical assumptions are similar to those of
Treex (its dependency-based stratificational background theory, called
Meaning-Text Theory [13], bears several resemblances to FGD); however, it is
not an open-source project.

2.3   Contemporary machine translation

MT is a notoriously hard problem and it is studied by a broad research field
nowadays: every year there are several conferences, workshops and tutorials
dedicated to it (or even to its subfields). It goes beyond the scope of this
work even to mention all the contemporary approaches to MT, but several
elaborate surveys of current approaches are already available to the reader
elsewhere, e.g. in [10].

A distinction is usually made between two MT paradigms: rule-based MT (RBMT)
and statistical MT (SMT). Rule-based MT systems depend on the availability of
linguistic knowledge (such as grammar rules and dictionaries), whereas
statistical MT systems require human-translated parallel text, from which they
extract the translation knowledge automatically. One of the representatives of
the first group is the already mentioned system ETAP-3.

Nowadays, the most popular representatives of the second group are
phrase-based systems (in which the term ‘phrase’ stands simply for a sequence
of words, not necessarily corresponding to phrases in constituent syntax),
e.g. [8], derived from the IBM models [3].

Even if phrase-based systems have more or less dominated the field in recent
years, their translation quality is still far from perfect. Therefore we
believe it makes sense to investigate alternative approaches as well.

MT implemented in Treex lies somewhere between the two main paradigms. As in
RBMT, sentence representations used in Treex are linguistically interpretable.
However, the most important decisions during the translation process are made
by statistical models as in SMT, not by rules.


3   Treex architecture overview

3.1   Basic design decisions

The architecture of Treex is based on the following decisions:

 – Treex is primarily developed in Linux. However, platform independent
   solutions are searched for wherever possible.
 – The main programming language of Treex is Perl. However, a number of tools
   written in other languages have been integrated into Treex (after providing
   them with a Perl wrapper).
 – Linguistic interpretability – data structures representing natural language
   sentences in Treex must be understandable by a human (so that e.g.
   translation errors can be traced back to their source). Comfortable
   visualization of the data structures is supported.
 – Modularity – NLP tools in Treex are designed so that they are easily
   reusable for various tasks (not only for MT).
 – Rules-vs-statistics neutrality – the Treex architecture is neutral with
   respect to the rules vs. statistics opposition (rule-based as well as
   statistical solutions are combined).
 – Massive data – Treex must be capable of processing large data (such as
   millions of sentence pairs in parallel corpora), which implies that
   distributed processing must be supported.
 – Language universality – ideally, Treex should be easily extendable to any
   natural language.
 – Data interchange support – XML is used as the main storage format in Treex,
   but Treex must be able to work with a number of other data formats used in
   NLP.

3.2   Data structure units

In Treex, the representation of a text in a natural language is structured as
follows:

 – Document. A Treex document is the smallest independently storable unit. A
   document represents a piece of text (or several parallel pieces of text in
   the case of multilingual data) and its linguistic representations. A
   document contains an ordered sequence of bundles.
 – Bundle. A bundle corresponds to a sentence (or a tuple of sentences in the
   case of parallel data) and its linguistic representations. A bundle
   contains a set of zones.
 – Zone. Each language (languages are distinguished using ISO 639-2 codes in
   Treex) can have one or more zones in a bundle.5 Each zone corresponds to
   one particular sentence and contains at most one tree for each layer of
   linguistic description.
 – Tree. All sentence representations in Treex have the shape of an oriented
   tree.6 At this moment there are four types of trees: (1) a-trees –
   morphology and surface-dependency (analytical) trees, (2) t-trees –
   tectogrammatical trees, (3) p-trees – phrase-structure (constituency)
   trees, and (4) n-trees – trees of named entities.
 – Node. Each node contains (is labeled by) a set of attributes (name-value
   pairs).
 – Attribute. Some node attributes are universal (such as the identifier), but
   most of them are specific to a certain layer. The set of attribute names
   and their values for a node on a particular layer is declared using the
   Treex PML schema.7 Attribute values can be further structured.

5 Having more zones per language is useful e.g. for comparing machine
  translation with a reference translation, or translation outputs from
  several systems. Moreover, it highly simplifies the processing of parallel
  corpora, or comparisons of alternative implementations of a certain task
  (such as different dependency parsers).
6 However, tree-crossing edges such as anaphora links in a dependency tree can
  be represented too (as node attributes).
7 There are also “wild” attributes allowed, which can store any Perl data
  structure without its prior declaration by PML. However, such undeclared
  attributes should serve only for tentative or rapid development purposes, as
  they cannot be validated.

Of course, there are also many other types of data structures used by
individual integrated modules (such as dictionary lists, weight vectors and
other trained parameters, etc.), but they are usually hidden behind module
interfaces and no uniform structure is required for them.
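The nesting of these units can be illustrated with a small self-contained
sketch. The structure below is hypothetical and much simpler than the real
PML-based representation (the actual Treex::Core classes, method names and
afun values differ); it only shows how a document decomposes into bundles,
zones and trees, and how such a hierarchy can be traversed.

    #!/usr/bin/env perl
    # A simplified, hypothetical model of the Treex data hierarchy
    # (document -> bundles -> zones -> trees -> nodes);
    # this is not the real Treex::Core API, just nested Perl structures.
    use strict;
    use warnings;

    my $document = {
        bundles => [
            {   # one bundle per sentence (or per sentence tuple in parallel data)
                zones => {
                    en => {
                        sentence => 'The prince is shocked.',
                        trees    => {
                            a => {   # a toy a-tree: the root node with its children
                                form     => 'is',
                                afun     => 'Pred',
                                children => [
                                    { form => 'prince', afun => 'Sb',
                                      children => [
                                          { form => 'The', afun => 'AuxA', children => [] },
                                      ] },
                                    { form => 'shocked', afun => 'Pnom', children => [] },
                                ],
                            },
                        },
                    },
                },
            },
        ],
    };

    # Traverse the hierarchy and print every a-node with its analytical function.
    sub print_subtree {
        my ($node, $depth) = @_;
        print '  ' x $depth, "$node->{form}/$node->{afun}\n";
        print_subtree($_, $depth + 1) for @{ $node->{children} };
    }

    for my $bundle (@{ $document->{bundles} }) {
        for my $lang (sort keys %{ $bundle->{zones} }) {
            my $zone = $bundle->{zones}{$lang};
            print "[$lang] $zone->{sentence}\n";
            print_subtree($zone->{trees}{a}, 1);
        }
    }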
3.3   Processing units

There are two basic levels of processing units in Treex:

 – Block. Blocks are the smallest processing units independently applicable on
   a document.
 – Scenario. Scenarios are sequences of blocks. When a scenario is applied on
   a document, the blocks from the sequence are applied on the document one
   after another.

(a) Simple Treex scenario:

    Util::SetGlobal language=en  # do everyth. in English zone
    Block::Read::Text            # read a text from STDIN
    W2A::Segment                 # segment it into sentences
    W2A::Tokenize                # divide sentences into words
    W2A::EN::TagMorce            # morphological tagging
    W2A::EN::Lemmatiz            # lemmatization (basic word forms)
    W2A::EN::ParseMST            # dependency parsing
    W2A::EN::SetAfunAuxCPCoord   # fill analytical functions
    W2A::EN::SetAfun             # fill analytical functions
    Write::CoNLLX                # print trees in CoNLLX format
    Write::Treex                 # store trees into XML file

(b) Input text example:

    When the prince mentions the rose, the geographer explains that he does
    not record roses, calling them "ephemeral". The prince is shocked and
    hurt by this revelation. The geographer recommends that he visit the
    Earth.

(c) Fragment from the printed output (simplified):

    1   The         the         DT   2
    2   prince      prince      NN   3
    3   is          be          VBZ  0
    4   shocked     shock       VBN  5
    5   and         and         CC   3
    6   hurt        hurt        VBN  5
    7   by          by          IN   5
    8   this        this        DT   9
    9   revelation  revelation  NN   7
    10  .           .           .    3

(d) A-tree visualization in TrEd (tree figure not reproduced; each a-node is
    labeled with its word form, analytical function, and POS tag).

Fig. 1. Simple scenario for morphological and surface-syntactic analysis of
English texts. Generated trees are printed in the CoNLLX format, which is a
simple line-oriented format for representing dependency trees.
A block can change a document's content “in place”8 via a predefined
object-oriented interface. One can distinguish several broad categories of
blocks:

 – blocks for sentence analysis – blocks for tokenization, morphological
   tagging, parsing, anaphora resolution, etc.,
 – blocks for sentence synthesis – blocks for propagating agreement
   categories, ordering words, inflecting word forms, adding punctuation,
   etc.,
 – blocks for transfer – blocks for translating a component of a linguistic
   representation from one language to another, etc.,
 – blocks for parallel texts – blocks for word alignment, etc.,
 – writer and reader blocks – blocks for storing/loading Treex documents
   into/from files or other streams (in the PML or another format),9
 – auxiliary blocks – blocks for testing, printing, etc.

8 Pipeline processing (like with Unix text-processing commands) is not
  feasible here, since linguistic data are deeply structured and the price for
  serializing the data at each boundary would be high.
9 In former versions, format converters were considered tools separate from
  scenarios. However, providing the converters with the uniform block
  interface allows data to be read/written directly within a scenario, which
  is not only more elegant, but also more efficient (intermediate
  serialization and storage can be skipped).

If possible, we try to implement blocks in a language independent way.
However, many blocks will remain language specific (for instance, a block for
moving clitics in Czech clauses can hardly be reused for any other language).

There are large differences in the complexity of blocks. Some blocks contain
just a few simple rules (such as regular expressions for sentence
segmentation), while other blocks are Perl wrappers for quite complex
probabilistic models resulting from several years of research (such as blocks
for parsing).

As for block granularity, there are no widely agreed conventions for
decomposing large NLP applications.10 We only follow general recommendations
for system modularization. A piece of functionality should be performed by a
separate block if it has well defined input and output states of Treex data
structures, if it can be reused in more applications, and/or if it can be (at
least potentially) replaced by some other solution.

10 For instance, some taggers provide both a morphological tag and a lemma for
   each word form, while other taggers must be followed by a subsequent
   lemmatizer in order to achieve the same functionality.
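The following self-contained sketch re-implements the block–scenario pattern
in a few lines of plain Perl. The package layout and method names are invented
for the example (they are not the real Treex::Core interface); the point is
only that a scenario is an ordered list of blocks, each of which reads and
enriches a shared document structure.

    #!/usr/bin/env perl
    # Minimal sketch of the block/scenario pattern (hypothetical classes, not
    # the real Treex::Core API): every block implements process_document, and
    # a scenario applies its blocks to the document in order.
    use strict;
    use warnings;

    package Block::Segment;
    sub new { return bless {}, shift }
    sub process_document {    # split raw text into sentences (very naive rule)
        my ($self, $doc) = @_;
        $doc->{sentences} = [ $doc->{text} =~ /([^.!?]+[.!?])/g ];
    }

    package Block::Tokenize;
    sub new { return bless {}, shift }
    sub process_document {    # naive word/punctuation tokenization
        my ($self, $doc) = @_;
        $doc->{tokens} =
            [ map { [ $_ =~ /\w+|[[:punct:]]/g ] } @{ $doc->{sentences} } ];
    }

    package Scenario;
    sub new {
        my ($class, @blocks) = @_;
        return bless { blocks => \@blocks }, $class;
    }
    sub apply_to {
        my ($self, $doc) = @_;
        $_->process_document($doc) for @{ $self->{blocks} };
    }

    package main;

    my $doc = { text => 'The prince is shocked. The geographer explains.' };
    my $scenario = Scenario->new( Block::Segment->new, Block::Tokenize->new );
    $scenario->apply_to($doc);
    printf "%d sentences, first tokenized as: %s\n",
        scalar @{ $doc->{sentences} }, join( ' ', @{ $doc->{tokens}[0] } );

In Treex itself this is the pattern behind scenarios such as the one in
Fig. 1: each line of the scenario names one block, and the blocks are applied
to the document one after another.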
                                                               agreement). Coreference of pronouns is also resolved
                                                               and tectogrammatical nodes are enriched with infor-
                                                               mation on named entities (such as the distinction be-
8                                                              tween location, person and organization) resulting
   Pipeline processing (like with Unix text-processing com-
   mands) is not feasible here since linguistic data are       from Stanford Named Entity Recognizer [5].
   deeply structured and the price for serializing the data
                                                               11
   at each boundary would be high.                                  Formemes specify how tectogrammatical nodes are re-
 9
   In the former versions, format converters were consid-           alized in the surface sentence shape. For instance,
   ered as tools separated from scenarios. However, provid-         n:subj stands for semantic noun in the subject posi-
   ing the converters with the uniform block interface al-          tion, n:for+X for semantic noun with preposition for,
   lows to read/write data directly within a scenario, which        v:because+fin for semantic verb in a subordinating
   is not only more elegant, but also more efficient (inter-        clause introduced by the conjunction because, adj:attr for
   mediate serialization and storage can be skipped).               semantic adjective in attributive position. Formemes do
10
   For instance, some taggers provides both morphological           not constitute a genuine tectogrammatical component
   tag and lemma for each word form, while other taggers            as they are not oriented semantically (but rather mor-
   must be followed by a subsequent lemmatizer in order             phologically and syntactically). However, they have been
   to achieve the same functionality.                               added to t-trees in Treex as they facilitate the transfer.
Fig. 2. Analysis-transfer-synthesis translation scenario in Treex applied on
the English sentence “However, this very week, he tried to find refuge in
Brazil.”, leading to the Czech translation “Přesto se tento právě týden snažil
najít útočiště v Brazílii.”. Thick edges indicate functional and autosemantic
a-nodes to be merged.

4.2   Transfer

The transfer phase follows; its most difficult part consists in labeling the
tree with target-language lemmas and formemes. Changes of tree topology and of
other attributes12 are required relatively infrequently.

12 For instance, the number of a noun must be changed to plural if the
   selected target Czech lemma is a plurale tantum. Similarly, the verb tense
   must be predicted if an English infinitive or gerund verb form is
   translated to a finite verb form.

Our model for choosing the right target-language lemmas and formemes is
inspired by the Noisy Channel Model, which is the standard approach in
contemporary SMT and which combines a translation model with a language model
of the target language. In other words, one should not rely only on
information about how faithfully the meaning is transferred by some
translation equivalent; an additional model can also be used which estimates
how well the translation equivalent fits into the surrounding context.13

13 This corresponds to the intuition that translating into one's native
   language is simpler for a human than translating into a foreign language.

Unlike in mainstream SMT, in tectogrammatical transfer we apply this idea not
to linear structures but to trees. The translation model thus estimates the
probability of a source and target lemma pair, while the tree language model
estimates the probability of a lemma given its parent. The globally optimal
tree labelling is then revealed by the tree-modified Viterbi algorithm [22].

Fig. 3. Tectogrammatical transfer implemented as a Hidden Markov Tree Model.
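The decoding idea can be sketched as follows. The toy script below is not the
Treex implementation (the real models also score formemes, use many more
features and are trained on CzEng); under these simplifying assumptions it
combines, for every node, a translation score for each candidate target lemma
with a tree language model score conditioned on the parent's label, and finds
the globally best labelling bottom-up in the spirit of the tree-modified
Viterbi algorithm. The toy lemma pair "find refuge" / "najit utociste"
(diacritics omitted) is taken from the example in Fig. 2; all probabilities
are made up.

    #!/usr/bin/env perl
    # Toy tree-Viterbi decoder in the spirit of a Hidden Markov Tree Model
    # (illustrative only; not the Treex code, and all numbers are invented).
    use strict;
    use warnings;

    # Translation model: P(target lemma | source lemma), toy values.
    my %trans = (
        find   => { najit    => 0.6, nalezt => 0.4 },
        refuge => { utociste => 0.7, utulek => 0.3 },
    );

    # Tree language model: P(child target lemma | parent target lemma), toy values.
    my %edge = (
        najit  => { utociste => 0.5, utulek => 0.1 },
        nalezt => { utociste => 0.2, utulek => 0.2 },
        _root_ => { najit    => 0.5, nalezt => 0.5 },
    );

    # Source t-tree: "find" governs "refuge".
    my $tree = { lemma => 'find',
                 children => [ { lemma => 'refuge', children => [] } ] };

    # Bottom-up pass: for every node and candidate label, compute the best
    # log-score of its subtree and remember the best child labels.
    sub best_subtree_scores {
        my ($node) = @_;
        my @child_tables = map { best_subtree_scores($_) } @{ $node->{children} };
        my %table;    # candidate label => [ log-score, [best child labels] ]
        for my $cand ( keys %{ $trans{ $node->{lemma} } } ) {
            my $score = log $trans{ $node->{lemma} }{$cand};
            my @best_children;
            for my $child_table (@child_tables) {
                my ($best_label) = sort {
                    $child_table->{$b}[0] + log( $edge{$cand}{$b} // 1e-9 )
                        <=> $child_table->{$a}[0] + log( $edge{$cand}{$a} // 1e-9 )
                } keys %$child_table;
                $score += $child_table->{$best_label}[0]
                        + log( $edge{$cand}{$best_label} // 1e-9 );
                push @best_children, $best_label;
            }
            $table{$cand} = [ $score, \@best_children ];
        }
        return \%table;
    }

    my $root_table = best_subtree_scores($tree);
    my ($best_root) = sort {
        $root_table->{$b}[0] + log $edge{_root_}{$b}
            <=> $root_table->{$a}[0] + log $edge{_root_}{$a}
    } keys %$root_table;

    print "best label of the root node: $best_root\n";
    print "best labels of its children: @{ $root_table->{$best_root}[1] }\n";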
Originally, we estimated the translation model simply by using pair
frequencies extracted from English-Czech parallel data. A significant
improvement was reached after replacing this model with a Maximum Entropy
model, in which we employed a wide range of features resulting from the
source-side analysis. The weights were optimized using training data extracted
from the CzEng parallel treebank [2], which contains roughly 6 million
English-Czech pairs of analyzed and aligned sentences.

4.3   Synthesis

Finally, the surface sentence shape is synthesized from the tectogrammatical
tree, which is basically the reverse operation to the tectogrammatical
analysis: adding punctuation and functional words, spreading morphological
categories according to grammatical agreement, performing inflection (using
the Czech morphology database [7]), arranging word order, etc.

4.4   Evaluating translation quality

There are two general methods for evaluating the translation quality of the
outputs of MT systems: (1) the quality can be judged by humans (either using a
set of criteria such as grammaticality and intelligibility, or relatively, by
comparing outputs of different MT systems), or (2) the quality can be
estimated by automatic metrics, which usually measure some form of string-wise
overlap of an MT system's output with one or more reference (human-made)
translations.
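As a rough illustration of what such string-wise overlap can look like, the
toy function below computes a clipped unigram precision of a candidate
translation against a single reference. It is deliberately much simpler than
the metrics used in the shared-task evaluations mentioned below; it only shows
the general shape of automatic scoring, and the example sentences are invented
variants of the sample text from Fig. 1.

    #!/usr/bin/env perl
    # Toy string-overlap MT metric: clipped unigram precision of a candidate
    # against one reference (far simpler than real evaluation metrics).
    use strict;
    use warnings;
    use List::Util qw(min);

    sub unigram_precision {
        my ($candidate, $reference) = @_;
        my @cand_tokens = split ' ', lc $candidate;
        my (%cand_count, %ref_count);
        $cand_count{$_}++ for @cand_tokens;
        $ref_count{$_}++  for split ' ', lc $reference;

        # Each candidate token is counted at most as many times as it occurs
        # in the reference ("clipping").
        my $matched = 0;
        for my $token (keys %cand_count) {
            $matched += min( $cand_count{$token}, $ref_count{$token} // 0 );
        }
        return @cand_tokens ? $matched / @cand_tokens : 0;
    }

    my $reference = 'the geographer recommends that he visit the earth';
    my $candidate = 'the geographer recommends him to visit the earth';
    printf "unigram precision: %.2f\n",
        unigram_precision( $candidate, $reference );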
Both types of evaluation are used regularly during the development of our MT
system. Automatic metrics are used after any change of the translation
scenario, as they are cheap and fast to perform. Large-scale evaluations by
volunteer judges are organized annually as a shared task of the Workshop on
Statistical Machine Translation.14 The performance of the tectogrammatical
translation increases every year in both types of measures, and it already
outperforms some commercial as well as academic systems. Actually, it is the
participation in this shared task (a competition, in other words) that
provides the strongest motivation for the Treex developers.

14 http://www.statmt.org/wmt11/


5   Final remarks and conclusions

Even if tectogrammatical translation is considered the main application of
Treex, Treex has been used for a number of other research purposes as well:

 – other MT-related tasks – Treex has been used for developing alternative MT
   quality measures in [9], and for improving outputs of other MT systems by
   grammatical post-processing in [11],
 – building linguistic data resources – Treex has been employed in the
   development of resources such as the Prague Czech-English Dependency
   Treebank [21], the Czech-English parallel corpus CzEng [2], and the Tamil
   Dependency Treebank [16],
 – linguistic data processing services for research carried out at other
   institutions, such as data analyses for prosody prediction for the
   University of West Bohemia [17].

Treex significantly simplifies code sharing across individual research
projects in our institute. There are around 15 programmers (postgraduate
students and researchers) who have significantly contributed to the
development of Treex in the last years; four of them are responsible for
developing the central components of the framework infrastructure, called
Treex Core.

Last but not least, Treex is used for teaching purposes in our institute.
Undergraduate students are supposed to develop their own modules for
morphological and syntactic analysis of foreign languages of their choice. Not
only does the existence of Treex enable the students to make very fast
progress, but their contributions are also accumulated in the Treex Subversion
repository, which enlarges the repertory of languages treatable by Treex.15

15 There are modules for more than 20 languages available in Treex now.

There are two main challenges for the Treex developers now. The first
challenge is to continue improving the tectogrammatical translation quality by
better exploitation of the training data. The second challenge is to widen the
community of Treex users and developers by distributing the majority of Treex
modules via CPAN (the Comprehensive Perl Archive Network), which is a broadly
respected repository of Perl modules.

When thinking about a more distant future of MT and NLP in general, an
exciting question arises about the future relationship between linguistically
interpretable approaches (like that of Treex) and purely statistical
phrase-based approaches. Promising results of [11], which uses Treex for
improving the output of a phrase-based system and thus reaches
state-of-the-art MT quality in English-Czech MT, show that combinations of
both approaches might be viable.
References

 1. I. Boguslavsky, L. Iomdin, and V. Sizov: Multilinguality in ETAP-3: reuse
    of lexical resources. In G. Sérasset (ed.), COLING 2004 Multilingual
    Linguistic Resources, pp. 1–8, Geneva, Switzerland, August 28, 2004.
    COLING.
 2. O. Bojar, M. Janíček, Z. Žabokrtský, P. Češka, and P. Beňa: CzEng 0.7:
    parallel corpus with community-supplied translations. In Proceedings of
    the Sixth International Language Resources and Evaluation, Marrakech,
    Morocco, 2008. ELRA.
 3. P.E. Brown, V.J. Della Pietra, S.A. Della Pietra, and R.L. Mercer: The
    mathematics of statistical machine translation: parameter estimation.
    Computational Linguistics, 1993.
 4. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan: GATE: an
    architecture for development of robust HLT applications. In Proceedings
    of the 40th Annual Meeting on Association for Computational Linguistics,
    July 07–12, 2002.
 5. J.R. Finkel, T. Grenager, and C. Manning: Incorporating non-local
    information into information extraction systems by Gibbs sampling. In
    ACL '05: Proceedings of the 43rd Annual Meeting on Association for
    Computational Linguistics, pp. 363–370, Morristown, NJ, USA, 2005.
    Association for Computational Linguistics.
 6. J. Hajič, E. Hajičová, J. Panevová, P. Sgall, P. Pajas, J. Štěpánek,
    J. Havelka, and M. Mikulová: Prague Dependency Treebank 2.0. Linguistic
    Data Consortium, LDC Catalog No. LDC2006T01, Philadelphia, 2006.
 7. J. Hajič: Disambiguation of rich inflection – computational morphology of
    Czech. Charles University – The Karolinum Press, Prague, 2004.
 8. P. Koehn et al.: Moses: open source toolkit for statistical machine
    translation. In Proceedings of the Demo and Poster Sessions, 45th Annual
    Meeting of ACL, pp. 177–180, Prague, Czech Republic, June 2007.
    Association for Computational Linguistics.
 9. K. Kos and O. Bojar: Evaluation of machine translation metrics for Czech
    as the target language. Prague Bulletin of Mathematical Linguistics, 92,
    2009.
10. A. Lopez: A survey of statistical machine translation. Technical Report,
    Institute for Advanced Computer Studies, University of Maryland, 2007.
11. D. Mareček, R. Rosa, P. Galuščáková, and O. Bojar: Two-step translation
    with grammatical post-processing. In Proceedings of the 6th Workshop on
    Statistical Machine Translation, pp. 426–432, Edinburgh, Scotland, 2011.
    Association for Computational Linguistics.
12. R. McDonald, F. Pereira, K. Ribarov, and J. Hajič: Non-projective
    dependency parsing using spanning tree algorithms. In Proceedings of the
    Human Language Technology Conference and the Conference on Empirical
    Methods in Natural Language Processing, pp. 523–530, Vancouver, BC,
    Canada, 2005.
13. I.A. Mel'čuk: Dependency syntax: theory and practice. State University of
    New York Press, 1988.
14. P. Pajas and J. Štěpánek: Recent advances in a feature-rich framework for
    treebank annotation. In Proceedings of the 22nd International Conference
    on Computational Linguistics, volume 2, pp. 673–680, Manchester, UK, 2008.
15. M. Popel and Z. Žabokrtský: TectoMT: modular NLP framework. In Proceedings
    of the 7th International Conference on Advances in Natural Language
    Processing (IceTAL 2010), volume 6233 of LNCS, pp. 293–304, Berlin /
    Heidelberg, 2010. Springer.
16. L. Ramasamy and Z. Žabokrtský: Tamil dependency parsing: results using
    rule based and corpus based approaches. In Proceedings of the 12th
    International Conference CICLing 2011, volume 6608 of Lecture Notes in
    Computer Science, pp. 82–95, Berlin / Heidelberg, 2011. Springer.
17. J. Romportl: Zvyšování přirozenosti strojově vytvářené řeči v oblasti
    suprasegmentálních zvukových jevů [Increasing the naturalness of
    machine-generated speech in the area of suprasegmental sound phenomena].
    PhD Thesis, Faculty of Applied Sciences, University of West Bohemia,
    Pilsen, Czech Republic, 2008.
18. P. Sgall, E. Hajičová, and J. Panevová: The meaning of the sentence in its
    semantic and pragmatic aspects. D. Reidel Publishing Company, Dordrecht,
    1986.
19. D. Spoustová, J. Hajič, J. Votrubec, P. Krbec, and P. Květoň: The best of
    two worlds: cooperation of statistical and rule-based taggers for Czech.
    In Proceedings of the Workshop on Balto-Slavonic Natural Language
    Processing, ACL 2007, pp. 67–74, Prague, 2007.
20. F.M. Tyers, F. Sánchez-Martínez, S. Ortiz-Rojas, and M.L. Forcada:
    Free/open-source resources in the Apertium platform for machine
    translation research and development. Prague Bulletin of Mathematical
    Linguistics, 93, 2010, 67–76.
21. J. Šindlerová, L. Mladová, J. Toman, and S. Cinková: An application of the
    PDT-scheme to a parallel treebank. In Proceedings of the 6th International
    Workshop on Treebanks and Linguistic Theories (TLT 2007), pp. 163–174,
    Bergen, Norway, 2007.
22. Z. Žabokrtský and M. Popel: Hidden Markov tree model in dependency-based
    machine translation. In Proceedings of the 47th Annual Meeting of the
    Association for Computational Linguistics, 2009.
23. Z. Žabokrtský, J. Ptáček, and P. Pajas: TectoMT: highly modular MT system
    with tectogrammatics used as transfer layer. In Proceedings of the 3rd
    Workshop on Statistical Machine Translation, ACL, 2008.