Syntactic Translation Patterns from a Parallel Treebank

Mihaela Colhon
University of Craiova, Romania
Department of Computer Science
A.I.Cuza street, no. 13, code 200585
mcolhon@inf.ucv.ro


ABSTRACT
The goal of the presented parallel phrase extraction algorithm is to provide a rich and
robust set of syntactic translation patterns. To make this approach feasible, we consider
the phrase-to-phrase alignments of a bilingual treebank annotated with syntactic
constituents. For the intended purpose, the extracted phrasal nodes are encoded by the
syntactic information of their components, highlighting some special constructs such as
the functional words.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing - Language parsing and
understanding; I.2.6 [Artificial Intelligence]: Learning - Parameter learning

Keywords
Parallel syntactic patterns, phrase-based translation

BCI'12, September 16-20, 2012, Novi Sad, Serbia.
Copyright © 2012 by the paper's authors. Copying permitted only for private and academic
purposes. This volume is published and copyrighted by its editors.
Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences,
University of Novi Sad.

1.    INTRODUCTION
   Parallel corpora can be used to generate extremely valuable linguistic knowledge; for
example, they support the automatic identification of text segments that are reciprocal
translations [13]. Two text segments from a bitext (a parallel corpus) that are reciprocal
translations form a translation unit [13]. The translation units that correspond to
syntactic phrases can be used to generate other sentences in the target language of a
Machine Translation system: instead of translating the individual words of the source
language, the system translates phrases and assembles the final translation as a
permutation of these [14].
   Methods for Machine Translation (MT) have increasingly leveraged not only the formal
machinery of syntax but also linguistic tree structures of the source language, the target
language or both. Phrase-based statistical MT (PB-SMT) techniques for extracting phrases,
although not syntactically motivated, enjoy very high coverage [1]. Basic PB-SMT systems
work with phrase pairs that are consistent with the word alignment: each phrase is a
contiguous string of words, and the words of the two phrases are aligned to each other and
not to words outside the pair [8].
   Machine translation based on syntactic trees has been extensively studied in recent
years because of the general need to improve the performance of state-of-the-art PB-SMT
[2].
   Aligning the parse trees yields a structural alignment between two parallel sentences
and, more precisely, supports experiments that test the feasibility of automatic
cross-lingual transfer of syntactic constituents. Broadly speaking, a transfer component is
a system of rules that relate words and structures in one language (the source language)
to words and structures of another language (the target language).
   Traditionally, phrases are taken to be syntactic constituents of a sentence. Even if
not all the words between two phrases are aligned, the phrases can still align very well
[11]. By aligning the inner nodes of two parallel parse trees, the phrases represented by
these nodes are put in correspondence, as the subtrees of the syntactic analysis encode
the structure of the represented syntactic phrases.
   Such techniques have shown that starting with large syntactic phrase tables, and
preferring syntactic phrases when they overlap with non-syntactic ones, allows the
learning of "translation knowledge". They show improvements in decoding speed and also in
translation quality, the latter resulting from the precision of these syntax-motivated
phrases [1].
   Most of the phrases identified in the parse trees are expected to be translated without
interleaving with other phrases or words. In general, noun phrases tend to obey this rule
to a much greater degree. At the opposite end, verb phrases usually undergo structural
modifications during translation, caused by adjunct movement [4].
   The goal of the presented algorithm for extracting parallel syntactic patterns from a
bilingual treebank is to generate a set of good-quality translation patterns intended to
be learned by a statistical syntax-based Machine Translation system.
   The presented parallel syntactic sequences were extracted from a treebank with
syntactic constituents, an English-Romanian Treebank [5]. The treebank was built upon a
parallel English-Romanian corpus that is word-aligned and annotated at the morphological
and syntactic level. The syntactic trees of the Romanian texts are generated from the
syntactic phrases of the English parallel texts, which are obtained automatically by means
of a syntactic parser, the Stanford Parser [12]. The Romanian tree generation mechanism
reuses and adjusts existing tools and algorithms for cross-lingual transfer of syntactic
constituents and for syntactic tree alignment.
   The treebank was constructed upon 1420 sentences from an English-Romanian parallel
corpus developed at A.I.Cuza
University of Iaşi by the Natural Language Processing Group of the Faculty of Computer
Science. The corpus is XML encoded, obeying a simplified form of the XCES standard [10].
For the bilingual corpus construction, the English and Romanian parts of the Acquis
Communautaire corpus (see footnote 1) were used.
   All the words of this English-Romanian corpus are annotated with lemmas,
morphosyntactic information (gender, number, person and case) and part-of-speech markers.
The tagsets used to annotate the words come from the MULTEXT-East morphosyntactic
specifications (the latest version of these specifications is given in [7]).

Footnote 1: The Acquis Communautaire corpus contains about 12,000 Romanian documents and
6,256 parallel English-Romanian documents [6].

2.   PARALLEL PATTERNS WITH SYNTACTIC CONSTITUENTS
   Following the method of Galley et al. described in [9], the phrase extraction process
is supported by the parallel parse trees of the constructed English-Romanian treebank. For
each alignment between inner nodes of the syntactic trees, the descendants of the aligned
nodes are examined. Depending on the purpose for which the syntactic sequences are
extracted, specific words or constructions of a certain structure can be highlighted in
the list of descendants.
   In this article, the syntactic sequences are intended to provide information about the
manner in which functional words can affect translation. For this reason the functional
words are given in their complete word form, accompanied by complete information about
their morphosyntactic properties.
   In any syntactic structure we can identify two major categories of words: content
words, which describe objects, entities, properties, relationships or events and which are
syntactically represented by nouns, adjectives, verbs and adverbs, and functional words,
which help put words together into a structurally correct sentence. The functional words
also tell how the other components of the sentence are related to each other. The
functional words can be determiners, quantifiers, prepositions or connectives.
   The span of a node n of a syntactic tree is taken to be the subset of nodes that are
reachable from n [9]. In a bottom-up fashion, the algorithm for extracting parallel
syntactic patterns "visits" each English syntactic tree and expands all its inner nodes
that are aligned with at least one node from the Romanian parallel syntactic tree. The
spans of the aligned English and Romanian phrasal nodes are taken to be the parallel
syntactic patterns of our study and are therefore stored in a database (see Section 2.2).
The method is quick and easy enough to be used on large-scale data sets.

Figure 1: A screen shot of the English-Romanian Syntactic Patterns Dataset.

   The following examples show why parallel syntactic patterns matter from the point of
view of automatically learned translation rules:

   • simple lexical patterns for translating special words, such as functional words, can
     be treated as examples of patterns in which optional modifiers are inserted;

   • patterns in which we find "lexical holes", determined by one-to-zero alignments
     between the words/tokens of the parallel sequences, for example English noun phrases
     that contain the word "of" as a separator;

   • by analyzing large sets of parallel patterns, we can identify "part-of-speech
     affinities": it is well known that translated words tend to keep their part of
     speech, but when this is not the case, the resulting part of speech of the
     translation is not random.

   From the English-Romanian Parallel Treebank with Syntactic Constituents, 2120
English-Romanian syntactic patterns with functional words were extracted. The
representation in which the patterns are stored provides sufficiently good descriptions of
the domain of locality of the functional words.

2.1 Representation Formalism
   The representation for the English syntactic sequences with functional words is an
ordered sequence of elements given in the following form:

   [ { Phrasal-Tag }* Pos-Tag/FW { Phrasal-Tag }* ]

where FW denotes a functional word.
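As a minimal sketch of how such a bracketed sequence can be read mechanically, the Python fragment below splits a pattern string into its elements and separates the tag from the functional word where a `tag/word` element occurs. The names `Element`, `parse_pattern` and `functional_words` are illustrative assumptions and are not part of the described resource; the sample patterns are the ones quoted in the paper.

```python
from typing import List, NamedTuple, Optional


class Element(NamedTuple):
    """One element of a pattern: a tag, plus the word form if it is given."""
    tag: str
    word: Optional[str]


def parse_pattern(pattern: str) -> List[Element]:
    """Split a pattern such as '[IN/at, NP]' into (tag, word) elements.

    Hypothetical helper: elements are assumed to be comma-separated and a
    functional word is assumed to be written as 'tag/word'.
    """
    body = pattern.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"pattern must be enclosed in brackets: {pattern!r}")
    elements = []
    for item in body[1:-1].split(","):
        item = item.strip()
        if not item:
            continue
        tag, sep, word = item.partition("/")
        elements.append(Element(tag.strip(), word.strip() if sep else None))
    return elements


def functional_words(pattern: str) -> List[str]:
    """Return the word forms that are spelled out in the pattern."""
    return [e.word for e in parse_pattern(pattern) if e.word is not None]


if __name__ == "__main__":
    print(parse_pattern("[IN/at, NP]"))
    print(functional_words("[NP, PRP, CC/and, NP]"))
```

Keeping the tag and the word form as separate fields makes it straightforward to group patterns either by structure alone or by the specific functional word they contain.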


Figure 2: Statistics for English patterns consisting of a preposition and a noun phrase ([IN, NP]).
Panels: (a) [IN/at, NP]; (b) [IN/by, NP]; (c) [IN/for, NP].


   By parsing the English sentences with the Stanford Parser, PENN Treebank parse trees
were generated. As a direct consequence, the English texts are annotated with PENN phrasal
tags, as this is the tagging standard used by the Stanford Parser. In this annotation
formalism, the functional words of the English texts are the sentence tokens that carry
one of the following PENN POS tags: CC (coordinating conjunction), DT (determiner), IN
(preposition/subordinating conjunction), MD (modal), PRP (personal pronoun), PRP$
(possessive pronoun), RP (particle), TO (word to), WDT (wh-determiner), WP (wh-pronoun),
WP$ (possessive wh-pronoun), WRB (wh-adverb).
   Here are some examples of English syntactic patterns:
• [NP, PRP, CC/and, NP] the syntactic phrase having this span is made of two noun phrases
(NP) linked by a personal pronoun (PRP) and a functional word, the coordinating
conjunction "and" (in this specific order). The two NP phrases are not expanded because
each of them has its own alignment, and thus their structure is given in another parallel
syntactic sequence.
• [RB, JJ, CC/and, JJ] the syntactic phrase having this structure contains two adjectives
(JJ) linked by a functional word (the conjunction "and") and preceded by an adverb (RB).
   The corresponding Romanian syntactic sequences are encoded in a similar format.
   The Romanian syntactic trees of the English-Romanian Treebank were automatically
constructed by means of a bottom-up tree generation algorithm guided by the word
alignments of the corpus [5]. As a consequence, the annotations of the Romanian words
preserve the MULTEXT-East word specifications of the corpus, as these data include the
morphosyntactic details needed in any syntactic study, while the PENN Treebank phrasal
tags are used for labeling the phrasal constituents.
   The Romanian functional words are therefore those tokens/words that, in the
MULTEXT-East tagset formalism, have MSD tags with the following prefixes: P- (pronoun),
such as Pd- (demonstrative pronoun), Ps- (possessive pronoun), Px- (reflexive pronoun),
D- (determiner), T- (article), S- (adposition), C- (conjunction), Q- (particle).
   The representation for the Romanian syntactic sequences with functional words is an
ordered sequence of elements given in the following form:

   [ { Phrasal-Tag }* MSD-Tag/FW { Phrasal-Tag }* ]

where FW denotes a functional word and MSD denotes the morphosyntactic descriptions
encoded in the MULTEXT-East morphosyntactic specifications.
   Here are some examples of Romanian syntactic patterns:
• [Di3-po---e/altor, NP] the syntactic phrase given by this sequence contains a determiner
(an MSD tag starting with D) followed by a noun phrase.
• [VP, Crssp/şi, Tsfs/a, NP] the syntactic phrase whose structure is encoded in this
pattern is made of a verb phrase (VP) and a noun phrase (NP) linked by a conjunction (a C-
MSD tag) and an article (a T- MSD tag).

2.2 Linguistic Resource with Syntactic Patterns
   Each resulting parallel sequence is stored in a database record with four fields (see
Figure 1). The SynPhrase_En field stores the span of an English syntactic phrase, while
the SynPhrase_RO field stores the span of the aligned Romanian syntactic phrase. The last
two fields hold the PENN syntactic subtrees rooted at the aligned syntactic phrases:
Treebank_EN gives the bracket representation of the subtree rooted at the English phrase,
while Treebank_RO is the subtree corresponding to the Romanian phrase. Examples of records
of this linguistic resource are listed in Table 1.
   Statistics about the translation of a particular English syntactic sequence into
Romanian can easily be obtained from the constructed database table.
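The kind of statistic mentioned above can be sketched with an in-memory SQLite table that mirrors the four-field record. This is only an illustration under stated assumptions: the table name `patterns`, the helper `translation_counts`, the underscore spelling of the column names and all row counts are invented for the example, the pattern `[Spsa/în, NP]` is a made-up alternative, and the paper does not say which database system was actually used.

```python
import sqlite3

# Hypothetical schema mirroring the four fields described in Section 2.2.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patterns ("
    "SynPhrase_En TEXT, SynPhrase_RO TEXT, Treebank_EN TEXT, Treebank_RO TEXT)"
)

# Toy rows: the counts are invented and the subtree fields are left empty.
rows = [
    ("[IN/at, NP]", "[Spsa/la, NP]", "", ""),
    ("[IN/at, NP]", "[Spsa/la, NP]", "", ""),
    ("[IN/at, NP]", "[Spsa/la, NP]", "", ""),
    ("[IN/at, NP]", "[Spsa/în, NP]", "", ""),  # invented alternative
    ("[IN/as, NP]", "[Rw/cât, mai/Rp, ADJP]", "", ""),
]
conn.executemany("INSERT INTO patterns VALUES (?, ?, ?, ?)", rows)


def translation_counts(connection, english_pattern):
    """Count how often each Romanian pattern is aligned with an English one."""
    cursor = connection.execute(
        "SELECT SynPhrase_RO, COUNT(*) AS n FROM patterns "
        "WHERE SynPhrase_En = ? "
        "GROUP BY SynPhrase_RO ORDER BY n DESC, SynPhrase_RO",
        (english_pattern,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for romanian_pattern, count in translation_counts(conn, "[IN/at, NP]"):
        print(romanian_pattern, count)
```

Grouping on the Romanian span gives the frequency distribution of translations for one English pattern, which is the type of information a per-pattern chart would display.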


Table 1: Examples of English-Romanian Syntactic Patterns Together with Their Treebank Representations

Record 1
   Phrase_En:   [IN/as, NP]
   Phrase_RO:   [Rw/cât, mai/Rp, ADJP]
   Treebank_EN: [PP [IN Rsp/as] [NP [NP Afp/strict] [ADJP [RB Cs/as] [JJ Afp/possible]]]]
   Treebank_RO: [PP [Rw 14/cât] [Rp 15/mai] [ADJP [Afpfp-n 16/stricte] [ADJP [Rgp 17/posibil]]]]

Record 2
   Phrase_En:   [IN/at, NP]
   Phrase_RO:   [Spsa/la, NP]
   Treebank_EN: [PP [IN Sp/at] [NP [NP [DT Dd/the] [NN Ncns/end]] [PP [IN Sp/of] [NP [DT Dd/the] [JJ Afp/financial] [NN Ncns/year]]]]]
   Treebank_RO: [NP [Spsa 1/la] [NP [NP [Ncfsry 2/încheierea]] [NP [Ncmsoy 3/exerciţiului] [ADJP [Afpms-n 4/financiar]]]]]
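The relation between an English bracketed subtree and its span pattern can be illustrated with a short Python sketch. It rests on two simplifying assumptions that are not stated in the paper: every phrasal child of the root is treated as aligned (and therefore kept collapsed as its phrasal tag), and a leaf is written as `annotation/word`. The names `Node`, `parse_tree` and `span_pattern` are hypothetical, and the tag set is the list of English functional-word tags given in Section 2.1 (both the PP$ and PRP$ spellings are accepted).

```python
import re

# PENN tags listed in Section 2.1 as marking English functional words.
FUNCTIONAL_TAGS = {
    "CC", "DT", "IN", "MD", "PRP", "PRP$", "PP$",
    "RP", "TO", "WDT", "WP", "WP$", "WRB",
}


class Node:
    """A node of a bracketed tree: a label, child nodes, optional leaf token."""

    def __init__(self, label):
        self.label = label
        self.children = []
        self.leaf = None


def parse_tree(text):
    """Parse a bracketed tree such as '[PP [IN Sp/at] [NP ...]]'."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] != "[":
            raise ValueError("expected '['")
        node = Node(tokens[pos + 1])
        pos += 2
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                node.children.append(parse())
            else:
                node.leaf = tokens[pos]  # e.g. 'Sp/at'
                pos += 1
        pos += 1  # consume the closing bracket
        return node

    root = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return root


def span_pattern(root):
    """Build the pattern of the root from its direct children.

    Assumption: phrasal children are aligned, so only their tag is kept;
    functional words keep their word form as 'tag/word'.
    """
    items = []
    for child in root.children:
        is_word = child.leaf is not None and not child.children
        if is_word and child.label in FUNCTIONAL_TAGS:
            word = child.leaf.rsplit("/", 1)[-1]
            items.append(f"{child.label}/{word}")
        else:
            items.append(child.label)
    return "[" + ", ".join(items) + "]"


if __name__ == "__main__":
    tree = parse_tree(
        "[PP [IN Sp/at] [NP [NP [DT Dd/the] [NN Ncns/end]] "
        "[PP [IN Sp/of] [NP [DT Dd/the] [JJ Afp/financial] [NN Ncns/year]]]]]"
    )
    print(span_pattern(tree))
```

Under these assumptions the two Treebank_EN entries of Table 1 yield their listed patterns; handling unaligned inner nodes would require the alignment information, which this sketch does not model.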


   From the statistics illustrated in Figure 2, one can observe that the Romanian
translation of the English syntactic pattern [IN/at, NP] does not change the order between
the noun phrase and the preceding preposition, and replaces the preposition "at" with the
Romanian preposition "la". The preposition "by" from the English pattern [IN/by, NP] is
translated equally often with the Romanian preposition "de" and with the Romanian
prepositional collocation "de către", while the preposition "for" from the English
sequence [IN/for, NP] is translated with the Romanian preposition "pentru".

3.    CONCLUSIONS
   Statistical Machine Translation systems that use syntactic information in the
translation process must be trained with syntactic patterns that correspond to reciprocal
translations in the languages of the MT system. Such training can help the translation not
only with the structural differences between the translations but also with the
re-ordering of the words in the target sentence [3].
   Even if the lexical coverage of the corpus used, the Acquis Communautaire corpus, is
not representative, an MT system can still benefit from the translations similar in
structure and semantics that exist between the parallel sentences of the corpus.
   The meanings of some special words, such as functional words, can also be explored
easily by analysing the changes that syntactic patterns containing such words undergo
during translation. As currently constructed, the resource focuses on the importance that
functional words have in the translation process. However, syntactic patterns can also be
generated to highlight other constructions, for example the polylexical units of a natural
language phrase.
   As future work, we intend to enlarge the bilingual treebank in order to permit the
generation of a larger set of parallel syntactic patterns.

4.    ACKNOWLEDGMENTS
   The author M. Colhon has been funded for this research by the strategic grant
POSDRU/89/1.5/S/61968, Project ID 61986 (2009), co-financed by the European Social Fund
within the Sectorial Operational Program Human Resources Development 2007-2013.

5.    REFERENCES
 [1] V. Ambati, A. Lavie, and J. Carbonell. Extraction of syntactic translation models
     from parallel data using syntax from source and target languages. In MT Summit XII,
     2009.
 [2] J. G. Araújo and H. M. Caseli. Alignment of Portuguese-English syntactic trees using
     part-of-speech filters.
     http://www.cs.famaf.unc.edu.ar/~laura/nlpw/nlpw/papers/Araujo_Caseli.pdf. Online;
     accessed 19-June-2012.
 [3] A. Ceauşu. Rich morfo-syntactic description for factored machine translation with
     highly inflected languages as target. In Workshop on Machine Translation and
     Morphologically-rich Languages, University of Haifa, 2011.
 [4] M. Colhon. A contrastive study of syntactic constituents in English and Romanian
     texts. In Proc. of the Workshop "Language Resources and Tools with Industrial
     Applications", pages 11-20, 2011.
 [5] M. Colhon. Language engineering for syntactic knowledge transfer (submitted).
     Computer Science and Information Systems Journal, 2012.
 [6] D. Cristea and C. Forăscu. Linguistic resources and technologies for Romanian
     language. Computer Science Journal of Moldova, 14(1(40)), 2006.
 [7] T. Erjavec. MULTEXT-East version 4: Multilingual morphosyntactic specifications,
     lexicons and corpora. In Proc. of LREC'10. ELRA, 2010.
 [8] M. K. G. Wenniger and K. Sima'an. A toolkit for visualizing the coherence of
     tree-based reordering with word-alignments. In Proc. of the 5th MT-Marathon, pages
     97-104, 2010.
 [9] M. Galley, M. Hopkins, K. Knight, and D. Marcu. What's in a translation rule? In
     Proc. of HLT-NAACL 2004, pages 273-280. ACL, Boston, USA, 2004.
[10] N. Ide, P. Bonhomme, and L. Romary. XCES: An XML-based encoding standard for
     linguistic corpora. In Proc. of the 2nd LREC, Paris: ELRA, 2000.
[11] R. Ion, R. Ceauşu, and D. Tufiş. Dependency-based phrase alignment. In Proc. of the
     5th LREC, pages 1290-1293, 2006.
[12] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proc. of the 41st
     Annual Meeting of ACL, pages 423-430, 2003.
[13] D. Tufiş and R. Ion. Parallel corpora, alignment technologies and further prospects
     in multilingual resources and technology infrastructure. In Proc. of the 4th SPED,
     2007.
[14] K. Yamada and K. Knight. A syntax-based statistical translation model. In Proc. of
     ACL, pages 523-530, 2001.

