Syntactic Translation Patterns from a Parallel Treebank

Mihaela Colhon
University of Craiova, Romania
Department of Computer Science
A.I.Cuza street, no. 13, code 200585
mcolhon@inf.ucv.ro

ABSTRACT
The goal of the presented parallel phrase extraction algorithm is to provide a rich and robust set of syntactic translation patterns. To make this approach feasible, we consider the phrase-to-phrase alignments of a bilingual treebank annotated with syntactic constituents. For the intended purpose, the extracted phrasal nodes are encoded by the syntactic information of their components, highlighting some special constructs such as the functional words.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing - Language parsing and understanding; I.2.6 [Artificial Intelligence]: Learning - Parameter learning

Keywords
Parallel syntactic patterns, phrase-based translation

BCI’12, September 16–20, 2012, Novi Sad, Serbia. Copyright © 2012 by the paper’s authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, University of Novi Sad.

1. INTRODUCTION
Parallel corpora can be used to generate extremely valuable linguistic knowledge; for instance, they can support the automatic identification of text segments that represent reciprocal translations [13]. Two segments of text from a bitext (parallel corpus) which represent reciprocal translations make a translation unit [13]. The translation units that correspond to syntactic phrases can be used to generate other sentences in the target language of a Machine Translation system: instead of generating translations of individual words in the source language, the system generates translations of the phrases and assembles the final translation by a permutation of these [14].

Methods for Machine Translation (MT) have increasingly leveraged not only the formal machinery of syntax but also linguistic tree structures of either the source language, the target language or both. Phrase-based statistical MT (PB-SMT) techniques for extracting phrases, although not syntactically motivated, enjoy a very high coverage [1]. Basic PB-SMT systems work with phrase pairs that are consistent with the word alignment: the words of a phrase are contiguous strings consisting of words aligned to each other and not to words outside [8].

Machine translation based on syntactic trees has been extensively studied in recent years due to the general need to improve the performance of state-of-the-art PB-SMT [2].

Alignment of the parse trees can offer a structural alignment between two parallel sentences. More precisely, it can support experiments that test the feasibility of the automatic cross-lingual transfer of syntactic constituents. Broadly speaking, a transfer component is a system of rules that relate words and structures in one language to words and structures of another language (the target language).

Traditionally, phrases are taken to be syntactic constituents of a sentence. Even if not all the words between two phrases are aligned, the phrases can still align very well [11]. By aligning the inner nodes of two parallel parse trees, the phrases represented by these nodes are put in correspondence, as the subtrees of the syntactic analysis encode the structure of the represented syntactic phrases.

Such techniques have shown that starting with large syntactic phrase tables and preferring syntactic phrases when they overlap with non-syntactic ones allows the learning of “translation knowledge”. They show improvements in decoding speed and also in translation quality, which result from the precision of these syntax-motivated phrases [1].

Most of the phrases identified in the parse trees are expected to be translated without interleaving with other phrases or words. In general, noun phrases tend to obey this rule to a much greater degree. At the opposite end, verb phrases usually undergo modifications in structure during translation, caused by adjunct movement [4].

The goal of the presented algorithm for extracting parallel syntactic patterns from a bilingual treebank is to generate a set of good-quality translation patterns intended to be learned by a statistical syntax-based Machine Translation system.

The presented parallel syntactic sequences were extracted from a treebank with syntactic constituents, an English-Romanian Treebank [5]. The treebank was built upon a parallel English-Romanian corpus, word-aligned and annotated at the morphological and syntactic level. The syntactic trees of the Romanian texts are generated based on the syntactic phrases of the English parallel texts, which are automatically obtained by means of a syntactic parser, the Stanford Parser [12]. The Romanian tree generation mechanism reuses and adjusts existing tools and algorithms for cross-lingual transfer of syntactic constituents and syntactic tree alignment.

The treebank was constructed upon 1420 sentences from an English-Romanian parallel corpus developed at A.I.Cuza University of Iaşi by the Natural Language Processing Group of the Faculty of Computer Science. The corpus is XML encoded, obeying a simplified form of the XCES standard [10]. For the bilingual corpus construction, the English and Romanian parts of the Acquis Communautaire corpus¹ were used. All the words of this English-Romanian corpus are annotated with lemmas, morphosyntactic information (gender, number, person and case) and Part of Speech markers. The tagsets used to annotate the words come from the MULTEXT-East morphosyntactic specifications (the latest version of these specifications is given in [7]).

¹ The Acquis Communautaire corpus contains about 12,000 Romanian documents and 6,256 parallel English-Romanian documents [6].

Figure 1: A screen shot of the English-Romanian Syntactic Patterns Dataset.

2. PARALLEL PATTERNS WITH SYNTACTIC CONSTITUENTS
Following the method of Galley et al. described in [9], the phrase extraction process is supported by the parallel parse trees of the constructed English-Romanian treebank. For each alignment between inner nodes of the syntactic trees, the descendants of the aligned nodes are examined. According to the purpose for which the syntactic sequences are extracted, some specific words or constructions of a certain structure can be highlighted in the list of descendants.

For the present article, the syntactic sequences are intended to provide information about the manner in which the functional words can affect translation. For this reason the functional words are given in their complete word form, accompanied by complete information about their morphosyntactic properties.

In any syntactic structure we can identify two major categories of words: content words, which describe objects, entities, properties, relationships or events and which are syntactically represented by nouns, adjectives, verbs and adverbs, and functional words, which help put words together in a structurally correct sentence. The functional words also tell how the other components of the sentence are related to each other. The functional words can be determiners, quantifiers, prepositions or connectives.

The span of a node n of a syntactic tree is taken to be the subset of nodes that are reachable from n [9]. In a bottom-up fashion, the algorithm for extracting parallel syntactic patterns “visits” each English syntactic tree and expands all its inner nodes that are aligned with at least one node from the Romanian parallel syntactic tree. The spans of the aligned English and Romanian phrasal nodes are taken to be the parallel syntactic patterns of our study and are therefore stored in a database (see Section 2.2). The method is quick and easy enough to be used on large-scale data sets.

Here are some examples of the importance of parallel syntactic patterns from the point of view of automatically learned translation rules:

• simple lexical patterns for translating special words, such as functional words, can be treated as examples of patterns in which optional modifiers are inserted;

• patterns in which we found “lexical holes”, determined by the existence of one-to-zero alignment mappings between the words/tokens of the parallel sequences, for example English noun phrases that contain the word “of” as separator;

• by analyzing large sets of parallel patterns, we can identify “part-of-speech affinities”: it is generally known that translated words tend to keep their part of speech, but when this is not the case, the resulting part of speech of the translation is not random.

From the English-Romanian Parallel Treebank with Syntactic Constituents, 2120 English-Romanian syntactic patterns with functional words were extracted. The representation in which the patterns are stored provides sufficiently good descriptions of the domain of locality of the functional words.

2.1 Representation Formalism
The representation for the English syntactic sequences with functional words is an ordered sequence of elements given in the following form:

[ {Phrasal-Tag}* Pos-Tag/FW {Phrasal-Tag}* ]

where FW denotes a functional word.

Figure 2: Statistics for English patterns consisting of a preposition and a noun phrase ([IN, NP]): (a) [IN/at, NP], (b) [IN/by, NP], (c) [IN/for, NP].
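To make the bracketed representation above concrete, the following minimal sketch reads a pattern string such as [IN/at, NP] into a list of elements. It assumes that elements are separated by commas and that a functional word is attached to its tag with a slash. The names PatternElement, parse_pattern and functional_words are illustrative and are not part of the described resource.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PatternElement:
    """One element of a syntactic pattern (hypothetical helper type)."""
    tag: str                    # phrasal or POS tag, e.g. "NP" or "IN"
    word: Optional[str] = None  # functional word form, e.g. "at"; None if absent


def parse_pattern(text: str) -> List[PatternElement]:
    """Split a bracketed pattern such as "[IN/at, NP]" into its elements.

    Assumption: elements are comma-separated, and the part after the
    first "/" of an element is the functional word.
    """
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError(f"pattern must be enclosed in brackets: {text!r}")
    elements = []
    for item in body[1:-1].split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty pieces, e.g. from a trailing comma
        tag, sep, word = item.partition("/")
        elements.append(PatternElement(tag.strip(), word.strip() if sep else None))
    return elements


def functional_words(elements: List[PatternElement]) -> List[str]:
    """Return the functional words of a pattern, in order of appearance."""
    return [e.word for e in elements if e.word is not None]


if __name__ == "__main__":
    print(parse_pattern("[IN/at, NP]"))
    print(functional_words(parse_pattern("[IN/by, NP]")))
```

Keeping the tag and the word form in separate fields makes it easy to group patterns either by structure alone ([IN, NP]) or by the specific functional word they contain.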
By parsing the English sentences with the Stanford Parser, PENN Treebank parse trees were generated. As a direct consequence, the English texts are annotated with PENN phrasal tags, as this is the tagging standard used by the Stanford Parser. In this annotation formalism, the functional words of the English texts can be considered to be the sentence tokens that have one of the following tags in the PENN POS tagset: CC (coordinating conjunction), DT (determiner), IN (preposition/subordinating conjunction), MD (modal), PRP (personal pronoun), PRP$ (possessive pronoun), RP (particle), TO (the word to), WDT (wh-determiner), WP (wh-pronoun), WP$ (possessive wh-pronoun), WRB (wh-adverb).

Here are some examples of English syntactic patterns:

• [NP, PRP, CC/and, NP]: the syntactic phrase having this span is made of two noun phrases (NP) linked by a personal pronoun (PRP) and a functional word, the coordinating conjunction and (in this specific order). The two NP phrases are not expanded because each of them has its own alignment, and thus their structure is given in another parallel syntactic sequence.

• [RB, JJ, CC/and, JJ]: the syntactic phrase having this structure contains two adjectives (JJ) linked by a functional word (the conjunction and) and preceded by an adverb (RB).

Following the same representation, the corresponding Romanian syntactic sequences are encoded in a similar format.

The Romanian syntactic trees of the English-Romanian Treebank were automatically constructed by means of a bottom-up tree generation algorithm guided by the word alignments of the corpus [5]. As a consequence, the annotations for the Romanian words preserve the MULTEXT-East word specifications of the corpus, as these data include the morphosyntactic details needed in any syntactic study, while the PENN Treebank phrasal tags are used for labeling the phrasal constituents.

As a direct consequence, the Romanian functional words are those tokens/words that in the MULTEXT-East tagset have MSD tags with the following prefixes: P- (pronoun), such as Pd- (demonstrative pronoun), Ps- (possessive pronoun) and Px- (reflexive pronoun), D- (determiner), T- (article), S- (adposition), C- (conjunction), Q- (particle).

The representation for the Romanian syntactic sequences with functional words is an ordered sequence of elements given in the following form:

[ {Phrasal-Tag}* MSD-Tag/FW {Phrasal-Tag}* ]

where FW denotes a functional word and MSD denotes the morphosyntactic descriptions encoded in the MULTEXT-East morphosyntactic specifications.

Here are some examples of Romanian syntactic patterns:

• [Di3-po---e/altor, NP]: the syntactic phrase given by this sequence contains a determiner (an MSD tag starting with D) followed by a noun phrase.

• [VP, Crssp/şi, Tsfs/a, NP]: the syntactic phrase whose structure is encoded in this pattern is made of a verb phrase (VP) and a noun phrase (NP) linked by a conjunction (a C- MSD tag) and an article (a T- MSD tag).

2.2 Linguistic Resource with Syntactic Patterns
Each resulting parallel sequence is stored in a database record with four fields (see Figure 1). The SynPhrase_En field stores the span of an English syntactic phrase, while the SynPhrase_RO field gives the span of the aligned Romanian syntactic phrase. The last two fields hold the PENN syntactic subtrees rooted at the aligned syntactic phrases: Treebank_EN gives the bracket representation of the subtree rooted at the English phrase, while Treebank_RO is the subtree corresponding to the Romanian phrase. Examples of records of this linguistic resource are listed in Table 1.

Table 1: Examples of English-Romanian Syntactic Patterns Together with Their Treebank Representations

Record 1
  Phrase_En:   [IN/as, NP]
  Phrase_RO:   [Rw/cât, mai/Rp, ADJP]
  Treebank_EN: [PP [IN Rsp/as] [NP [NP Afp/strict] [ADJP [RB Cs/as] [JJ Afp/possible]]]]
  Treebank_RO: [PP [Rw 14/cât] [Rp 15/mai] [ADJP [Afpfp-n 16/stricte] [ADJP [Rgp 17/posibil]]]]

Record 2
  Phrase_En:   [IN/at, NP]
  Phrase_RO:   [Spsa/la, NP]
  Treebank_EN: [PP [IN Sp/at] [NP [NP [DT Dd/the] [NN Ncns/end]] [PP [IN Sp/of] [NP [DT Dd/the] [JJ Afp/financial] [NN Ncns/year]]]]]
  Treebank_RO: [NP [Spsa 1/la] [NP [NP [Ncfsry 2/încheierea]] [NP [Ncmsoy 3/exerciţiului] [ADJP [Afpms-n 4/financiar]]]]]

Certain statistics about the translation of a particular English syntactic sequence into Romanian can be easily obtained from the constructed database table.

From the statistics illustrated in Figure 2, one can observe that the Romanian translation of the English syntactic pattern [IN/at, NP] does not change the order between the noun phrase and the preceding preposition, and replaces the preposition “at” with the Romanian preposition “la”. The preposition “by” from the English pattern [IN/by, NP] is translated equally often with the Romanian preposition “de” and with the Romanian prepositional collocation “de către”, while the preposition “for” from the English sequence [IN/for, NP] is translated with the Romanian preposition “pentru”.

3. CONCLUSIONS
Statistical Machine Translation systems that use syntactic information in the translation process must be trained with syntactic patterns that correspond to reciprocal translations in the languages of the MT system. Such training can help the translation not only with the structural differences between the translations but also with the reordering problems of the target sentence words [3].

Even if the lexical coverage of the corpus used, the Acquis Communautaire corpus, is not representative, an MT system can still benefit from the translations similar in structure and semantics that exist between the parallel sentences of the corpus.

Also, the meanings of some special words, such as functional words, can be easily explored by analysing the changes undergone during translation by the syntactic patterns that contain this kind of words. As it is constructed now, the resource focuses on the importance that functional words have in a translation process. But the syntactic patterns can also be generated in order to highlight other constructions, for example the polylexical units of a natural language phrase.

As future work, we intend to enlarge the bilingual treebank in order to permit the generation of a larger set of parallel syntactic patterns.

4. ACKNOWLEDGMENTS
The author M. Colhon has been funded for this research by the strategic grant POSDRU/89/1.5/S/61968, Project ID 61986 (2009), co-financed by the European Social Fund within the Sectorial Operational Program Human Resources Development 2007-2013.

5. REFERENCES
[1] V. Ambati, A. Lavie, and J. Carbonell. Extraction of syntactic translation models from parallel data using syntax from source and target languages. In MT Summit XII, 2009.
[2] J. G. Araùjo and H. M. Caseli. Alignment of Portuguese-English syntactic trees using part-of-speech filters. http://www.cs.famaf.unc.edu.ar/~laura/nlpw/nlpw/papers/Araujo_Caseli.pdf. Online; accessed 19-June-2012.
[3] A. Ceauşu. Rich morfo-syntactic description for factored machine translation with highly inflected languages as target. In Workshop on Machine Translation and Morphologically-rich Languages, University of Haifa, 2011.
[4] M. Colhon. A contrastive study of syntactic constituents in English and Romanian texts. In Proc. of the Workshop “Language Resources and Tools with Industrial Applications”, pages 11–20, 2011.
[5] M. Colhon. Language engineering for syntactic knowledge transfer (submitted). Computer Science and Information Systems Journal, 2012.
[6] D. Cristea and C. Forăscu. Linguistic resources and technologies for Romanian language. Computer Science Journal of Moldova, 14(1(40)), 2006.
[7] T. Erjavec. MULTEXT-East version 4: Multilingual morphosyntactic specifications, lexicons and corpora. In Proc. of LREC’10. ELRA, 2010.
[8] M. K. G. Wenniger and K. Sima’an. A toolkit for visualizing the coherence of tree-based reordering with word-alignments. In Proc. of the 5th MT-Marathon, pages 97–104, 2010.
[9] M. Galley, M. Hopkins, K. Knight, and D. Marcu. What’s in a translation rule? In Proc. of HLT-NAACL 2004, pages 273–280. ACL, Boston, USA, 2004.
[10] N. Ide, P. Bonhomme, and L. Romary. XCES: An XML-based encoding standard for linguistic corpora. In Proc. of the 2nd LREC, Paris: ELRA, 2000.
[11] R. Ion, R. Ceauşu, and D. Tufiş. Dependency-based phrase alignment. In Proc. of the 5th LREC, pages 1290–1293, 2006.
[12] D. Klein and C. D. Manning. Accurate unlexicalized parsing. In Proc. of the 41st Annual Meeting of ACL, pages 423–430, 2003.
[13] D. Tufiş and R. Ion. Parallel corpora, alignment technologies and further prospects in multilingual resources and technology infrastructure. In Proc. of the 4th SPED, 2007.
[14] K. Yamada and K. Knight. A syntax-based statistical translation model. In Proc. of ACL, pages 523–530, 2001.
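As a supplementary illustration of the statistics mentioned in Section 2.2, the sketch below counts how often each English pattern is paired with each Romanian pattern. It assumes the database records are available as (English pattern, Romanian pattern) string pairs. The function names and the toy record list are hypothetical: the pattern strings are copied from Table 1, but the counts are made up for the example and do not reproduce Figure 2.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def translation_counts(records: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count the Romanian patterns aligned with each English pattern."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for english, romanian in records:
        counts[english][romanian] += 1
    return counts


def relative_frequencies(counter: Counter) -> Dict[str, float]:
    """Turn the raw counts of one English pattern into relative frequencies."""
    total = sum(counter.values())
    return {pattern: n / total for pattern, n in counter.items()}


if __name__ == "__main__":
    # Toy records (illustrative only, not taken from the real database).
    records = [
        ("[IN/at, NP]", "[Spsa/la, NP]"),
        ("[IN/at, NP]", "[Spsa/la, NP]"),
        ("[IN/as, NP]", "[Rw/cât, mai/Rp, ADJP]"),
    ]
    counts = translation_counts(records)
    for english, counter in counts.items():
        print(english, relative_frequencies(counter))
```

Grouping by the full English pattern string keeps the functional word in the key, so [IN/at, NP] and [IN/by, NP] are tallied separately, as in the three panels of Figure 2.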