Keeping it Simple: Generating Phrase Structure Trees from a Hindi Dependency Treebank Himanshu Yadav† , Ashwini Vaidya‡ and Samar Husain‡ † Jawaharlal Nehru University, India ‡ Indian Institute of Technology Delhi, India yadavhimanshu059@gmail.com, ashwini.vaidya@gmail.com, samar@hss.iitd.ac.in Abstract Converting a treebank from one representation type to another poses sev- eral challenges [4] [3]. These challenges are contingent on (amongst other things) the information encoded in source representation and the informa- tion required in target representation. In this paper, we propose a conversion algorithm that converts the Hindi-Urdu Dependency Treebank (HUTB) to a Phrase Structure (PS) representation. In order to do this, we extract structural information (for projecting heads) as well as predicate-argument information from the dependency structure (DS). The resulting PS trees are relatively flat with few empty categories (ECs) and are very close to the DS trees in their syntactic content. Our algorithm generates ‘valid’ PS trees. The validity is based on certain metrics such as well-formedness, linearity, etc. These trees could be further transformed for theory specific analyses. 1 Introduction Most modern treebanks make use of either Dependency Structure (DS) or Phrase Structure (PS) representation to encode syntactic phenomenon. DS representation has been used more frequently for free word order languages like Czech, Hindi, Turkish, Russian etc. In addition, due to the development of efficient dependency parsing algorithms, use of dependency formalism for treebanking has become com- mon [10]. While DS representation offers several advantages, it is also known that certain types of structural constraints (e.g. c-command) that are based on con- stituent structures cannot be formalized in dependency structure. Such constraints are known to play an important role in accounting for various syntactic phenomena (e.g. reference resolution, syntactic islands, etc.) [11] (also see [6, 8]). In addi- tion, the PS representation lends itself naturally to capture the predictive nature of human sentence processing via left-corner phrase structure parsing algorithm [7]. It is therefore important to explore both these formalisms in order to investigate 123 their usefulness in language modeling and in building various applications. Since a treebanking task uses a single representation, one needs to transform the original representation to the desired representation. The complexity of a conversion algorithm will be contingent on (a) the infor- mation encoded in the source vs the information required in the target, (b) word order variation in the language, and (c) the choice of the target representation. These factors can lead to small or large divergence between the source and target representation. For example, Bhatt and Xia [4] required an intermediate layer in order to handle the information asymmetry between the source and target repre- sentation. Similarly, following Bhatt and Xia [4], Luu et al. [9] also introduced an additional intermediate layer to handle projectivity in their target PS. In this work, we propose a conversion algorithm1 that converts the DS tree in HUTB to the PS representation with the aim of maintaining ‘content of representation’ [12] between the source and target. The resulting trees are relatively flat with few empty categories (ECs), i.e. these PS trees are very close to the DS trees as far as syntactic information is concerned. Our paper is organized as follows: Section 2 discusses the salient properties of the PS representation scheme that is generated by the algorithm. In Section 3, we describe our conversion algorithm. Following this in section 4 we evaluate its performance. We conclude the paper in section 5 and discuss future directions. 2 The PS representation scheme Our proposed PS representation is relatively flat as it does not encode syntactic roles (like Subject, Object etc.) structurally. The PS trees also do not necessarily have binary branching or syntactically driven ECs. The flatness of the trees is a by-product of the following representation choices (a) No intermediate projection for heads, (b) No intermediate projection for dependents, (c) No binary branching. These choices, in fact, relate to the questions pertaining to head projection and dependent attachment raised by Xia and Palmer [14]. They point out three issues to be addressed by any conversion algorithm - (1) What kind of projection can a category (X) have? (2) How far should a dependent project before it attaches to its head? (3) To what position on a head’s projection chain2 , should a dependent’s projection attach? Regarding (1) and (2), as stated earlier, we do not have interme- diate projection for any head or dependent. Regarding (3), in our phrase structure, a dependent is always a sister of its head. This is illustrated in Figure 1. In addition, we do not have any structurally distinguished position for encoding syntactic roles. We capture this information using function tags like ‘–SUBJ’ for Subject, ‘-OBJ-1’ for direct object etc. These design considerations are motivated 1 The implemented algorithm can be downloaded from https://github.com/ yadavhimanshu059/Hindi_DS_to_PS_Conversion 2 A chain corresponds to all the intermediate projections of a category (X) upto its maximal pro- jection. 124 Figure 1: The choices taken in our algorithm regarding three questions raised by Xia and Palmer [14] by the aim of maintaining ‘content of representation’ [12] between the source and target. Interestingly, Collins et al. [5] have also argued for flat PS representation scheme for Czech, which is also a free word order language like Hindi. On the other hand, PS representation scheme for Hindi by Bhatt et al. [2] maintains fixed structural positions for core arguments etc. leading to multiple empty categories. They con- sistently assume a binary branching representation while maintaining a linear word order. It is easy to see that getting to a representation proposed by [2] is difficult because of the information asymmetry and the divergent analyses of various phe- nomena in DS and PS [3]. Of course, we are not claiming that a more detailed PS scheme for Hindi (e.g. Bhatt et al. [2]) will not be useful at all. Indeed, from a theoretical perspective, PS trees with elaborate empty categories might be very useful. In fact, we envisage our PS schema (that is closer to DS) to be a useful resource in order to reach such a representation. 3 DS to PS conversion algorithm As stated earlier, our conversion algorithm does not use any additional information other than the DS. We extract modifier-modified information (for projecting heads) as well as predicate-argument information from the DS. We illustrate our conversion algorithm with the example in Figure 2 below. The algorithm applies the following rules (i) rules for head-projection (ii) rules for argument-adjunct distinction (iii) rules for joining argument structure sub-trees (iv) rules for joining clauses in case of multi-clause constructions. 125 Figure 2: DS to PS conversion steps for example 1. In Step-1, H*1, H*2 etc. are the computed heads from the DS. In Step-2, these heads are projected recursively to form phrase fragments. Dashed boxes show a set of phrase fragments. In Step- 3, we extract dependency relation information from DS and use the set of phrase fragments from Step-2 as input to form argument structure subtrees. These subtrees are combined in Step-4 to get the clausal PS tree. 126 Illustrative Example (see Figure 2) (1) Mohan=kii baRI bahan shyaama aajkal Mohan.N.m.sg=Gen elder.adj sister.N.M.sg Shyama.N.f.sg nowadays.adv puranii kitaabein paRh-rahii-hai old.adj.f.pl book.f.pl read-prog.f.sg-be ‘Mohan’s elder sister Shyama is reading old books these days’ We use the sentence shown in Example 1 to illustrate our conversion process. In step-1 (see Figure 2), we use the modifier-modified relation from the DS tree to compute the heads of phrases in the corresponding PS tree. The modified element takes the head position in phrases. The heads are marked with H* in Figure 2. In step-2, the heads extracted from DS are recursively projected to generate phrase- fragments. We use the part-of-speech (POS) tag information from DS at this level. Thus, a head with POS tag X projects to XP in phrase structure. The modifiers of a head in DS take the non-head positions in a minimal phrase fragment. In step-3, we use the dependency relation label information of the modifiers of the predicate in DS in order to make the distinction between arguments and ad- juncts. The following dependency relation to syntactic role mappings were used: k1 – subject, k2/k2p – object, etc. Previous attempts to extract argument infor- mation from the DS have shown that this mapping can correspond to predicate- argument information [1, 13]. Once the argument-adjunct distinction has been made, we then create sub-trees by adjoining the phrase fragments corresponding to NP-SUBJ, NP-OBJ etc. This means that syntactic roles such as subject and object are not encoded structurally; rather this information is encoded through the phrase tag. We get the complete PS tree by joining sub-trees as shown in step-4. If a sentence is composed of two or more clauses, each clause is formed using the same procedure shown as in Figure 2. We then adjoin these clauses at a clause joining step. 3.1 Basic Argument Structure Dependency Label Phrase-Tag k1 -SUBJ k2, k2p -OBJ-1 k4, k2g -OBJ-2 k4a -SUBJ-Dat pk1 -SUBJ jk1 -J-SUBJ k2s -OBJ-Comp Table 1: Dependency label to Phrase-Tag mapping As stated earlier, we use phrase tags instead of structural encoding for argument structure information. In a minimal clause, arguments licensed by verb are repre- 127 sented using phrase tags like -SUBJ, -OBJ-1, -OBJ-2 etc (see Table 1). It makes our resulting trees flatter and non-binary branched, see Figure 3. Previous attempts to extract argument information from the DS have shown that this mapping can correspond to predicate-argument information [1, 13]. Figure 3: A DS and the derived PS showing basic argument structure for the sen- tence ‘Mr. X was playing cards with friends at home.’ 3.2 Non-finite Clause The Hindi-Urdu Treebank (HUTB) makes a distinction between finite verbs and non-finite verbs. We maintain predicate-argument information in a minimal non- finite clause using phrase tags -SUBJ, -OBJ-1, -SUBJ-Dat etc. Although we do not posit any structural preferences for a non-finite clause, we encode a distinction between finite and non-finite clause using phrase names S and S-NF/S-NN respec- tively. An example non-finite clause is shown in Figure 4. Our system does not add empty categories for shared subjects. If there is any empty node in DS, then, we maintain it in resulting PS. 128 Figure 4: A DS and the derived PS showing a non-finite clause for the sentence ‘After assuring the people they cleared up the traffic.’ 3.3 Relative Clause Similar to the DS representation, relative clauses directly modify the nouns in the PS tree (see Figure 5). We use the modifier-modified relation information from DS to identify the head of the relation clause. The S node in the relative clause tree appears as a sister of the head noun in the PS tree. 3.4 Multi-clause Coordination Two clauses at the same level are adjoined to the coordinating conjunction such that both the clauses take the coordinating conjunction as their head (Figure 6). 129 Figure 5: A DS and the derived PS showing an embedded relative clause for the sentence ‘Jagdish’s younger brother who studies in Delhi came yesterday’ Figure 6: A DS and the derived PS showing a multi-clause coordination for the sentence ‘For the first time literacy has increased significantly and the number of illiterates is going down.’ 130 4 Evaluation We evaluate the validity of the automatically generated PS trees using various met- rics. This was done to ensure that the information transfer from DS to PS was in- deed successful. In addition we wanted to test the structural consistency of the PS representation. This is necessary because we do not have gold PS trees, therefore we have to use quantitative methods to analyze the performance of our system. The evaluation was performed on Hindi Dependency Treebank (ver-0.05) which consists of 18853 sentences with an average of 20 words per sentence.3 We used the following metrics (1) Well-formedness: the tree has one root, is acyclic and has no duplicated sub-trees (2) Linear order: terminal nodes in the tree match the linear word order in sentence (3) Argument representation: All the arguments given in DS are reflected in the PS output (4) Clausal Correspondence: whether all verbal predicates in DS have their corresponding S nodes in PS. The first two tests validate the correctness of the constituent structure while the latter two tests analyze the performance of our algorithm in extracting information from the DS and representing the same information on the PS. Well-formedness Linear Argument Clausal All order representation correspondence constraints 99.90% 99.78% 99.87% 99.96% 99.74% Table 2: System evaluation Table 2 shows the performance of our conversion algorithm on all the metrics. We find that we are able to convert the DS trees into PS trees with high accuracy. The high accuracy demonstrates that the conversion produces PS trees that can be reliable enough for other applications. We note that our system is able to success- fully convert nearly all types of constructions in the HUTB into PS. We also exper- imented with handling non-projectivity. We were successful in handling a majority of such dependencies. However the resulting PS structures were not always consis- tent with regards to the placement of the empty category. In addition, certain cases of inter-clausal non-projective dependencies were not being converted. In order to handle these issues consistently across the treebank, we chose to not handle non- projectivity in the current release of the implemented algorithm. We do plan to release an updated version that can handle non-projectivity in the near future. 5 Summary and Conclusion In this paper, we have proposed a DS to PS conversion algorithm for the HUTB. The goal of our system was to maintain a close correspondence between the syn- tactic information in DS and PS representations. This algorithm generates PS 3 We found 0.2% dependency trees in the treebank as either disjoint or cyclic which we ignored for the purposes of this evaluation. Non-projective trees were also not considered. 131 trees with high accuracy while avoiding an intermediate level of representation. We envisage that this resource will be immediately useful for NLP tools such as parsers and also for language modeling. Our system currently does not handle non-projectivity. In addition, we do not specifically handle complex predicates. We intend to handle these phenomena in the near future. References [1] Bharat R Ambati, Tejaswini Deoskar, and Mark Steedman. Hindi CCGbank: CCG treebank from the hindi dependency treebank. In Language Resources and Evaluation, 2016. [2] Rajesh Bhatt, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, and Fei Xia. A multi-representational and multi-layered treebank for hindi/urdu. In Proceedings of the Third Linguistic Annotation Workshop, pages 186–189. Association for Computational Linguistics, 2009. [3] Rajesh Bhatt, Owen Rambow, and Fei Xia. Linguistic phenomena, analy- ses, and representations: Understanding conversion between treebanks. In IJCNLP, pages 1234–1242, 2011. [4] Rajesh Bhatt and Fei Xia. Challenges in converting between treebanks: a case study from the hutb. In META-RESEARCH Workshop on Advanced Treebank- ing, page 53, 2012. [5] Michael Collins, Lance Ramshaw, Jan Hajič, and Christoph Tillmann. A statistical parser for czech. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computational Linguistics, pages 505–512. Association for Computational Linguistics, 1999. [6] Felix Engelmann, Lena A. Jäger, and Shravan Vasishth. The determinants of retrieval interference in dependency resolution: Review and computational modeling. Manuscript submitted, 2016. [7] John T Hale. What a rational parser would do. Cognitive Science, 35(3):399– 443, 2011. [8] Philip Hofmeister and Ivan A Sag. Cognitive constraints and island effects. Language, 86(2):366, 2010. [9] Alex Luu, Sophia A Malamud, and Nianwen Xue. Conversion of synta- grus dependency treebank into penn treebank style. In Proceedings of the Tenth Linguistic Annotation Workshop, pages 16–21. Association for Com- putational Linguistics, 2016. 132 [10] Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajic, Christopher D Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), 2016. [11] C. Phillips, M. W. Wagers, and E. F. Lau. Grammatical illusions and selective fallibility in real-time language comprehension. In J. Runner (ed.), Experi- ments at the Interfaces, Syntax and Semantics, volume 37, pages 153–186. 2011. [12] Owen Rambow. The simple truth about dependency and phrase structure rep- resentations: An opinion piece. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 337–340. Association for Computational Linguistics, 2010. [13] Ashwini Vaidya, Jinho D Choi, Martha Palmer, and Bhuvana Narasimhan. Analysis of the hindi proposition bank using dependency structure. In Pro- ceedings of the 5th Linguistic Annotation Workshop, pages 21–29. Associa- tion for Computational Linguistics, 2011. [14] Fei Xia and Martha Palmer. Converting dependency structures to phrase structures. In Proceedings of the first international conference on Human lan- guage technology research, pages 1–5. Association for Computational Lin- guistics, 2001. 133