J. Yaghob (Ed.): ITAT 2015, pp. 23–29
Charles University in Prague, Prague, 2015

Free or Fixed Word Order: What Can Treebanks Reveal?

Vladislav Kuboň and Markéta Lopatková

Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
Malostranské nám. 25, Prague 1, 118 00, Czech Republic
{lopatkova,vk}@ufal.mff.cuni.cz

Abstract: The paper describes an ongoing experiment that attempts to quantify word-order properties of three Indo-European languages (Czech, English and German). The statistics are collected from the syntactically annotated treebanks available for all three languages. The treebanks are searched by means of the universal query tool PML-TQ. The search concentrates on the mutual order of a verb and its complements (subject, object(s)), and the statistics are calculated for all permutations of the three elements. The results for all three languages are compared, and a measure expressing the degree of word order freedom is suggested in the final section of the paper. This study constitutes a motivation for formal modeling of natural language processing methods.

1 Introduction

General linguistics (see esp. [1, 2]) studies natural languages from the point of view of similarities and differences in their syntactic structure, their development and historical changes, as well as from the point of view of language functions. It studies the mutual influence of particular groups of features and, on the basis of similarities of language phenomena, it introduces the so-called language typology [3, 4]. The freedom or, on the other hand, strictness of word order definitely belongs among the most important phenomena. General linguistics, for example, studies whether and how a particular language handles the order of words in sentences: whether the word order is determined primarily by syntactic categories (e.g., in English a noun or a pronoun without any additional morphological marking, located in the first position of the sentence, represents a subject), or whether syntactic categories are primarily determined by other means than word order (for example, in Slavic languages the subject tends to be a noun in the nominative case, regardless of its position in the sentence).

Particular natural languages cannot, of course, be strictly characterized by a single feature (for example word order); they are typically categorized into individual language types by a mixture of characteristic features. If we concentrate on word order, we study the prevalent order of the verb and its main complements. Indo-European languages are thus characterized as SVO languages (SVO reflecting the order Subject, Verb, Object). English and other languages with a fixed word order typically follow this order of words in declarative sentences; although Czech, Russian and other Slavic languages are languages with a high degree of word order freedom, they still stick to the same word order in a typical (unmarked) sentence. As for the VSO-type languages, their representatives can be found among Semitic (Arabic, classical Hebrew) or Celtic languages, while (some) Amazonian languages belong to the OSV type. These characteristics, which are traditionally mentioned in classical textbooks of general linguistics [5], have been established on the basis of excerption and careful examination by many linguists.

Today, when we have at our disposal a wide range of linguistic data resources for tens of languages, we can easily confirm their conclusions (or enhance them by quantitative clues). This paper represents one of the steps in this direction.

The Institute of Formal and Applied Linguistics at Charles University in Prague has established a repository for linguistic data and resources, LINDAT/CLARIN (https://lindat.mff.cuni.cz/cs/). This repository enables experiments with syntactically annotated corpora, so-called treebanks, for several tens of languages. Wherever the license agreements allow it, the corpora are transformed into a common format, which enables, after a very short period of getting acquainted with each particular treebank, a comfortable search and analysis of the data from a particular language. The HamleDT (HArmonized Multi-LanguagE Dependency Treebank, http://ufal.mff.cuni.cz/hamledt) project has already managed to transform more than 30 treebanks from all over the world [6] into a common format.

In this pilot study we concentrate on three Indo-European languages which differ substantially in the degree of word order freedom: Czech, German and English. We investigate their typological properties on the basis of the Prague Dependency Treebank [7], the English part of the Prague Czech-English Dependency Treebank [8] and the German treebank TIGER [9], by means of the PML-TQ (PML Tree Query) interface [10], which provides access to the treebanks from HamleDT (https://lindat.mff.cuni.cz/services/pmltq/).

2 Setup of the Experiment

The analysis of syntactic properties of natural languages constitutes one of our long-term goals. The phenomenon of word order has been at the center of our investigations for a very long time. Our previous investigations concentrated both on studying individual properties of languages with a higher degree of word-order freedom (e.g., non-projective constructions (long-distance dependencies) [11]) and on the endeavor to find general measures that would characterize concrete natural languages more precisely with regard to the degree of their word-order freedom (see, e.g., [12]).

The experiment presented in this paper continues in the same direction. It is driven by the endeavor to find an objective way to compare natural languages from the point of view of the degree of their word-order freedom. While the previous experiments concentrated on a more formal approach, this one builds upon a thorough analysis of available data resources. Let us briefly introduce them in the subsequent subsections.

When investigating syntactic properties of natural languages, the discussion very often concentrates on individual phenomena, their properties and their influence on the order of words. The mere presence of some phenomenon (or its more detailed properties) is, of course, important and definitely influences the degree of word-order freedom, but this kind of investigation cannot be complete without also stating the quantitative properties of the given phenomenon. A linguistically interesting but marginal phenomenon does not tell us as much as a basic phenomenon occurring relatively frequently. This observation constitutes the basis of our current experiment. In order to capture the quantitative characteristics of a natural language, let us take a representative sample of its syntactically annotated data and calculate the distribution of individual types of word order for the three main syntactic components: subject, predicate and object. It is obvious that the freer the word order of a given language is, the more evenly the word order types are going to be distributed.

2.1 Available Treebanks

An extensive quantitative analysis of the same linguistic phenomenon for different languages would not be feasible without a common platform which makes it possible to compare various data resources from the same point of view. Thanks to the HamleDT initiative (HArmonized Multi-LanguagE Dependency Treebank, http://ufal.mff.cuni.cz/hamledt) it is now possible to compare the data from more than 30 languages in a uniform way [6].

The HamleDT family of treebanks is based on the dependency framework and technology developed for the Prague Dependency Treebank (PDT, http://ufal.mff.cuni.cz/pdt3.0) [7], i.e., a large syntactically annotated corpus of Czech. Here we focus on the so-called analytical layer, i.e., the layer describing surface sentence structure (relevant for studying word order properties). The framework and its language independence were verified within (the English part of) the Prague Czech-English Dependency Treebank (PCEDT, http://ufal.mff.cuni.cz/pcedt2.0/cs/index.html) [8]. Within this project, the syntactically annotated Penn Treebank (https://www.cis.upenn.edu/~treebank) [13] was automatically transformed from the original phrase-structure trees into the dependency annotation. (This dependency-based surface annotation then served as a basis for the deep syntactic dependency-based annotation of English; however, as for Czech, only the surface structure is of interest for the studied phenomenon of word order.) Based on this experience, the HamleDT initiative goes further: syntactically annotated corpora for different languages are collected and transferred into the common format. Here we make use of the TIGER corpus (http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html) for German [9], a corpus with native phrase-structure annotation enriched with information about the head of each phrase (and thus also bearing information on dependencies). Figures 2, 6 and 7 show sample trees for Czech, English and German, respectively, and Table 1 summarizes the size of these corpora.

corpus   # preds   lang      type        genre
PDT      79,283    Czech     manual      news
PCEDT    51,048    English   automatic   economy
TIGER    36,326    German    automatic   news

Table 1: Overview of all three treebanks (# preds is the number of predicates in the given corpus)

2.2 HamleDT and PML-TQ Tree Query

For searching the data, we exploit the PML-TQ search tool (https://lindat.mff.cuni.cz/services/pmltq/), which has been primarily designed for processing the PDT data. PML-TQ is a query language and search engine designed for querying annotated linguistic data [10]; it allows users to formulate complex queries on richly annotated linguistic data.

With the treebanks in the common data format, the PML-TQ framework makes it possible to analyse the data in a uniform way. The following sample query returns trees with an intransitive predicate verb (in a main clause), i.e., a Pred node with an Sb node and no Obj nodes among its dependent nodes, where Sb follows Pred. The filter on the last line (>> for $n0.lemma give $1, count()) outputs a table listing the verb lemmas with this marked word order together with the number of their occurrences in the corpus, see also Figure 1.

    a-node $n0 :=
    [ afun = "Pred",
      child a-node $n1 :=
      [ afun = "Sb", $n1.ord > $n0.ord ],
      0x child a-node
      [ afun = "Obj" ]]
    >> for $n0.lemma give $1, count()
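The selection made by the sample query, and the word order labels used in the following sections, can be mimicked outside PML-TQ. The sketch below is only an illustration under stated assumptions: the `Node` record, the helper names (`children`, `order_type`, `vs_lemma_counts`) and the three toy sentences are hypothetical and are not part of PML-TQ or of any of the treebanks; only the attributes `ord`, `afun` and `lemma` are taken from the query above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    ord: int               # linear position of the word in the sentence
    afun: str              # analytical function: "Pred", "Sb", "Obj", ...
    lemma: str
    parent: Optional[int]  # ord of the governing node; None for the root


def children(sentence: List[Node], head: Node) -> List[Node]:
    """Nodes of the sentence that depend directly on `head`."""
    return [n for n in sentence if n.parent == head.ord]


def order_type(sentence: List[Node], pred: Node) -> str:
    """Label such as 'SV', 'SVO' or 'OOVS', read off the linear order."""
    letters = {"Sb": "S", "Obj": "O"}
    parts = [(pred.ord, "V")]
    parts += [(c.ord, letters[c.afun])
              for c in children(sentence, pred) if c.afun in letters]
    return "".join(letter for _, letter in sorted(parts))


def vs_lemma_counts(corpus: List[List[Node]]) -> Counter:
    """Count predicate lemmas that have a following subject and no object."""
    counts: Counter = Counter()
    for sentence in corpus:
        for pred in (n for n in sentence if n.afun == "Pred"):
            kids = children(sentence, pred)
            if any(k.afun == "Obj" for k in kids):
                continue  # plays the role of: 0x child a-node [ afun = "Obj" ]
            for _sb in (k for k in kids if k.afun == "Sb" and k.ord > pred.ord):
                counts[pred.lemma] += 1  # one hit per matching (Pred, Sb) pair
    return counts


# Hypothetical, hand-made toy annotation (not taken from any treebank).
TOY_CORPUS = [
    # "... says one Midwestern lobbyist": subject after the predicate
    [Node(1, "Pred", "say", None), Node(2, "Atr", "one", 4),
     Node(3, "Atr", "Midwestern", 4), Node(4, "Sb", "lobbyist", 1)],
    # "everything repeated": subject before the predicate
    [Node(1, "Sb", "everything", 2), Node(2, "Pred", "repeat", None)],
    # "Plán mu předložil velvyslanec": two objects, predicate, subject
    [Node(1, "Obj", "plán", 3), Node(2, "Obj", "on", 3),
     Node(3, "Pred", "předložit", None), Node(4, "Sb", "velvyslanec", 3)],
]

if __name__ == "__main__":
    print(vs_lemma_counts(TOY_CORPUS))
    for sentence in TOY_CORPUS:
        pred = next(n for n in sentence if n.afun == "Pred")
        print(order_type(sentence, pred))
```

Counting one hit per (Pred, Sb) pair follows the shape of the query, which binds both `$n0` and `$n1`; whether the search engine reports coordinated subjects in exactly this way is not something this sketch can confirm.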
Figure 1: Visualization of the PML-TQ query

3 Analysis of Data

Let us now look at the syntactic typology of the natural languages under investigation. We are going to take into account especially the mutual position of subject, predicate and direct object. After a thorough investigation of the ways in which indirect objects are annotated in all three corpora, we have decided to limit ourselves, at least at this stage of our research, to basic structures and to extract and analyse only sentences without overly complicated or mutually interlocked phenomena. Namely, we focus on sentences with the following properties:

• The predicate under scrutiny belongs to the main clause (as, e.g., in the sentence Jsou[Pred] vám nejasná některá ustanovení daňových zákonů? ‘Are[Pred] certain provisions of the tax laws unclear to you?’, see the dependency tree in Fig. 2); i.e., we do not analyse the word order of dependent clauses;

• We analyse only non-prepositional subjects and objects (compare, e.g., with the sentence V 2180 městech a obcích žije (na 2.6 milionu obyvatel)[Sb]; ‘There are (about 2.6 million inhabitants)[Sb] living in 2,180 towns and villages;’, see Fig. 3);

• Sentences may contain coordinated predicates (as, e.g., the predicates následoval and opakovalo in the corpus sentence Vzápětí následoval[Pred] další regulační stupeň a vše se opakovalo[Pred]. ‘The next level of regulation immediately followed[Pred] and everything repeated[Pred] again.’, see Fig. 4).

However, sentences with common (coordinated) subjects or objects are not taken into account; thus sentences such as Koupelna[Sb] nebo teplá voda[Sb] nejsou trvale k dispozici. ‘A bathroom[Sb] or hot water supply[Sb] are not permanently available.’ (see Fig. 5) are not counted in the tables. (Including coordination phenomena in all their complexity would require much more robust queries in any dependency framework; we have therefore decided to disregard this type of sentence altogether.)

Figure 2: Sample Czech dependency tree from PDT

Figure 3: Sample Czech dependency tree from PDT with prepositional subject (excluded from the resulting tables)

3.1 Czech

The highest-quality syntactically annotated Czech data can be found in the Prague Dependency Treebank; in fact, it is the only corpus we work with that has been manually annotated and thoroughly tested for annotation consistency. The texts of PDT belong mostly to the journalistic genre: the corpus consists of newspaper texts and (to a limited extent) of texts from a popular science journal.

Figure 4: Sample Czech dependency tree from PDT with coordinated predicates (included in the resulting tables)

Figure 5: Sample Czech dependency tree from PDT with coordinated subject (excluded from the resulting tables)

Table 2 summarizes the number of sentences with intransitive verbs in main clauses in PDT with respect to the word order positions of Sb and Pred. We can see that the marked word order (verb preceding its subject) is quite common in Czech. (In our setting we do not check the part of speech of the predicate; however, out of the 79,283 sentences conforming to the properties mentioned above, only 329 have a predicate other than a verbal one.)

Word order type   Number   %
SV                16,909    56.66
VS                12,932    43.34
Total             29,841   100.00

Table 2: Sentences with intransitive verbs

Table 3 displays the distribution of individual combinations of a subject, a predicate and a single object. It is not surprising that the unmarked, intuitively "most natural" word order type, SVO, accounts for only slightly more than half of the cases. The relatively high degree of word order freedom is thus also supported quantitatively.

Word order type   Number   %
SVO               11,158    52.42
SOV                1,533     7.20
VSO                1,936     9.10
VOS                2,136    10.04
OVS                4,001    18.80
OSV                  521     2.45
Total             21,285   100.00

Table 3: Sentences with a single object

Even more interesting (and also supporting the claim that the word order freedom of Czech is relatively high) are the results for sentences with two objects, summarized in Table 4. The distribution is even flatter than in Table 3, with all types represented (even those starting with two objects, see the following example) and none of them exceeding 30%.

Plán mu v úterý předložil velvyslanec USA v Chorvatsku Peter Galbraith. ‘The plan was presented to him on Tuesday by the US ambassador to Croatia, Peter Galbraith.’

Word order type   Number   %
SVOO                293     26.95
SOVO                223     20.52
SOOV                 33      3.04
VSOO                 45      4.14
VOSO                 16      1.47
VOOS                 27      2.48
OSVO                 70      6.44
OSOV                 10      0.92
OOSV                 15      1.38
OOVS                124     11.41
OVSO                 78      7.18
OVOS                153     14.08
Total             1,087    100.00

Table 4: Sentences with two objects

3.2 English

The statistics concerning the distribution of word-order types for English have been calculated on the English part of the Prague Czech-English Dependency Treebank (PCEDT). This corpus actually contains the same set of sentences as the Wall Street Journal section of the Penn Treebank (see above for references; the Czech part was created as a translation of the original English sentences), but unlike its predecessor, its syntactic structure has been annotated using dependency trees. As mentioned above, the transformation on the surface syntactic layer was fully automatic, which has of course affected the quality of the annotation.

Figure 6: Sample English dependency tree from PCEDT

The statistics of the different types of word order have been collected in the same manner as in the previous subsection. We have also applied the same filters as for the Czech sentences from PDT. Table 5 contains the data for sentences with intransitive verbs. Only 40 sentences have a predicate other than a verbal one.

Word order type   Number   %
SV                28,236    96.91
VS                   900     3.09
Total             29,136   100.00

Table 5: English sentences with intransitive verbs

As we can see, the strict word order of English manifests itself in a vast majority of sentences having the prototypical word order, with the subject followed by the predicate. The examples of the opposite word order include sentences containing direct speech with the following pattern:

"It's just a matter of time before the tide turns," says one Midwestern lobbyist.

Out of the 900 sentences with the reversed word order, as many as 630 contained the predicate to say and 121 the predicate to be. Each of the other verbs involved in these constructions was represented fewer than 10 times. In total, 23 verbs appear in these sentences at least twice; 16 of them can be classified as verbs of communication (verba dicendi), which amounts to 678 occurrences out of 822, i.e., 82.5% of all occurrences of verbs with at least two hits in the corpus.

The results for sentences containing one object also strongly confirm the fact that the order Subject - Predicate - Object (SVO) is practically the only acceptable order in standard sentences. The remaining types of word order in Table 6 (representing only 1.06% of the sentences in the corpus) actually turned out to be annotation errors in a vast majority of cases (esp. auxiliary verbs, which have quite often been incorrectly annotated as Objects).

Word order type   Number   %
SVO               12,481    98.94
SOV                   77     0.61
VSO                    9     0.07
VOS                    1     0.01
OVS                    2     0.02
OSV                   45     0.36
Total             12,615   100.00

Table 6: English sentences with a single object

It turns out that for English it does not make sense to construct a table analogous to Table 4 for sentences with more than one object. The automatic annotation of PCEDT is, unfortunately, biased in what should be considered an Object (in the original Penn Treebank annotation, the verbal complements are labeled just as noun or prepositional phrases (NPs and PPs), with no distinction between Objects and Adverbials). As a consequence, adverbial constructions are very often incorrectly annotated as Objects, and thus it is impossible to rely on this distinction (the analysis shows that the numbers would be highly misleading).

3.3 German

German has more constraints on word order than Czech and fewer than English; therefore it constitutes a very natural candidate for our experiment. On top of that, there are also numerous high-quality resources which can be exploited. We have used the German treebank conforming to the HamleDT initiative, which is located in the LINDAT repository (https://lindat.mff.cuni.cz/services/pmltq/hamledt_dt_de/).

Figure 7: Sample German dependency tree from HamleDT

The statistics for German were collected in the same way and with the same constraints as the Czech and English ones. The statistics for German sentences with intransitive predicates are presented in Table 7.

Word order type   Number   %
SV                 6,165    56.67
VS                 4,713    43.33
Total             10,878   100.00

Table 7: German sentences with intransitive verbs

The fact that SV and VS sentences are almost equally frequent is quite surprising. The fact that SV represents the typical word order in declarative sentences, while VS is typical for interrogative ones, provides an obvious explanation. Unfortunately, this explanation does not cover all occurrences, because the analyzed corpus (consisting mostly of newspaper texts) contains only a very small proportion of interrogative sentences. We have not investigated the reason for the surprisingly high number of VS sentences, but it definitely constitutes a very interesting topic for further research. The same also holds for the results in Table 8, where we have likewise found a relatively high number of sentences with the word order of an interrogative sentence.

Word order type   Number   %
SVO               10,662    50.31
SOV                  193     0.91
VSO                7,425    35.04
VOS                  690     3.26
OVS                2,206    10.41
OSV                   15     0.07
Total             21,191   100.00

Table 8: German sentences with a single object

For German, too, we have not investigated the sentences with two or more objects, due to annotation inconsistencies.

4 Proposed Measure of Word Order Freedom

The statistics presented in the previous section confirm the well-known fact that Czech has the highest degree of word order freedom of the three languages investigated in our experiment. This fact is also reflected in Figure 8, which compares the results for sentences with one object for all three languages.

Figure 8: Comparison of results (percentage of each word order type, SVO, SOV, VSO, VOS, OVS and OSV, for English, German and Czech)

Let us now try to suggest a formula which might allow us to express the degree of word order freedom in a more precise way. Intuitively, the freer the word order is, the more evenly the results should be distributed over all six word order types. The stricter the word order, the more distant the values are from the ideal (equal) distribution. This leads directly to the application of a least squares method:

$$M = \frac{1}{6}\sqrt{\sum_{i=1}^{6} (V_i - Av)^2}, \qquad (1)$$

where M is the proposed measure, V_i is the percentage of the i-th word order type and Av is the average percentage per word order type (i.e., 100/6). For the three languages in our experiment we then get the following values:

• Czech: 6.82

• German: 19.20

• English: 36.79

These values seem to correspond to the intuitive feeling that the word order of English is really strongly fixed, while German and Czech have a freer word order, with Czech having the highest degree of word order freedom. If we express the results as percentages of the value for an absolutely fixed word order (i.e., one of the word order types accounts for 100% and all others do not appear at all), we get the following results:

• Czech: 18.31%

• German: 51.52%

• English: 98.73%

5 Conclusions

The experiment described in this paper brought several interesting results which may be taken as a basis for further experiments. First of all, it shows that the endeavor to unify the annotation schemes used for various treebanks in the HamleDT project provides new opportunities for linguistic research. The treebank data can now be studied in relation to other treebanks using a common search tool, yielding results which do not depend on the peculiarities of individual annotation schemes.

These new opportunities have been demonstrated on a small-scale experiment involving three languages (Czech, German and English). We have managed to extract quantitative clues confirming the linguistic hypothesis about the degree of word order freedom of all three languages under consideration. The main advantage of our approach is that our research is based on a large number of sentences of each language and thus provides a representative sample of the actual language usage in a given genre. Contrary to theoretical linguistic research, our approach does not concentrate upon marginal (but definitely linguistically interesting) phenomena; it is based upon the real language captured in the treebanks.

In the future we would like to continue the research in two directions. One will be the obvious endeavor to collect the statistics for more languages; the second will be a more subtle treatment of linguistic phenomena appearing in treebanks, e.g., an investigation that also includes subordinate clauses or interrogative sentences.

Grant support

This paper exploits language data developed and/or distributed in the frame of the project MŠMT ČR LINDAT/CLARIN (project LM2010013).

References

[1] Saussure, F.: Course in general linguistics. Open Court, La Salle, Illinois (1983) (prepared by C. Bally and A. Sechehaye, translated by R. Harris)
[2] Saussure, F.: Kurs obecné lingvistiky. Academia, Praha (1989) (translated by F. Čermák)
[3] Sapir, E.: Language. An introduction to the study of speech. Harcourt, Brace and Company, New York (1921) (http://www.gutenberg.org/files/12629/12629-h/12629-h.htm)
[4] Skalička, V.: Vývoj jazyka. Soubor statí. Státní pedagogické nakladatelství, Praha (1960)
[5] Čermák, F.: Jazyk a jazykověda. Pražská imaginace, Praha (1994)
[6] Zeman, D., Dušek, O., Mareček, D., Popel, M., Ramasamy, L., Štěpánek, J., Žabokrtský, Z., Hajič, J.: HamleDT: Harmonized multi-language dependency treebank. Language Resources and Evaluation 48 (2014) 601–637
[7] Hajič, J., Panevová, J., Hajičová, E., Sgall, P., Pajas, P., Štěpánek, J., Havelka, J., Mikulová, M., Žabokrtský, Z., Ševčíková-Razímová, M.: Prague Dependency Treebank 2.0. LDC, Philadelphia, PA, USA (2006)
[8] Hajič, J., Hajičová, E., Panevová, J., Sgall, P., Bojar, O., Cinková, S., Fučíková, E., Mikulová, M., Pajas, P., Popelka, J., Semecký, J., Šindlerová, J., Štěpánek, J., Toman, J., Urešová, Z., Žabokrtský, Z.: Announcing Prague Czech-English Dependency Treebank 2.0. In: Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), Istanbul, Turkey, ELRA, European Language Resources Association (2012) 3153–3160
[9] Brants, S., Dipper, S., Eisenberg, P., Hansen, S., König, E., Lezius, W., Rohrer, C., Smith, G., Uszkoreit, H.: TIGER: Linguistic Interpretation of a German Corpus. Journal of Language and Computation (2004) 597–620
[10] Pajas, P., Štěpánek, J.: System for querying syntactically annotated corpora. In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations, Suntec, Singapore, Association for Computational Linguistics (2009) 33–36
[11] Holan, T., Kuboň, V., Oliva, K., Plátek, M.: On complexity of word order. Les grammaires de dépendance – Traitement automatique des langues (TAL) 41 (2000) 273–300
[12] Kuboň, V., Lopatková, M., Plátek, M.: On formalization of word order properties. In: Gelbukh, A. (ed.): Computational Linguistics and Intelligent Text Processing, CICLing 2012, Theoretical Computer Science and General Issues, volume 7181 of LNCS, Springer-Verlag, Berlin / Heidelberg (2012) 130–141
[13] Marcus, M. P., Santorini, B., Marcinkiewicz, M. A.: Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19 (1993)
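As a supplement to Section 4, the measure (1) can be recomputed from the single-object percentages in Tables 3, 8 and 6. The sketch below is not the authors' implementation; the function and variable names are hypothetical. It evaluates two readings of the normalisation, because they give different numbers: with the factor 1/6 outside the square root, as in (1), and with 1/6 under the square root (the root mean square deviation). On the tabulated percentages, the second reading reproduces the reported values for German (19.20) and English (36.79), while the first reproduces the reported value for Czech (6.82).

```python
import math
from typing import Dict

# Percentages copied from Tables 3, 8 and 6 (sentences with a single object).
SINGLE_OBJECT: Dict[str, Dict[str, float]] = {
    "Czech":   {"SVO": 52.42, "SOV": 7.20, "VSO": 9.10,
                "VOS": 10.04, "OVS": 18.80, "OSV": 2.45},
    "German":  {"SVO": 50.31, "SOV": 0.91, "VSO": 35.04,
                "VOS": 3.26, "OVS": 10.41, "OSV": 0.07},
    "English": {"SVO": 98.94, "SOV": 0.61, "VSO": 0.07,
                "VOS": 0.01, "OVS": 0.02, "OSV": 0.36},
}

# Hypothetical language with an absolutely fixed word order.
FIXED: Dict[str, float] = {"SVO": 100.0, "SOV": 0.0, "VSO": 0.0,
                           "VOS": 0.0, "OVS": 0.0, "OSV": 0.0}


def squared_deviations(dist: Dict[str, float]) -> float:
    """Sum of (V_i - Av)^2 with Av = 100 / number of word order types."""
    av = 100.0 / len(dist)
    return sum((v - av) ** 2 for v in dist.values())


def measure_outside_root(dist: Dict[str, float]) -> float:
    """Reading 1: (1/n) * sqrt(sum of squared deviations)."""
    return math.sqrt(squared_deviations(dist)) / len(dist)


def measure_rms(dist: Dict[str, float]) -> float:
    """Reading 2: sqrt((1/n) * sum of squared deviations)."""
    return math.sqrt(squared_deviations(dist) / len(dist))


if __name__ == "__main__":
    for language, dist in SINGLE_OBJECT.items():
        outside = measure_outside_root(dist)
        rms = measure_rms(dist)
        share = 100.0 * rms / measure_rms(FIXED)  # relative to a fixed order
        print(f"{language:8s} outside={outside:6.2f} rms={rms:6.2f} "
              f"rms share of fixed={share:6.2f}%")
```

Printing both readings side by side makes it easy to check which normalisation a given reported value corresponds to before comparing further languages.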