<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Searching for a Measure of Word Order Freedom</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vladislav Kuboň</string-name>
          <email>vk@ufal.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markéta Lopatková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomáš Hercig</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science and Engineering, Faculty of Applied Sciences, University of West Bohemia</institution>
          ,
          <addr-line>Univerzitní 8, 306 14 Plzeň</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NTIS-New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia</institution>
          ,
          <addr-line>Technická 8, 306 14 Plzeň</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1649</volume>
      <fpage>11</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>This paper compares various means of measuring word order freedom, applied to data from syntactically annotated corpora for 23 languages. The corpora are part of the HamleDT project; the word order statistics are relative frequencies of all word order combinations of subject, predicate and object, both in main and subordinated clauses. The measures include Euclidean distance, max-min distance, entropy and cosine similarity. The differences among the measures are discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The question of the different features of natural languages has
been engrossing theoretical linguists for hundreds of years.
They have been studying various language characteristics
and classifying natural languages according to their
properties, giving rise to language typology, see esp. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], to mention also the Czech tradition. These
investigations led to a system of four basic language types,
namely isolating, agglutinative, inflectional and
polysynthetic languages.
      </p>
      <p>
        Theoretical linguists have introduced an extensive list
of relevant language features, a summary can be found,
e.g., in the World Atlas of Language Structures (WALS)
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We will focus on one particular phenomenon, the word order
of natural languages. While the classification of languages
cannot be based upon a single phenomenon, word order
characteristics seem to belong among the important
features both for theoretical research and for practical natural
language applications.
      </p>
      <p>
        Languages are typically classified according to the
degree of word order freedom into (more or less) fixed word
order and free word order languages. The former type is
often exemplified by English, where a word order
position encodes a syntactic function (e.g., the first noun in
an indicative sentence, prototypically having the function
of subject, is followed by a predicative verb and a noun
with the object function); this property typically
correlates with under-developed inflection. The latter type can be
exemplified by Czech, where a syntactic function is
encoded by morphological case marking [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and the word order
expresses the information structure.
      </p>
      <p>From the practical point of view, the freedom of word
order correlates to a great extent with the parsing difficulty
of a particular natural language (a language with more
fixed word order is typically easier to parse than a
language containing, e.g., non-projective constructions). On
top of that, modern unsupervised methods of natural
language processing might also profit from investigations of
the kind we present in this paper. If researchers
had exact information about the properties of a
language which they want to process using unsupervised
methods, this knowledge might help them to choose an
adequate processing method and/or to properly set its
parameters.</p>
      <p>The examination of natural language typology has
traditionally been based upon a systematic observation of
linguistic material. However, linguistic research is in a
completely different position now: linguistic observations can
be based on large amounts of language data stored in
corpora which have been growing not only in size but also in
complexity of annotation during the last decade.</p>
      <p>
        Moreover, several attempts to propose a unified
annotation scheme – let us mention at least Stanford
Dependencies and Stanford Universal Dependencies [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ],1
Google Universal Tags [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Universal Dependencies [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],2
– make it possible to use existing corpora for different
languages.
      </p>
      <p>
        In this paper we exploit the annotation developed in
the framework of the HamleDT project (Harmonized
Multi-Language Dependency Treebank [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]).3
      </p>
      <p>
        We have already presented a study where we focused
on the word order properties of the HamleDT treebanks and the
ranking of languages – we used a simple max-min distance
based on the distribution of sentences among all variants of
the word order. Here we re-calculate the results of the
experiments described in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] using standard measures like
Euclidean distance, entropy, and cosine similarity.
      </p>
      <p>In the remaining sections of the paper we first
introduce the data and tools used for the experiment;
section 3 describes the setup of the experiment, section
4 presents the results, and the final section discusses the
conclusions and possible directions for future work.
1http://nlp.stanford.edu/software/stanford-dependencies.shtml
2http://universaldependencies.org/
3https://ufal.mff.cuni.cz/hamledt
</p>
      <p>
        HamleDT (Harmonized Multi-Language Dependency
Treebank, [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ])4 is a compilation of existing dependency
treebanks (or dependency conversions of other treebanks),
transformed so that they all conform to the same
annotation style. These treebanks, as well as the searching tools, are
available through LINDAT/CLARIN,5 a repository for linguistic data and
resources.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Corpora</title>
      <p>HamleDT integrates corpora for several tens of languages.
Wherever license agreements permit, the
corpora are transformed into a common data and annotation
format, which enables a user – after a very short period
of getting acquainted with each particular treebank – to
comfortably search and analyze the data of a particular
language.</p>
      <p>
        The HamleDT family of treebanks is based on the
dependency framework and technology developed for the
Prague Dependency Treebank (PDT),6 i.e., a large
syntactically annotated corpus for the Czech language [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Here
we focus on the so-called analytical layer, describing the
surface sentence structure (relevant for studying word order
properties). Unfortunately, due to various technical and
licensing restrictions, it was not possible to use all the
treebanks contained in HamleDT. Thus our effort focuses on
23 treebanks with available annotation on this syntactic
layer, which still represent a wide variety of languages
having various word-order properties.
      </p>
      <p>As an example, Figure 1 shows three dependency
representations for an English sentence in the HamleDT
format.7 Tables 1 and 2 provide an overview of the languages
and the size of the corpora examined in our experiment.
</p>
    </sec>
    <sec id="sec-3">
      <title>Querying Tool</title>
      <p>
        Using a common annotation framework
for multiple treebanks has another very useful consequence
– instead of developing tailor-made searching tools, we can
apply a common tool to all the treebanks we are analyzing. In
the case of HamleDT, we can use the PML-TQ [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] search
tool,8 originally developed for processing the data from
PDT.
      </p>
      <p>Having the treebanks in the common data format and
annotation scheme, the PML-TQ framework makes it
possible to analyze the data in a uniform way. A typical user
4https://ufal.mff.cuni.cz/hamledt
5https://lindat.mff.cuni.cz/
6http://ufal.mff.cuni.cz/pdt3.0
7Data of each treebank in HamleDT are distributed in three
annotation schemes – (a) the transformation of the treebank to the praguian
style (used in PDT; leftmost in Figure 1), (b) the original annotation
format of the given treebank (or its dependency transformation in case of
non-dependency treebanks; in the middle of Figure 1), and (c) the
transformation of the treebank to the Universal Dependencies style (rightmost
in the figure).</p>
      <p>8https://lindat.mff.cuni.cz/services/pmltq/
interested in monolingual data can use PML-TQ in an
interactive way. Such an approach would, of course, not work
for our set of 23 treebanks; therefore, we have used the
command-line interface which PML-TQ also provides. This
interface makes it possible to create scripts that process a
specified set of treebanks automatically.</p>
      <p>Let us now give an example of a PML-TQ query used
in our analysis. It counts sentences having an SVO word
order in the main clause.</p>
      <p>a-node $p :=
  [ depth() = "1", id ~ "prague",
    afun = "Pred", tag ~ "^V",
    1x a-node [ afun = "Sb" ],
    1x a-node [ afun = "Obj" ],
    a-node [ afun = "Sb", ord &lt; $p.ord ],
    a-node [ afun = "Obj", ord &gt; $p.ord ] ];
&gt;&gt; give count()</p>
      <p>The query searches data annotated in the praguian style
(id ~ "prague") for sentences containing verbs (tag ~
"^V") with the analytical function of a predicate (afun
= "Pred") at the depth of one level below the
technical root of the tree (depth() = "1"); i.e., this query
focuses on the word order in main clauses, excluding
coordinated predicates and disregarding also subordinate
clauses. There must be exactly one subject and one
object directly depending on the predicate (for the subject:
1x a-node [afun = "Sb"]), the subject must precede
the verb (afun = "Sb", ord &lt; $p.ord), and the object
must follow it (afun = "Obj", ord &gt; $p.ord). The
result of the query is the count of such sentences (&gt;&gt;
give count()). The visualization of the PML-TQ query
can be found in Figure 2.</p>
      <sec id="sec-3-1">
        <title>The Experiment</title>
        <p>In order to avoid possible bias caused by a combination
of too many language phenomena in complicated
sentences, we have decided to exclude from our experiment all
sentences containing coordinated predicates, subjects or
objects. The phenomenon of coordination is to some
extent “orthogonal” to that of word order (especially in
dependency-based approaches to language description);
thus the results might have been negatively influenced if
coordination of verbs or coordination of their direct
dependents were allowed.</p>
        <p>In this experiment, we have focused on “full”
structures, i.e., sentences with a core syntactic structure
consisting of subject, predicate and object. We have created
several queries aiming at a thorough investigation of the
phenomenon of the mutual position of these syntactic units.</p>
        <p>
The results presented in Tables 1 and 2 may serve as a basis
for an estimation of the degree of word order freedom of
individual languages. The typical mutual position of subject,
predicate and object constitutes one of the basic
typological characteristics of a natural language. The problem
of measuring the degree of word order freedom cannot,
of course, be reduced only to this phenomenon; the freedom
of word order of other sentence elements should
probably be taken into account as well. Our decision to base
the estimation on just these three constituents has several
reasons. First of all, these constituents are present in the
vast majority of sentences; they constitute a certain
backbone of every sentence. Second, they are also relatively
easily identifiable in all treebanks, regardless of the
original annotation schemes. Although the HamleDT treebanks
provide uniform annotation, the transformation of less
frequent language phenomena from various languages may
yield results which are not as uniform as we would like
them to be. Last but not least, the three main constituents
are located at the top of the dependency tree, so they do not
require overly complex queries which might bring additional
bias into the experiment.</p>
        <p>The number we are looking for would describe how far
the distribution of individual variants of word order is from
the ideal, absolutely free order of the main constituents.
It is obvious that the languages with the highest degree
of word order freedom would demonstrate the most uniform
distribution of sentences among all variants of the word
order described in our tables, i.e., the frequency of all
variants of the order of subject, verb and object would be equal to
16.66% (let us denote this “ideal vector” as Y)9. The
difference between the actual distribution vector of each
particular language from our table and this ideal vector then
expresses the difference in word order freedom.</p>
        <p>There are several measures which we can use for these
9The equal frequency of all variants actually means that there are
probably no grammatical rules which would prefer any order of
constituents over the others.</p>
        <sec id="sec-3-1-1">
          <title>Number of Number of SVO sentences matches (%) OVS (%)</title>
          <p>calculations.10 Let us start with the simplest one, the
maxmin measure (marked as M1 in the subsequent text):
M1 = max xi − min xi</p>
          <p>i∈1,..n i∈1,..n</p>
          <p>This measure has the value 0 for the ideal vector. The
higher its value, the more fixed the word order
of that particular language seems to be. The main advantage of this
measure is its ability to reduce the n-dimensional vectors to
two values only (leaving aside the four other values),
thus enabling a simple graphical representation. The same
property also constitutes the greatest disadvantage of this
measure, i.e., its insensitivity to subtle differences in the
distribution of values among the four variants which are
actually left aside.</p>
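          <p>To make the measure concrete, the max-min computation can be sketched in a few lines of Python (our illustration, not part of the paper; the frequency vector of the hypothetical SVO-dominant language is invented):</p>

```python
# Max-min measure M1 over the six relative frequencies of the
# orders of subject, verb and object (SVO, SOV, VSO, VOS, OSV, OVS).
def max_min(freqs):
    # M1 = max(x_i) - min(x_i); equals 0 for the ideal uniform vector
    return max(freqs) - min(freqs)

# The "ideal vector" Y: all six variants equally frequent.
Y = [1.0 / 6.0] * 6

# Hypothetical strongly fixed-order language: almost everything is SVO.
fixed = [0.90, 0.04, 0.02, 0.02, 0.01, 0.01]

print(max_min(Y))                # 0.0
print(round(max_min(fixed), 2))  # 0.89
```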
          <p>The second measure is the standard Euclidean distance
between two vectors (marked as M2 in the subsequent
text):</p>
          <p>M2 = ‖X − Y‖ = √( ∑_{i=1}^{n} (x_i − y_i)² )</p>
          <p>In this formula, the symbol X represents the
distribution of word order variants of a given language and Y is
the “ideal vector” with equal distribution of frequencies.
The Euclidean distance is more precise than M1 because it
reflects all six variants of the word order.</p>
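          <p>The same illustrative setup as above (an invented SVO-dominant frequency vector, not taken from the tables) makes the Euclidean distance easy to sketch:</p>

```python
import math

# Euclidean distance M2 = ||X - Y|| between a language's frequency
# vector X and the ideal uniform vector Y.
def euclidean(X, Y):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(X, Y)))

Y = [1.0 / 6.0] * 6                           # ideal uniform vector
fixed = [0.90, 0.04, 0.02, 0.02, 0.01, 0.01]  # hypothetical SVO-dominant language

print(euclidean(Y, Y))                # 0.0
print(round(euclidean(fixed, Y), 3))  # 0.804: far from the ideal vector
```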
          <p>The third measure, very often used for measuring the
similarity of two vectors in information retrieval, is the
cosine similarity (marked as M3 in the subsequent text):</p>
          <p>M3 = ∑_{i=1}^{n} (x_i × y_i) / ( √(∑_{i=1}^{n} x_i²) × √(∑_{i=1}^{n} y_i²) )</p>
          <p>Actually, because both M2 and M3 represent a
distance between two vectors (although measured by
different means and providing numerically different values),
their results with regard to the estimation of word order
freedom are very similar, the main difference being
the direction of the numerical values of M2 and M3: while the
values of M2 decrease with growing word order
freedom, the values of M3 increase.</p>
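          <p>A minimal Python sketch of the cosine similarity under the same invented frequency vectors (ours, for illustration only):</p>

```python
import math

# Cosine similarity M3 between frequency vectors X and Y:
# dot product divided by the product of the vector norms.
def cosine(X, Y):
    dot = sum(x * y for x, y in zip(X, Y))
    norm_x = math.sqrt(sum(x * x for x in X))
    norm_y = math.sqrt(sum(y * y for y in Y))
    return dot / (norm_x * norm_y)

Y = [1.0 / 6.0] * 6                           # ideal uniform vector
fixed = [0.90, 0.04, 0.02, 0.02, 0.01, 0.01]  # hypothetical SVO-dominant language

print(round(cosine(Y, Y), 6))      # 1.0: identical direction
print(round(cosine(fixed, Y), 3))  # 0.453: grows toward 1.0 as word order gets freer
```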
          <p>Because M2 and M3 are in principle quite similar, let us
use one more measure which is also quite natural
and widely used, namely the entropy (marked as M4 in the
subsequent text):</p>
          <p>M4 = − ∑_{i=1}^{n} P(x_i) ln P(x_i)</p>
          <p>The values P(x_i) are the probabilities of the individual word
order variants. Because we do not know the exact
probabilities, we are going to use their relative frequencies from
Tables 1 and 2. The entropy is maximal for the uniform
distribution of relative frequencies (probabilities) and minimal for
an absolutely deterministic system which has only one
acceptable type of word order. In other words, the higher
the entropy for a particular language, the higher its
degree of word order freedom.
10Actually, the word measure should not be understood as a strictly
mathematical term. The cosine similarity is not a measure in the
mathematical sense; it does not have all the properties required by the mathematical
definition of the term measure.</p>
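          <p>The entropy computation can likewise be sketched in Python (our illustration; the frequency vector of the hypothetical fixed-order language is invented):</p>

```python
import math

# Entropy M4 = -sum p_i * ln(p_i), with relative frequencies used
# in place of the unknown true probabilities; zero-frequency
# variants contribute nothing and are skipped.
def entropy(freqs):
    return -sum(p * math.log(p) for p in freqs if p > 0)

Y = [1.0 / 6.0] * 6                           # ideal uniform vector
fixed = [0.90, 0.04, 0.02, 0.02, 0.01, 0.01]  # hypothetical SVO-dominant language

print(round(entropy(Y), 4))      # 1.7918 = ln(6), the maximum for six variants
print(round(entropy(fixed), 4))  # 0.4722: lower entropy, more fixed word order
```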
          <p>The results obtained for all four measures are presented in
Tables 3 and 4. In order to enable an easier comparison
of individual measures, we also present the rank
of all languages with regard to their degree of word order
freedom for each particular measure. The ranks then show
how similar the measures are. In both tables, the order of
languages corresponds to their rank according to the M1
measure applied to main clauses.</p>
          <p>Table 3 shows the rank of individual languages with
regard to the word order freedom calculated according to all
measures mentioned above. It was calculated on main
clauses with “full” structure, i.e., main clauses
containing both a subject and (exactly one) object. Although the
rank according to each individual measure differs (with the
exception of M2 and M3 which provide, not surprisingly,
an identical rank), the highest rank always
belongs to the two classical languages, Latin and Ancient
Greek, closely followed by three Slavic languages
(Slovak, Slovenian and Czech) and German. The languages
with the most fixed word order are, according to all
measures, English, Japanese, Estonian and Hindi.</p>
          <p>When comparing the two tables, we may notice some
substantial differences in the word order freedom rank for
main and subordinated clauses. We may identify two
distinctive groups of languages which exhibit a relatively large
rank shift. The languages with a substantially higher degree
of word order freedom in subordinated clauses are
Arabic, Catalan and Estonian. The languages with the exactly
opposite property are Bengali, German and Dutch. In the case of
Dutch we may recall the famous examples of phenomena
exceeding the expressive power of context-free languages,
namely subordinated clauses such as ...dat Jan Piet de
kinderen zag helpen zwemmen (...that Jan saw Piet help
the children swim), where Dutch syntax requires a very
strict order of words. Also in German, the word order in
subordinated clauses follows much stricter rules than in
main ones. In this respect, the results obtained through
our experiment correlate with the syntactic rules of the
language.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Final Remarks and Conclusion</title>
        <p>Although the results presented in this paper support to
a relatively great extent the intuitive comprehension of
the notion of word order freedom of “big” European
languages, there are at least two aspects of our experiment
which are, in our opinion, quite interesting. The
first one is the fact that our experiment is based solely
on data publicly available in syntactically annotated
corpora. Thanks to this fact the experiment does not require
knowledge of, or even familiarity with, all the
languages under investigation. On the other hand, some of
the corpora contained in the HamleDT set are too small to
constitute a reliable source of information about the
properties of a given language. However, this obstacle can be
easily overcome in the future with the growing size and
number of treebanks available under a common annotation
scheme.</p>
        <p>The second interesting aspect is the comparison of the
measures, which give in principle very similar results and thus
support the claim that the phenomenon of word order
freedom may be quantified by practically any reasonably
selected measure. In other words, it is not necessary to
develop any specialized measures just for this particular
purpose; it is enough to use well-known ones, such
as the Euclidean distance or entropy.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Grant support</title>
      <p>The work on this project was partially supported by
the LINDAT/CLARIN project of the Ministry of
Education, Youth and Sports of the Czech Republic (project
LM2015071).</p>
      <p>This work was also supported by the project LO1506 of
the Czech Ministry of Education, Youth and Sports and by
Grant No. SGS-2016-018 Data and Software Engineering
for Advanced Applications.</p>
      <p>This work has been using language resources and tools
developed and/or stored and/or distributed by the
LINDAT/CLARIN project of the Ministry of Education, Youth
and Sports of the Czech Republic (project LM2015071).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Saussure</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          : Course in General Linguistics. Open Court, La Salle,
          Illinois
          (
          <year>1983</year>
          )
          (prepared by C. Bally and A. Sechehaye, translated by R. Harris).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Sapir</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Language. An Introduction to the Study of Speech</article-title>
          . Harcourt, Brace and company, New York (
          <year>1921</year>
          )
          (http://www.gutenberg.org/files/12629/12629-h/12629-h.htm).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Skalička</surname>
          </string-name>
          , V.:
          <article-title>Vývoj jazyka</article-title>
          .
          <source>Soubor statí. Státní pedagogické nakladatelství</source>
          ,
          <source>Praha</source>
          (
          <year>1960</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Dryer</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haspelmath</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The World Atlas of Language Structures Online</article-title>
          . Max Planck Institute for Evolutionary Anthropology, Leipzig (
          <year>2005</year>
          -2013) Available online at http://wals.info, accessed on 2015-06-28.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Futrell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mahowald</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
          </string-name>
          , E.:
          <article-title>Quantifying Word Order Freedom in Dependency Corpora</article-title>
          .
          <source>In: Proceedings of the International Conference on Dependency Linguistics (Depling</source>
          <year>2015</year>
          ), Uppsala, Sweden, Uppsala University (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>de Marneffe</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacCartney</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , C.D.:
          <article-title>Generating typed dependency parses from phrase structure parses</article-title>
          .
          <source>In: Proceedings of LREC</source>
          <year>2006</year>
          .
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>de Marneffe</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>The Stanford typed dependencies representation</article-title>
          .
          <source>In: COLING Workshop on Cross-framework and Cross-domain Parser Evaluation</source>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>de Marneffe</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dozat</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silveira</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haverinen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nivre</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Universal Stanford Dependencies:
          <article-title>A cross-linguistic typology</article-title>
          .
          <source>In: Proceedings of LREC</source>
          <year>2014</year>
          .
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nivre</surname>
          </string-name>
          , J.:
          <article-title>Characterizing the errors of datadriven dependency parsing models</article-title>
          .
          <source>In: Proceedings of EMNLP-CoNLL</source>
          <year>2007</year>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nivre</surname>
            , J., de Marneffe,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ginter</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajič</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petrov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pyysalo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silveira</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsarfaty</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Universal dependencies v1: A multilingual treebank collection</article-title>
          .
          <source>In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016)</source>
          , Portorož, Slovenia, European Language Resources Association (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Zeman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dušek</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mareček</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramasamy</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štěpánek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Žabokrtský</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajič</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>HamleDT: Harmonized multi-language dependency treebank</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>48</volume>
          (
          <year>2014</year>
          )
          <fpage>601</fpage>
          -
          <lpage>637</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Kuboň</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopatková</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mírovský</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Analysis of Word Order in Multiple Treebanks</article-title>
          .
          <source>In: Proceedings of CICLing 2016</source>
          . LNCS, Berlin / Heidelberg, Springer-Verlag (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Lopatková</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuboň</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Free or Fixed Word Order: What can Treebanks Reveal?</article-title>
          In Yaghob, J., ed.:
          <source>Information Technologies - Applications and Theory</source>
          , Prague, Charles University in Prague (
          <year>2015</year>
          )
          <fpage>23</fpage>
          -
          <lpage>29</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Kuboň</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopatková</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Word-order analysis based upon treebank data</article-title>
          . In
          <string-name>
            <surname>Sidorov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galicia-Haro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , eds.:
          <source>MICAI 2015: Advances in Artificial Intelligence and Soft Computing, Part I</source>
          . Volume
          <volume>9413</volume>
          ., Berlin / Heidelberg, Springer (
          <year>2015</year>
          )
          <fpage>47</fpage>
          -
          <lpage>58</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Bejček</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajičová</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajič</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jínová</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kettnerová</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolářová</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikulová</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mírovský</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nedoluzhko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panevová</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poláková</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ševčíková</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štěpánek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zikánová</surname>
            ,
            <given-names>Š.</given-names>
          </string-name>
          :
          <source>Prague Dependency Treebank 3.0</source>
          . Charles University in Prague, MFF, ÚFAL, Prague (
          <year>2013</year>
          ) (http://ufal.mff.cuni.cz/pdt3.0/).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Pajas</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Štěpánek</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>System for Querying Syntactically Annotated Corpora</article-title>
          .
          <source>In: Proceedings of the ACL-IJCNLP 2009 Software Demonstrations</source>
          , Suntec, Singapore, Association for Computational Linguistics (
          <year>2009</year>
          )
          <fpage>33</fpage>
          -
          <lpage>36</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>