J. Yaghob (Ed.): ITAT 2015 pp. 23–29
Charles University in Prague, Prague, 2015


                  Free or Fixed Word Order: What Can Treebanks Reveal?

                                             Vladislav Kuboň and Markéta Lopatková

            Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics
                                      Malostranské nám. 25, Prague 1, 118 00, Czech Republic
                                              {lopatkova,vk}@ufal.mff.cuni.cz

Abstract: The paper describes an ongoing experiment that attempts to quantify word-order properties of three Indo-European languages (Czech, English and German). The statistics are collected from the syntactically annotated treebanks available for all three languages. The treebanks are searched by means of the universal query tool PML-TQ. The search concentrates on the mutual order of a verb and its complements (subject, object(s)), and the statistics are calculated for all permutations of the three elements. The results for all three languages are compared, and a measure expressing the degree of word order freedom is suggested in the final section of the paper.
   This study constitutes a motivation for formal modeling of natural language processing methods.

1    Introduction

General linguistics (see esp. [1, 2]) studies natural languages from the point of view of similarities and differences in their syntactic structure, their development and historical changes, as well as from the point of view of language functions. It studies the mutual influence of particular groups of features and, on the basis of similarities of language phenomena, it introduces the so-called language typology [3, 4]. The freedom or, on the other hand, strictness of the word order definitely belongs among the most important phenomena. General linguistics, for example, studies whether and how a particular language handles the order of words in sentences: whether the word order is determined primarily by syntactic categories (e.g., a noun or a pronoun without any additional morphological marking, located in the first sentential position, represents a subject in English), or whether syntactic categories are primarily determined by means other than word order (for example, in Slavic languages, the subject tends to be a noun in the nominative case, regardless of its position in the sentence).
   Particular natural languages cannot, of course, be strictly characterized by a single feature (for example word order); they are typically categorized into individual language types by a mixture of characteristic features. If we concentrate on word order, we study the prevalent order of the verb and its main complements; Indo-European languages are thus characterized as SVO languages (SVO reflecting the order Subject, Verb, Object). English and other languages with a fixed word order typically follow this order of words in declarative sentences; although Czech, Russian and other Slavic languages are the so-called languages with a high degree of word order freedom, they still stick to the same order of words in a typical (unmarked) sentence. As for the VSO-type languages, their representatives can be found among Semitic (Arabic, Classical Hebrew) or Celtic languages, while (some) Amazonian languages belong to the OSV type. These characteristics, which are traditionally mentioned in classical textbooks of general linguistics [5], have been specified on the basis of excerption and careful examination by many linguists.
   Today, when we have at our disposal a wide range of linguistic data resources for tens of languages, we can easily confirm their conclusions (or enhance them by quantitative clues). This paper represents one of the steps in this direction.
   The Institute of Formal and Applied Linguistics at Charles University in Prague has established a repository for linguistic data and resources, LINDAT/CLARIN (footnote 1). This repository enables experiments with syntactically annotated corpora, so-called treebanks, for several tens of languages. Wherever the license agreements permit, the corpora are transformed into a common format, which enables, after a very short period of getting acquainted with each particular treebank, a comfortable search and analysis of the data from a particular language. The HamleDT (footnote 2) (HArmonized Multi-LanguagE Dependency Treebank) project has already managed to transform more than 30 treebanks from all over the world [6] into a common format.
   In this pilot study we concentrate on three Indo-European languages which differ substantially in the degree of word order freedom: Czech, German and English. We investigate their typological properties on the basis of the Prague Dependency Treebank [7], the English part of the Prague Czech-English Dependency Treebank [8] and the German treebank TIGER [9] by means of the interface of PML-TQ Tree Query [10], which enables access to the treebanks from HamleDT (footnote 3).

   1 https://lindat.mff.cuni.cz/cs/
   2 http://ufal.mff.cuni.cz/hamledt
   3 https://lindat.mff.cuni.cz/services/pmltq/

2    Setup of the Experiment

The analysis of syntactic properties of natural languages constitutes one of our long-term goals. The phenomenon of word order has been at the center of our investigations


for a very long time. Our previous investigations concentrated both on studying individual properties of languages with a higher degree of word-order freedom (e.g., non-projective constructions, i.e., long-distance dependencies [11]) and on the endeavor to find general measures that characterize concrete natural languages more precisely with regard to the degree of their word-order freedom (see, e.g., [12]).
   The experiment presented in this paper continues in the same direction. It is driven by the endeavor to find an objective way to compare natural languages from the point of view of the degree of their word-order freedom. While the previous experiments concentrated on a more formal approach, this one builds upon a thorough analysis of available data resources. Let us briefly introduce them in the subsequent subsections.
   When investigating syntactic properties of natural languages, the discussion very often concentrates on individual phenomena, their properties and their influence on the order of words. The mere presence of some phenomenon (or its more detailed properties) is, of course, important and definitely influences the degree of word-order freedom, but this kind of investigation cannot be complete without also stating the quantitative properties of the given phenomenon. A linguistically interesting but marginal phenomenon does not tell us as much as a basic phenomenon occurring relatively frequently. This observation constitutes the basis of our current experiment. In order to capture the quantitative characteristics of a natural language, let us take a representative sample of its syntactically annotated data and calculate the distribution of individual types of word order for the three main syntactic components: subject, predicate and object. It is obvious that the freer the word order of a given language, the more evenly these types are going to be distributed.

2.1    Available Treebanks

The extensive quantitative analysis of the same linguistic phenomenon for different languages would not be feasible without a common platform which makes it possible to compare various data resources from the same point of view. Thanks to the HamleDT initiative (footnote 4) (HArmonized Multi-LanguagE Dependency Treebank) it is now possible to compare the data from more than 30 languages in a uniform way [6].
   The HamleDT family of treebanks is based on the dependency framework and technology developed for the Prague Dependency Treebank (PDT) (footnote 5) [7], i.e., a large syntactically annotated corpus for the Czech language. Here we focus on the so-called analytical layer, i.e., the layer describing surface sentence structure (relevant for studying word order properties). The framework and its language independence were verified within (the English part of) the Prague Czech-English Dependency Treebank (PCEDT) (footnote 6) [8]: within this project, the syntactically annotated Penn Treebank (footnote 7) [13] was automatically transformed from the original phrase-structure trees into the dependency annotation (footnote 8). Based on this experience, the HamleDT initiative goes further: syntactically annotated corpora for different languages are collected and transferred into the common format. Here we make use of the TIGER corpus (footnote 9) for the German language [9], a corpus with native phrase-structure annotation enriched with information about the head of each phrase (and thus also bearing information on dependencies). Figures 2, 6 and 7 show sample trees for Czech, English and German, respectively, and Table 1 summarizes the size of these corpora.

   corpus    # preds    language    annotation    genre
   PDT       79,283     Czech       manual        news
   PCEDT     51,048     English     automatic     economy
   TIGER     36,326     German      automatic     news

Table 1: Overview of all three treebanks (# preds represents the number of predicates in the given corpus)

   4 http://ufal.mff.cuni.cz/hamledt
   5 http://ufal.mff.cuni.cz/pdt3.0
   6 http://ufal.mff.cuni.cz/pcedt2.0/cs/index.html
   7 https://www.cis.upenn.edu/~treebank
   8 This dependency-based surface annotation then served as a basis for deep syntactic dependency-based annotation of English; however, as for Czech, only surface structure is interesting for the studied phenomenon of word order.
   9 http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/tiger.html

2.2    HamleDT and PML-TQ Tree Query

For searching the data, we exploit the PML-TQ search tool (footnote 10), which was primarily designed for processing the PDT data. PML-TQ is a query language and search engine designed for querying annotated linguistic data [10]; it allows users to formulate complex queries on richly annotated linguistic data.

   10 https://lindat.mff.cuni.cz/services/pmltq/

   With the treebanks in the common data format, the PML-TQ framework makes it possible to analyse the data in a uniform way. The following sample query returns trees with an intransitive predicate verb (in a main clause), i.e., a Pred node with an Sb node and no Obj nodes among its dependent nodes, where Sb follows the Pred. The filter on the last line (>> for $n0.lemma give $1, count()) outputs a table listing the verb lemmas with this marked word order and the number of their occurrences in the corpus, see also Figure 1.

   a-node $n0 :=
   [ afun = "Pred",
       child a-node $n1 :=
       [ afun = "Sb", $n1.ord > $n0.ord ],
       0x child a-node
       [ afun = "Obj"]]
   >> for $n0.lemma give $1, count()
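The logic of the sample query can also be mimicked offline on any dependency data. The following Python sketch is only an illustration under stated assumptions: the Node record, its parent field and the toy sentence are hypothetical and reflect neither the PML-TQ API nor the actual treebank file format; only the attribute names afun, ord and lemma are taken from the query. Unlike the query, the sketch requires exactly one Sb child, which keeps the classification unambiguous.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical analytical-layer node (not the real PML data model)."""
    ord: int               # surface position in the sentence
    afun: str              # analytical function, e.g. "Pred", "Sb", "Obj"
    lemma: str
    parent: Optional[int]  # ord of the governing node, None for the root


def word_order_type(pred: Node, sentence: List[Node]) -> Optional[str]:
    """Return e.g. "SV", "VS", "SVO" or "OVS" for one predicate.

    Returns None when the predicate does not have exactly one Sb child.
    """
    children = [n for n in sentence if n.parent == pred.ord]
    subjects = [n for n in children if n.afun == "Sb"]
    objects = [n for n in children if n.afun == "Obj"]
    if len(subjects) != 1:
        return None
    labelled = [(pred.ord, "V"), (subjects[0].ord, "S")]
    labelled += [(o.ord, "O") for o in objects]
    # Sorting by surface position yields the word order pattern.
    return "".join(label for _, label in sorted(labelled))


def count_marked_intransitives(treebank: List[List[Node]]) -> Counter:
    """Count lemmas of Pred nodes with a following Sb and no Obj child."""
    counts: Counter = Counter()
    for sentence in treebank:
        for node in sentence:
            if node.afun == "Pred" and word_order_type(node, sentence) == "VS":
                counts[node.lemma] += 1
    return counts


# Toy sentence invented for illustration: "says one lobbyist"
toy = [
    Node(1, "Pred", "say", None),
    Node(2, "Atr", "one", 3),
    Node(3, "Sb", "lobbyist", 1),
]
print(count_marked_intransitives([toy]))  # Counter({'say': 1})
```

Counting all values returned by word_order_type instead of filtering for "VS" would give the full distribution of word order types for a corpus.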


Figure 1: Visualization of the PML-TQ query

3    Analysis of Data

Let us now look at the syntactic typology of the natural languages under investigation. We are going to take into account especially the mutual position of subject, predicate and direct object. After a thorough investigation of the ways in which indirect objects are annotated in all three corpora, we have decided to limit ourselves, at least in this stage of our research, to basic structures and to extract and analyse only sentences without overly complicated or mutually interlocked phenomena. Namely, we focus on sentences with the following properties:
   • A predicate under scrutiny belongs to the main clause
     (as e.g. in the sentence Jsou_Pred vám nejasná některá
     ustanovení daňových zákonů? ‘Are_Pred certain provisions
     of the tax laws unclear to you?’, see the dependency tree
     in Fig. 2); i.e., we do not analyse the word order of
     dependent clauses;
   • We analyse only non-prepositional subjects and objects
     (compare e.g. with the sentence V 2180 městech a obcích
     žije na 2.6 milionu obyvatel_Sb; ‘There are (about 2.6
     million inhabitants)_Sb living in 2 180 towns and
     villages;’, see Fig. 3);
   • Sentences may contain coordinated predicates (as, e.g.,
     the predicates následoval and opakovalo in the corpus
     sentence Vzápětí následoval_Pred další regulační stupeň
     a vše se opakovalo_Pred. ‘The next level of regulation
     immediately followed_Pred and everything repeated_Pred
     again.’, see Fig. 4);
     However, sentences with common (coordinated) subjects or
     objects are not taken into account (thus sentences such
     as KoupelnaSb nebo teplá vodaSb nejsou trvale k
     dispozici. ‘A bathroom_Sb or hot water supply_Sb are not
     permanently available.’, see Fig. 5, are not counted in
     the tables) (footnote 11).

Figure 2: Sample Czech dependency tree from PDT

Figure 3: Sample Czech dependency tree from PDT with prepositional subject (excluded from the resulting tables)

   11 Including coordination phenomena in all their complexity would require much more robust queries in any dependency framework; thus we have decided to disregard this type of sentence altogether.

3.1    Czech

The highest-quality syntactically annotated Czech data can be found in the Prague Dependency Treebank; in fact, it is the only corpus we work with that has been manually annotated and thoroughly tested for annotation consistency. The texts of PDT belong mostly to the journalistic genre: the corpus consists of newspaper texts and (to a limited extent) of texts from a popular science journal.
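The result tables in this section report, for each word order type, its number of occurrences and its share of the total. A minimal Python sketch of that arithmetic follows; the function name shares is hypothetical, and the counts are the SV and VS figures for Czech intransitive predicates as listed in Table 2.

```python
def shares(counts):
    """Map each word order type to its percentage of the total (two decimals)."""
    total = sum(counts.values())
    return {wo: round(100 * n / total, 2) for wo, n in counts.items()}


# Counts copied from Table 2 (Czech sentences with intransitive verbs).
czech_intransitive = {"SV": 16909, "VS": 12932}
print(shares(czech_intransitive))  # {'SV': 56.66, 'VS': 43.34}
```

The same function applies unchanged to the six-way and twelve-way tables for sentences with one or two objects.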


Figure 4: Sample Czech dependency tree from PDT with coordinated predicates (included in the resulting tables)

Figure 5: Sample Czech dependency tree from PDT with coordinated subject (excluded from the resulting tables)

   Table 2 summarizes the number of sentences with intransitive verbs in main clauses in PDT with respect to the word order positions of Sb and Pred. We can see that the marked word order (verb preceding its subject) is quite common in Czech (footnote 12).

   Word order type      Number          %
        SV               16,909      56.66
        VS               12,932      43.34
       Total             29,841     100.00

Table 2: Sentences with intransitive verbs

   12 In our setting, we did not check the part of speech of the predicate; however, out of the 79,283 sentences conforming to the properties mentioned above, only 329 have a predicate other than a verbal one.

   Table 3 displays the distribution of individual combinations of a subject, a predicate and a single object. It is not surprising that the unmarked (intuitively "most natural") word order type, SVO, accounts for only slightly more than half of the cases. The relatively high degree of word order freedom is thus also supported quantitatively.

   Word order type      Number          %
       SVO               11,158      52.42
       SOV                1,533       7.20
       VSO                1,936       9.10
       VOS                2,136      10.04
       OVS                4,001      18.80
       OSV                  521       2.45
       Total             21,285     100.00

Table 3: Sentences with a single object

   Even more interesting (and also supporting the claim that the word order freedom of Czech is relatively high) are the results for sentences with two objects. They are summarized in Table 4. The distribution is even flatter than in Table 3, with all types being represented (even those starting with two objects, see the following example) and none of them exceeding 30%.
   Plán mu v úterý předložil velvyslanec USA v Chorvatsku Peter Galbraith. ‘The plan was presented to him on Tuesday by the US ambassador to Croatia, Peter Galbraith.’

   Word order type      Number          %
       SVOO                 293      26.95
       SOVO                 223      20.52
       SOOV                  33       3.04
       VSOO                  45       4.14
       VOSO                  16       1.47
       VOOS                  27       2.48
       OSVO                  70       6.44
       OSOV                  10       0.92
       OOSV                  15       1.38
       OOVS                 124      11.41
       OVSO                  78       7.18
       OVOS                 153      14.08
       Total              1,087     100.00

Table 4: Sentences with two objects

3.2    English

The statistics concerning the distribution of word-order types for English have been calculated on the English part of the Prague Czech-English Dependency Treebank


(PCEDT). This corpus actually contains the same set of sentences as the Wall Street Journal section of the Penn Treebank (footnote 13; see above for references), but unlike its predecessor, its syntactic structure has been annotated using dependency trees. As mentioned above, the transformation on the surface syntactic layer was fully automatic, which has of course affected the quality of the annotation.

   13 The Czech part was created as a translation of the original English sentences.

Figure 6: Sample English dependency tree from PCEDT

   The statistics of different types of word order have been collected in the same manner as in the previous subsection. We have also applied the same filters as for the Czech sentences from PDT. Table 5 contains data for sentences with intransitive verbs. Only 40 sentences have a predicate other than a verbal one.

   Word order type      Number          %
        SV               28,236      96.91
        VS                  900       3.09
       Total             29,136     100.00

Table 5: English sentences with intransitive verbs

   As we can see, the strict word order of English manifests itself in a vast majority of sentences having the prototypical word order of the subject being followed by a predicate. The examples of the opposite word order include sentences containing direct speech with the following pattern:
   "It’s just a matter of time before the tide turns," says one Midwestern lobbyist.
Out of the 900 sentences with the reversed word order, as many as 630 contained the predicate to say and 121 the predicate to be. Each of the other verbs involved in these constructions was represented fewer than 10 times. In total, 23 verbs appear in these sentences at least twice; 16 of them can be classified as verbs of communication (verba dicendi), which accounts for 678 occurrences out of 822, i.e., 82.5% of all occurrences of verbs with at least two hits in the corpus.
   The results for sentences containing one object also strongly confirm the fact that the order Subject - Predicate - Object (SVO) is practically the only acceptable order in standard sentences. The remaining types of word order listed in Table 6 (representing only 1.06% of the sentences) actually turned out to be annotation errors in a vast majority of cases (esp. auxiliary verbs, which have quite often been incorrectly annotated as Objects).

   Word order type      Number          %
       SVO               12,481      98.94
       SOV                   77       0.61
       VSO                    9       0.07
       VOS                    1       0.01
       OVS                    2       0.02
       OSV                   45       0.36
       Total             12,615     100.00

Table 6: English sentences with a single object

   It turns out that for English it does not make sense to construct a table similar to Table 4 for sentences with more than one object. The automatic annotation of PCEDT is, unfortunately, biased in what should be considered an Object (in the original Penn Treebank annotation, the verbal complements are labeled just as noun or prepositional phrases (NPs and PPs), with no distinction between Objects and Adverbials). As a consequence, adverbial constructions are very often incorrectly annotated as Objects, and thus it is impossible to rely on this distinction (the analysis shows that the numbers would be highly misleading).

3.3    German

German has more constraints on word order than Czech and fewer than English; it therefore constitutes a very natural candidate for our experiment. On top of that, there are also numerous high-quality resources which can be exploited. We have used the German treebank conforming to the HamleDT initiative, which is located in the LINDAT repository (footnote 14).

   14 https://lindat.mff.cuni.cz/services/pmltq/hamledt_dt_de/

   The statistics for German were collected in the same way and with the same constraints as those for Czech and English. The statistics for German sentences with intransitive predicates are presented in Table 7.
   The almost equal number of sentences with the SV and VS word order types is quite surprising. The fact that SV represents the typical word order in declarative sentences, while VS is typical of interrogative ones, provides an obvious explanation. Unfortunately, this explanation does not


                                                               4    Proposed Measure of Word Order
                                                                    Freedom

                                                               The statistics presented in the previous section actually
                                                               confirm the well known fact that Czech has the highest
                                                               degree of word order freedom from all three languages in-
                                                               vestigated in our experiment. This fact is also reflected in
                                                               the chart 8 comparing the results for sentences with one
                                                               object for all three languages.


Figure 7: Sample German dependency tree from HamleDT


           Word order type    Number          %
                SV              6,165      56.67
                VS              4,713      43.33
               Total           10,878     100.00

     Table 7: German sentences with intransitive verbs

cover all occurrences because the analyzed corpus (consisting
mostly of newspaper texts) contains only a very small proportion
of interrogative sentences. We have not investigated the reason
for the surprisingly high number of VS sentences, but it is
certainly an interesting topic for further research. The same
holds for the results in Table 8, where we also found a
relatively high number of sentences with the word order of an
interrogative sentence.

           Word order type    Number          %
               SVO             10,662      50.31
               SOV                193       0.91
               VSO              7,425      35.04
               VOS                690       3.26
               OVS              2,206      10.41
               OSV                 15       0.07
               Total           21,191     100.00

      Table 8: German sentences with a single object

   For German, too, we have not investigated sentences with two
or more objects, owing to annotation inconsistencies.

[Bar chart: vertical axis from 0 to 100; word order types SVO,
SOV, VSO, VOS, OVS, OSV; one series each for English, German
and Czech]

                Figure 8: Comparison of results

   Let us now suggest a formula that might allow us to express
the degree of word order freedom more precisely. Intuitively,
the freer the word order, the more evenly the sentences should
be distributed over all six word order types. The stricter the
word order, the further the values lie from this ideal (the
equal distribution). This leads directly to the application of
a least squares method, here the root mean square deviation
from the equal distribution:

   $$ M = \sqrt{\frac{1}{6} \sum_{i=1}^{6} (V_i - Av)^2} \tag{1} $$

   where M is the proposed measure, V_i is the percentage of the
i-th word order type and Av is the average percentage per word
order type (i.e., 100/6). For the three languages in our
experiment we then get the following values:

   • Czech: 6.82
   • German: 19.20
   • English: 36.79

   These values seem to correspond to the intuition that the
word order of English is strongly fixed, while German and Czech
have a freer word order, with Czech having the highest degree of
word order freedom. If we express the results as percentages of
the value for an absolutely fixed word order (i.e., one of the
word order types accounts for 100% and all the others do not
appear at all, which gives M ≈ 37.27), we get the following
results:


   • Czech: 18.31%
   • German: 51.52%
   • English: 98.73%
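
The computation behind equation (1) and the normalization against an absolutely fixed word order can be illustrated with a short sketch. It is not the authors' implementation; the function name `word_order_measure` and the variable names are assumptions introduced here for illustration only. The input percentages are the German values from Table 8, and the "fixed" distribution is the hypothetical case in which a single word order type accounts for 100% of the sentences.

```python
import math


def word_order_measure(percentages):
    """Root mean square deviation of the percentages from the equal distribution.

    Hypothetical helper illustrating equation (1); it assumes the
    percentages sum to 100, so the average Av is 100 / n.
    """
    n = len(percentages)
    average = 100.0 / n  # Av in equation (1); 100/6 for six word order types
    return math.sqrt(sum((v - average) ** 2 for v in percentages) / n)


# Percentages from Table 8 (German, single object): SVO, SOV, VSO, VOS, OVS, OSV
german = [50.31, 0.91, 35.04, 3.26, 10.41, 0.07]

# Reference case: absolutely fixed word order (one type accounts for 100%)
fixed = [100.0, 0.0, 0.0, 0.0, 0.0, 0.0]

m_german = word_order_measure(german)
m_fixed = word_order_measure(fixed)
relative = 100.0 * m_german / m_fixed  # share of the fixed-order value, in %

print(f"M (German)           = {m_german:.2f}")   # 19.20
print(f"M (fixed word order) = {m_fixed:.2f}")    # 37.27
print(f"relative to fixed    = {relative:.2f}%")  # 51.52%
```

With the Table 8 percentages, the sketch reproduces the German figures reported above (M = 19.20 and 51.52% of the fixed-order value). Since the percentages sum to 100, M coincides with the population standard deviation of the six values.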
5 Conclusions

The experiment described in this paper brought several
interesting results which may be taken as a basis for further
experiments. First of all, it shows that the endeavor to unify
the annotation schemes used for various treebanks in the
HamleDT project provides new opportunities for linguistic
research. The treebank data can now be studied in relation to
other treebanks using a common search tool, yielding results
which do not depend on the peculiarities of individual
annotation schemes.
   These new opportunities have been demonstrated on a
small-scale experiment involving three languages (Czech, German
and English). We have managed to extract quantitative clues
confirming the linguistic hypothesis about the degree of word
order freedom of all three languages under consideration. The
main advantage of our approach is that our research is based on
a large number of sentences of each language and thus provides
a representative sample of actual language usage in a given
genre. In contrast to theoretical linguistic research, our
approach does not concentrate on marginal (though certainly
linguistically interesting) phenomena; it is based on the real
language captured in the treebanks.
   In the future we would like to continue the research in two
directions. One is the obvious endeavor to collect the
statistics for more languages; the other is a more subtle
treatment of linguistic phenomena appearing in treebanks, e.g.,
extending the investigation to subordinate clauses or
interrogative sentences.


Grant support

This paper exploits language data developed and/or distributed
in the frame of the project MŠMT ČR LINDAT/CLARIN (project
LM2010013).