<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Word Length in Tatar: Selecting Relevant Parameters for Modeling</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Äytelmägän wasɪyät / Unspoken Testament</institution>
          ,
          <addr-line>Chapter 1</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kazan Federal University</institution>
          ,
          <addr-line>Kremlyovskaya St, 18, 420008 Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Kɪzɪl çäçäklär / The Red Flowers</institution>
          ,
          <addr-line>Chapter 1</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper studies word length in the Tatar language on the basis of fiction texts (the sample includes both prose and poetry). Word length is a stochastic phenomenon that depends on many factors, including language type, text organization and addressee; however, internal linguistic laws govern the parameters of word length and word frequencies, and the issue comprises both universal and language-specific features. We found that the proportions of words of different lengths differ across individual texts, and that the most common words are those composed of 5 phonemes and 2 syllables. We evaluated word length in Tatar texts and attempted to fit a model based on the Poisson distribution (specifically, the one-displaced Poisson-uniform distribution), so the description of the empirical data is complemented by fitted theoretical values of word frequencies. In addition, Shannon's entropy of word lengths was evaluated, and a weak correlation between the average word length and entropy was found.</p>
      </abstract>
      <kwd-group>
        <kwd>word length</kwd>
        <kwd>syllable</kwd>
        <kwd>Poisson distribution</kwd>
        <kwd>the Tatar language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Rigorously designed computational models can support theoretical proposals by
exposing their limitations or internal inconsistencies, and they
contribute to improving theoretical approaches by imposing strict requirements on
them. The length of linguistic items is a significant formal feature of languages,
with applications in text processing, spell-checking algorithms, language
teaching, etc. Word length is studied by specialists in linguistics and text analysis as well as by
mathematicians and statisticians working on related issues. Basic approaches to word
length are presented in papers by P. Grzybek [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], G. Altmann [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and I. Popescu and
colleagues [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Word length can be measured by the number of phonemes, morphemes or syllables
in a word, depending on the research goals. The proportion of words of different lengths depends on
the language type: for example, synthetic languages have a higher
morpheme-to-word ratio than analytic languages [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The analytic structure of English (a great
number of auxiliary verbs, prepositions and articles) accounts for the
large number of one-syllable words, so the distribution of English word lengths can be
approximated by a geometric distribution [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ]. The Tatar language uses
agglutination to express syntactic relationships within a sentence, which, together with the lack of
articles, significantly restricts the number of one-syllable words, limiting them to root
words. Languages thus differ in the distribution of words
of different lengths, which calls for different approaches to modeling word length.
      </p>
      <p>Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        Abundant linguistic data on word length provide great opportunities for selecting
parameters and fitting models (Zipf's law, the Menzerath-Altmann law, approximation by
various distributions, etc.). Behind the superficial simplicity of the concept
of word length, a number of surprises may hide, “so here nothing helps but incessant
testing, modeling, different viewing of data, modification of hypotheses, collecting of
data from new languages, etc. Every “new” language can falsify a beloved theory or
force us to modify it” [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Few studies are devoted specifically to word length in Turkic languages,
although some data on Turkish are presented in overviews and papers covering
multilingual data (for example, in [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]). In the dissertation by L. Rizvanova, certain aspects of
word length in Tatar related to functional styles are considered [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. On a special page
of the Corpus of Written Tatar, a list of Tatar wordforms sorted by length is
presented [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>The aim of this paper is to evaluate word length in Tatar empirically and to model
it using the Poisson distribution. Tatar fiction texts serve as the empirical source; the
written texts were converted into a phonologically relevant form to allow counting the
number of phonemes and syllables per word.</p>
    </sec>
    <sec id="sec-2">
      <title>Word length in Tatar texts</title>
      <p>
        We examined the distribution of words of different lengths in 10 Tatar fiction texts
(5 prose and 5 poetry texts; brief information on the selected texts
is given in the Appendix). The written texts were brought into a standard form in which
one letter corresponds to one sound, and special rules were set to convert Tatar texts into a phonologically
relevant form. When tokenizing, co-compounds (such as ata-ana ('father' + 'mother')
'parents' [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) were regarded as individual words.
      </p>
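      <p>A minimal sketch of how the two length measures could be counted once a word is in the one-letter-per-sound form described above. The vowel set and the helper name word_length are illustrative assumptions, not the authors' conversion rules; the sketch also assumes that every vowel letter forms exactly one syllable nucleus.</p>
      <preformat>
```python
import unicodedata

# Assumed vowel inventory of the Latin transcription (illustrative, not taken from the paper).
VOWELS = set(unicodedata.normalize("NFC", "aäeiɪoöuü"))


def word_length(word):
    """Return (phonemes, syllables) for a word written with one letter per sound."""
    # NFC normalization keeps letters such as "ä" as single code points.
    normalized = unicodedata.normalize("NFC", word.lower())
    # Hyphens and other non-letters are skipped.
    letters = [ch for ch in normalized if ch.isalpha()]
    phonemes = len(letters)
    # Assumption: one syllable per vowel letter.
    syllables = sum(1 for ch in letters if ch in VOWELS)
    return phonemes, syllables


print(word_length("çäçäklär"))
print(word_length("ürdäk"))
```
      </preformat>
      <p>Under these assumptions, the word çäçäklär is counted as 8 phonemes and 3 syllables, and ürdäk as 5 phonemes and 2 syllables.</p>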
      <p>Figure 1 shows the number of words of different lengths in the text by A. Eniki;
word length is measured in phonemes. This text contains 2,169 words with
lengths from 1 to 14 phonemes. Words of length 5 are the most frequent.</p>
      <p>Words in the text by A. Eniki contain 4,962 syllables; the words are composed of one
to six syllables, and two-syllable words are the most frequent (see
Figure 2).</p>
      <p>The distribution of words by the number of phonemes and the number of
syllables is shown in Figure 3. The most frequent are words consisting of 4 and 5
phonemes divided into 2 syllables.</p>
      <table-wrap>
        <caption>
          <p>Word length (WL) in phonemes in the text by A. Eniki: proportion of words of each length (figure data labels).</p>
        </caption>
        <table>
          <thead>
            <tr><th>WL in phonemes</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th><th>13</th><th>14</th></tr>
          </thead>
          <tbody>
            <tr><td>Proportion of words</td><td>0.007</td><td>0.081</td><td>0.132</td><td>0.163</td><td>0.219</td><td>0.127</td><td>0.101</td><td>0.087</td><td>0.042</td><td>0.023</td><td>0.008</td><td>0.005</td><td>0.003</td><td>0.001</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        We computed the proportions of words consisting of different numbers of phonemes in the
selected texts and then calculated the entropy [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] of word length; the results are
presented in Table 1.
      </p>
      <p>We found a weak correlation between the mean word length and the entropy: the
correlation coefficient is 0.57. In the sample examined, the average
entropy of word length is higher in prose (3.135) than in poetry (2.923). This suggests that issues related to
entropy in Tatar texts require further study.</p>
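      <p>The entropy computation can be sketched as follows, using the proportions of words with 1 to 14 phonemes reported in this section for the text by A. Eniki. The logarithm base (2, giving bits) and the renormalization of the rounded proportions are assumptions; the paper does not state them.</p>
      <preformat>
```python
import math

# Proportions of words with 1..14 phonemes, text by A. Eniki (values reported in this section).
eniki_props = [0.007, 0.081, 0.132, 0.163, 0.219, 0.127, 0.101,
               0.087, 0.042, 0.023, 0.008, 0.005, 0.003, 0.001]


def shannon_entropy(props):
    """Shannon entropy in bits; rounded proportions are renormalized to sum to 1."""
    total = sum(props)
    return -sum((p / total) * math.log2(p / total) for p in props if p > 0)


h = shannon_entropy(eniki_props)
print(round(h, 3))
```
      </preformat>
      <p>Under these assumptions the result is about 3.1, which is of the same order as the average of 3.135 reported above for prose.</p>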
    </sec>
    <sec id="sec-3">
      <title>Towards modeling</title>
      <p>
        A great variety of distributions have been tested in word length studies, and the Poisson
distribution, which can be modified in different ways, is often the focus of researchers.
The Poisson distribution is a very simple and, in many cases, sufficient means of modeling.
Researchers recommend that the Poisson distribution, either in its usual form, displaced to
the right, or truncated at zero (positive Poisson distribution), should be
used at the very beginning of any investigation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In particular, the one-displaced
Poisson-uniform distribution was used by V. Kromer to model German texts [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        We take the same route and rely on the approach of V. Kromer [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] for modeling
word length in Tatar. V. Kromer proposed a mathematical model of word length
based on the Čebanov-Fucks distribution [
        <xref ref-type="bibr" rid="ref3 ref9">3, 9</xref>
        ] with equal distributions of the
parameter. The Čebanov-Fucks distribution is a modification of the known Poisson
distribution when the obligatory (first) syllable is not taken into account:
      </p>
      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>P_x = \frac{(\lambda - 1)^{x-1}}{(x-1)!}\, e^{-(\lambda - 1)}, \qquad x = 1, 2, 3, \ldots,</tex-math>
        </disp-formula>
where P<sub>x</sub> is the probability that a textual word has length x, and λ is the
distribution parameter [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. The parameter can be estimated by the mean word length in the
text (λ<sub>0</sub>), so it is strictly determined by the text data.
      </p>
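      <p>Why the mean word length is a natural estimate of λ can be seen from the expectation of the distribution in formula (1). A short derivation, writing k = x - 1 and using the fact that an ordinary Poisson variable with parameter λ - 1 has mean λ - 1:</p>
      <preformat>
```latex
\mathrm{E}[x] = \sum_{x=1}^{\infty} x \, P_x
  = \sum_{k=0}^{\infty} (k + 1) \, \frac{(\lambda - 1)^{k}}{k!} \, e^{-(\lambda - 1)}
  = (\lambda - 1) + 1
  = \lambda .
```
      </preformat>
      <p>Equating this expectation with the sample mean gives the moment estimate of λ from the text data.</p>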
      <p>We computed the probability of occurrence of a textual word of each
length and obtained theoretical frequencies of words of each length
in each text. The results are given in Table 2 (fitted values with parameter λ<sub>0</sub>).</p>
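      <p>A minimal sketch of this step for one text, assuming the observed syllable-length counts listed for the text by A. Eniki in Table 2. The function name displaced_poisson_pmf is illustrative, and the parameter is taken as the mean number of syllables per word computed from these counts.</p>
      <preformat>
```python
import math


def displaced_poisson_pmf(x, lam):
    """P_x of formula (1): a Poisson law with mean lam - 1, shifted to start at x = 1."""
    mu = lam - 1.0
    return mu ** (x - 1) / math.factorial(x - 1) * math.exp(-mu)


# Observed numbers of words with 1..6 syllables (text by A. Eniki, Table 2).
observed = [390, 1030, 567, 171, 23, 2]
n_words = sum(observed)

# Parameter estimate: mean word length in syllables.
lam0 = sum(x * f for x, f in enumerate(observed, start=1)) / n_words

# Theoretical frequencies: total number of words times the model probability.
expected = [n_words * displaced_poisson_pmf(x, lam0)
            for x in range(1, len(observed) + 1)]

print(round(lam0, 3))
print([round(e) for e in expected])
```
      </preformat>
      <p>With these counts the sketch gives a parameter of about 2.273 and fitted values of 611, 778, 495, 210, 67 and 17, which agree with the values listed for Eniki in Table 2.</p>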
      <p>Then the parameter λ<sub>0</sub> was replaced by values of λ from the interval [λ<sub>0</sub> - 0.1 * λ<sub>0</sub>,
λ<sub>0</sub> + 0.1 * λ<sub>0</sub>], taken as a sequence of 150 values (length.out = 150), and theoretical values of
word occurrences were computed for each value. The discrepancy between
observed and theoretical data was then evaluated by the Pearson chi-square statistic χ<sup>2</sup>, and
the best-fitting value (λ*) was selected. The results for 5 texts are presented in
Table 2.</p>
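      <p>The search for λ* can be sketched as a simple grid search. The sketch assumes an equally spaced grid of 150 values and sums the chi-square statistic over the six observed length classes, without pooling sparse classes or adding a tail class; the paper does not specify these details, so the resulting statistic need not reproduce the χ<sup>2</sup> value in Table 2. The observed counts are those listed for the text by A. Eniki, and all names are illustrative.</p>
      <preformat>
```python
import math

# Observed numbers of words with 1..6 syllables (text by A. Eniki, Table 2).
observed = [390, 1030, 567, 171, 23, 2]
n_words = sum(observed)
lam0 = sum(x * f for x, f in enumerate(observed, start=1)) / n_words


def expected_counts(lam):
    """Theoretical frequencies under the one-displaced Poisson law of formula (1)."""
    mu = lam - 1.0
    return [n_words * mu ** k / math.factorial(k) * math.exp(-mu)
            for k in range(len(observed))]


def chi_square(lam):
    # Assumption: the sum runs over the observed length classes only.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected_counts(lam)))


# 150 equally spaced candidate values on [0.9 * lam0, 1.1 * lam0].
n_points = 150
grid = [0.9 * lam0 + 0.2 * lam0 * i / (n_points - 1) for i in range(n_points)]

# The best-fitting value minimizes the chi-square statistic on the grid.
lam_star = min(grid, key=chi_square)
print(round(lam0, 3), round(lam_star, 3), round(chi_square(lam_star), 1))
```
      </preformat>
      <p>A grid search is used here only because it mirrors the description above; a one-dimensional numerical minimizer would serve the same purpose.</p>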
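      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Observed and fitted numbers of words by word length (WL) in syllables.</p>
        </caption>
        <table>
          <thead>
            <tr><th>WL in syllables</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>λ</th><th>χ<sup>2</sup></th><th>Total</th></tr>
          </thead>
          <tbody>
            <tr><td>Eniki, observed</td><td>390</td><td>1030</td><td>567</td><td>171</td><td>23</td><td>2</td><td></td><td></td><td></td></tr>
            <tr><td>Eniki, fitted</td><td>611</td><td>778</td><td>495</td><td>210</td><td>67</td><td>17</td><td>2.273</td><td>401.6</td><td>2183</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>We empirically evaluated word length in Tatar texts and attempted to fit a model
based on the Poisson distribution. Word lengths were measured in phonemes and
in syllables.</p>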
      <p>Texts are not homogeneous because of internal rules of self-organization, so
the proportions of words of different lengths differ across individual texts; the most
common words are those composed of 5 phonemes and 2 syllables.</p>
      <p>Shannon's entropy of word lengths was evaluated, and a weak correlation between
the average word length and the entropy was found (correlation coefficient
0.57). In the examined sample of texts, the average entropy of word
lengths is higher in prose than in poetry, which may be due to
the requirements of poetic meter.</p>
      <p>To model the lengths of Tatar words, the one-displaced Poisson-uniform
distribution was used. Although the results are generally consistent with those described in the literature for other
languages, there is a significant discrepancy between
the observed and fitted values, so other modifications of the Poisson distribution, as
well as other distributions, need to be tested in further research.</p>
      <p>The results of the study of word length in Tatar can help in developing
applications for style and register detection, authorship analysis and language teaching,
and can also be used in theoretical studies of language structure and complexity.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The work was carried out according to the Russian Government Program of
Competitive Growth of Kazan Federal University.</p>
    </sec>
    <sec id="sec-5">
      <title>Appendix</title>
      <sec id="sec-5-1">
        <title>Basic information on the texts processed</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th>No</th><th>Author</th><th>Title</th><th>Genre</th><th>Volume in words</th><th>Number of syllables</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>Eniki, Amirkhan</td><td>Äytelmägän wasɪyät / Unspoken Testament, Chapter 1</td><td>Novel, prose</td><td>2,169</td><td>4,962</td></tr>
              <tr><td>2</td><td>Tukay, Gabdulla</td><td>Şüräle / Forest Spirit</td><td>fairy tale in verse</td><td>925</td><td>1,917</td></tr>
              <tr><td>3</td><td>Tukay, Gabdulla</td><td>Su anasɪ / Aquatic Woman</td><td>fairy tale in verse</td><td>419</td><td>854</td></tr>
              <tr><td>4</td><td>Tukay, Gabdulla</td><td>Käcä belän sarɪk äkiyäte / The tale of the goat and the ram</td><td>fairy tale in verse</td><td>579</td><td>1,211</td></tr>
              <tr><td>5</td><td>Amirkhan, Fatikh</td><td>Häyät / Hayat, Chapter 1</td><td>Novel, prose</td><td>548</td><td>1,310</td></tr>
              <tr><td>6</td><td>Ibrahimov, Galimjan</td><td>Kɪzɪl çäçäklär / The Red Flowers, Chapter 1</td><td>Novel, prose</td><td>444</td><td>1,085</td></tr>
              <tr><td>7</td><td>Alish, Abdulla</td><td>Sertotmas ürdäk / The Talkative Duck</td><td>fairy tale for children, prose</td><td>917</td><td>2,093</td></tr>
              <tr><td>8</td><td>Gilman, Galimdzhan</td><td>Oçraşu / Meeting</td><td>Story, prose</td><td>1,014</td><td>2,351</td></tr>
              <tr><td>9</td><td>Suleyman</td><td>Dürt mizgel / Four moments</td><td>poem</td><td>355</td><td>815</td></tr>
              <tr><td>10</td><td>Zulfat</td><td></td><td>poem</td><td>163</td><td>360</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Altmann</surname>
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Aspects of word length</article-title>
          .
          <source>Issues in Quantitative Linguistics</source>
          ,
          <volume>3</volume>
          ,
          <fpage>23</fpage>
          -
          <lpage>38</lpage>
          . RAM-Verlag, Lüdenscheid
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Elderton</surname>
            ,
            <given-names>W.P.:</given-names>
          </string-name>
          <article-title>A Few Statistics on the Length of English Words</article-title>
          .
          <source>Journal of the Royal Statistical Society, series A (general)</source>
          ,
          <volume>112</volume>
          ,
          <fpage>436</fpage>
          -
          <lpage>445</lpage>
          . Wiley, New Jersey (
          <year>1949</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fucks</surname>
          </string-name>
          , W.:
          <article-title>Matematicheskaja teorija slovoobrazovanija [Mathematical theory of word formation]</article-title>
          . In:
          <article-title>Teorija peredachi soobshhenij (Trudy 3 mezhdunarodnoj konferencii</article-title>
          )
          <source>[Theory of messaging (3rd conference proceedings)]</source>
          ,
          <fpage>221</fpage>
          -
          <lpage>247</lpage>
          . Foreign Languages, Moscow (
          <year>1957</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Galieva</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suleymanov</surname>
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Tatar Co-compounds as a Special Type of Classifiers</article-title>
          .
          <source>In: Proceedings of the XVII EURALEX International Congress: Lexicography and Linguistic Diversity</source>
          ,
          <fpage>678</fpage>
          -
          <lpage>684</lpage>
          . Ivane Javakhishvili Tbilisi State University, Tbilisi (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Greenberg</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A Quantitative Approach to the Morphological Typology of Language</article-title>
          .
          <source>International Journal of American Linguistics</source>
          ,
          <volume>26</volume>
          (
          <issue>3</issue>
          ),
          <fpage>178</fpage>
          -
          <lpage>194</lpage>
          . The University of Chicago, Illinois (
          <year>1960</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Grzybek</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>History and methodology of word length studies</article-title>
          . In:
          <article-title>Contributions to the Science of Text and Language</article-title>
          .
          <source>Word Length Studies and Related Issues</source>
          ,
          <fpage>15</fpage>
          -
          <lpage>90</lpage>
          . Springer, Heidelberg (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kromer</surname>
          </string-name>
          , V.:
          <article-title>About Word Length Distribution</article-title>
          . In:
          <article-title>Contributions to the Science of Text and Language</article-title>
          .
          <source>Word Length Studies and Related Issues</source>
          ,
          <fpage>199</fpage>
          -
          <lpage>210</lpage>
          . Springer, Heidelberg (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kromer</surname>
          </string-name>
          , V.:
          <article-title>Word length model based on the one-displaced Poisson-uniform distribution</article-title>
          .
          <source>Glottometrics</source>
          ,
          <volume>1</volume>
          ,
          <fpage>87</fpage>
          -
          <lpage>96</lpage>
          . RAM-Verlag, Lüdenscheid (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Piotrovskij</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bektaev</surname>
            ,
            <given-names>K.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piotrovskaja</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Matematicheskaja lingvistika</article-title>
          [Mathematical linguistics].
          <source>High School</source>
          , Moscow (
          <year>1977</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>I.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kelih</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rovenchak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Word length: aspects and languages</article-title>
          .
          <source>Issues in Quantitative Linguistics</source>
          ,
          <volume>3</volume>
          ,
          <fpage>224</fpage>
          -
          <lpage>281</lpage>
          . RAM-Verlag, Lüdenscheid (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Rizvanova</surname>
            ,
            <given-names>L.M.:</given-names>
          </string-name>
          <article-title>Kvantitativnaja harakteristika tatarskogo slova: na materiale otdel'nyh funkcional'nyh stilej [Quantitative features of the Tatar word: on material of functional styles]. Candidate thesis in philology</article-title>
          . Kazan University, Kazan (
          <year>1996</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Shannon</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <article-title>A Mathematical Theory of Communication</article-title>. <source>Bell System Technical Journal</source>
          <volume>27</volume>
          (
          <issue>3</issue>
          ),
          <fpage>379</fpage>
          -
          <lpage>423</lpage>
          . Wiley, New Jersey (
          <year>1948</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Statistics. Corpus of Written Tatar, www.corpus.tatar/index_en.php?of=stat_en.htm,
          last accessed
          <year>2020</year>
          /11/13.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>