<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Predicting default and non-default aspectual coding: Impact and density of information features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Richter</string-name>
          <email>mprrichter@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tariq Yousef</string-name>
          <email>tariq@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leipzig University</institution>
          ,
          <addr-line>Natural Language Processing Group</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a study on the automatic classification of default and non-default codings of aspect-marked verbs in six Slavic languages and one Baltic language. A Support Vector Machine (SVM) served as classifier, with Shannon Information (SI) and Average Information Content (IC) as verbal features. High classification accuracy was achieved in all languages. In addition, we found indications for the validity of the Uniform Information Density principle within SI and IC.</p>
      </abstract>
      <kwd-group>
        <kwd>Verb aspect</kwd>
        <kwd>coding</kwd>
        <kwd>information content</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>verbs in these languages. As data resource we exploited Universal Dependency
Treebanks in CoNNL-U format (https://universaldependencies.org) because verbal aspect
is encoded in these corpora, as exemplified for the Latvian verb pierādīt ‘prove’ in
figure 1. The token pierādījuši ‘proven’ carries perfective aspect:</p>
      <p>
        What does default and non-default coding mean? Our point of departure is that verbs
have a dominant aspect category and that this category can be determined by frequency
distributions: default forms will occur more frequently than non-default forms. Take as
an example the Polish verb spotkać ‘meet’. This verb form has the default aspect
‘perfective’ while the verb form spotykać carries imperfective aspect and is thus
non-default coded. The Form Frequency Correspondence Principle (henceforth FFC, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) is based on this default/non-default dichotomy. FFC says that default-coded words (in general) tend to be shorter than non-default-coded words and, according to Zipf’s principle of least effort [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], longer words carry more information than shorter ones (otherwise the greater length, that is, the higher effort, would be uneconomic).
      </p>
      <p>
        The second aim of the study is to test whether the Uniform Information Density –
hypothesis (henceforth UIDh [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) holds within the features IC and SI of
the target verbs. This is a novel interpretation of UIDh. The hypothesis says that, cross-linguistically, the amount of information within messages should be uniform: there should be neither extreme peaks nor extreme troughs in the stream of information, in order to facilitate language processing and comprehension. Our research question is: Are there extreme information peaks and troughs within a single linguistic unit which might make the processing of that unit difficult?
      </p>
      <p>
        According to UIDh, the variances in information density in the languages in the
focus of this study should not be far apart. In its original form, UIDh is applied to discrete
signs carrying individual information. We, however, apply UIDh to two different
information values of a single sign. UIDh is formulated within the framework of
Surprisal theory: the difficulty of processing signs of natural language is proportional to their informativity in context [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and signs must not be too informative in order to be
processable. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] states that surprisal is a measure of reranking cost: facing an
unexpected word in a sentence, a (human) sentence processor has to revise his or her incremental expectations, that is, a “shift in the resource allocation (equivalently, in the
conditional probability distribution over interpretations)” is required [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In this study we test the prediction that SI and IC have a uniform information density (UID), that is, that the information values have low variances [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] which tend towards zero.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Although the interaction of IC and coding has, to the best of our knowledge, not yet been studied for natural language, the interaction of IC and word length is the topic of several studies. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] brought to light that IC is a strong predictor of phone deletion in
English. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] showed for ten Indo-European languages that IC, estimated from bigram-,
trigram-, and 4-gram-contexts of the target words, is a better predictor of word length
than frequency. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] ascribe the attested correlation of word length and information
content to the principle of UID: the amount of information over time must be constant,
and it follows that longer word forms must be more informative than short ones.
      </p>
      <p>
        [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] investigated for Arabic, Chinese, English, Finnish, German, Hindi, Persian,
Russian and Spanish, whether the length of words can be better predicted by IC, when
it is estimated from syntactic dependents rather than from unstructured contexts of
target words. Her finding was that words that convey more IC to their contexts tend to be
longer. The study of [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] yielded a contrasting result: for the 30 languages in focus, the lengths of aspect-coded verbs could be better predicted from unigrams than from syntactic contexts.
      </p>
      <p>
        The validity of UIDh has been tested so far only for distinct linguistic units: [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] found, in order to test UIDh, a positive correlation between surprisal and difficulty of signs, which was operationalized by measuring reading times: surprising words in sentences need more time to be read. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] showed, in their study on the omission of the relative pronoun in English relative clauses (RCs), that if the relativiser that is expected and thus carries little information, it tends to be omitted. In cases of unexpected and highly informative RCs, however, that is not omitted: the use of the relativiser signals to the human processor that a relative clause follows, and thus reduces the amount of surprisal and information. Using the example of article omission in German, [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] demonstrated that UID depends on whether information is determined by terminal symbols or by POS tags, and that POS tags provide a better basis for explaining article omission.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>Data resources are the corpora ‘bg_btb-ud-train.csv’ (Bulgarian), ‘cu-ud-train.csv’ (Old Church Slavonic), ‘pl_lfg-ud-train.csv’ (Polish), ‘sk_snk-ud-train.csv’ (Slovak), ‘sl_ssj-ud-train.csv’ (Slovenian), ‘uk_iu-ud-train.csv’ (Ukrainian) and ‘lv_lvtb-ud-train.csv’ (Latvian) from the Universal Dependency Treebank, version 2.3 (https://universaldependencies.org). All aspect-marked verbs were extracted. The number of resulting verb forms for each language is displayed in table 1.</p>
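The extraction step can be sketched as follows; this is not the authors' code but a minimal reading of a CoNLL-U file, keeping every verb token whose FEATS column carries an Aspect feature (column layout per the CoNLL-U specification):

```python
# Sketch (not the authors' code): extract aspect-marked verbs from a
# CoNLL-U treebank file.
def extract_aspect_verbs(path):
    """Yield (lemma, form, aspect) for every verb token with an Aspect feature."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.rstrip("\n")
            # skip comment lines and sentence-separating blank lines
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            # CoNLL-U columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, ...
            if len(cols) < 6 or cols[3] != "VERB":
                continue
            feats = dict(f.split("=", 1) for f in cols[5].split("|") if "=" in f)
            if "Aspect" in feats:
                yield cols[2], cols[1], feats["Aspect"]
```

Applied to the Latvian example from figure 1, this yields the lemma pierādīt, the form pierādījuši and the aspect value Perf.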
        <p>[Table 1: number of extracted verb forms per language; only the Ukrainian row (9,789) is recoverable here.]</p>
      </sec>
      <sec id="sec-3-2">
        <title>Classifier and features</title>
        <p>
          We employed a Support Vector Machine binary classifier with a radial basis function
kernel [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] which utilises IC and SI as features. The aim was to classify the data (aspect-marked verbs) into two categories, default (0) and non-default (1). We used 80% of the data set to train the model, and the rest to assess the quality of the classifier. The estimation of IC is given in (1); it is the average amount of information that a verb form conveys within all of its contexts:
$\mathrm{IC}(w) = -\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w \mid C_i)$ (1)
        </p>
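A minimal sketch of the IC estimate in (1), assuming maximum-likelihood conditional probabilities from hypothetical count tables (the count data below are illustrative, not from the corpora):

```python
import math

# Sketch of equation (1): IC is the mean negative log2 conditional
# probability of a verb form over the N contexts it occurs in.
# pair_counts and context_counts are hypothetical MLE count tables.
def information_content(verb, contexts, pair_counts, context_counts):
    logs = [math.log2(pair_counts[(verb, c)] / context_counts[c]) for c in contexts]
    return -sum(logs) / len(logs)
```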
        <p>
          IC is the expectation value of the negative log of conditional probability of a verb
form w (marked with imperfective or with perfective aspect) given contexts C. As contexts, we took bigrams, i.e. lexical surprisal ([
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]) in both directions of the target verbs, since the study of [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] disclosed that target verbs convey the highest amount of
information in this context window. In (2), the estimation of SI is given [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. SI is the
information of each individual verb form w in its contexts:
$\mathrm{SI}(w) = -\log_2 P(w \mid C)$ (2)
        </p>
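Equation (2) can be sketched in the same way, again with hypothetical MLE counts; SI is the surprisal of one verb form in one bigram context:

```python
import math

# Sketch of equation (2): SI(w) = -log2 P(w | C), the surprisal of a
# single verb form in a single bigram context (hypothetical counts).
def shannon_information(verb, context, bigram_counts, context_counts):
    return -math.log2(bigram_counts[(context, verb)] / context_counts[context])
```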
      </sec>
      <sec id="sec-3-3">
        <title>Default and non-default forms</title>
        <p>For each verb, the default and non-default aspect was determined. We reduced aspect
oppositions to the binary imperfective-perfective distinction and subsumed the habitual
and progressive aspects under the imperfective and the resultative aspect under the
perfective aspect, respectively. Verb forms in the prospective aspect were ignored, since their value with respect to the imperfective-perfective opposition is not clear. We checked for every verb the number of occurrences in the perfective and the imperfective aspect and took the difference of the two counts. The more frequent aspect form was taken as the default aspect of the respective verb lemma.</p>
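The default determination described above can be sketched as a per-lemma frequency count (a minimal illustration, not the authors' implementation):

```python
from collections import Counter

# Sketch of the default determination: count perfective and imperfective
# tokens per lemma; the more frequent aspect is taken as the default.
def default_aspect(tokens):
    """tokens: iterable of (lemma, aspect) pairs, aspect in {'Imp', 'Perf'}."""
    per_lemma = {}
    for lemma, aspect in tokens:
        per_lemma.setdefault(lemma, Counter())[aspect] += 1
    return {lemma: c.most_common(1)[0][0] for lemma, c in per_lemma.items()}
```

On the Polish example from the introduction, a lemma spotkać occurring mostly in perfective tokens would receive ‘Perf’ as its default.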
        <p>The differences were normalized, and ten thresholds in the interval [.09, 1] were set as cut-offs between default and non-default. The threshold 1 was omitted a priori, since it captures cases of verbs occurring in only one aspect form, that is, with either solely perfective or solely imperfective aspect.</p>
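Under our reading of this procedure, the normalization divides the per-lemma frequency difference by the lemma's total frequency, so that a value of 1 means the lemma occurs in only one aspect (excluded a priori); a sketch:

```python
# Sketch of the thresholding step (our reading, not the authors' code):
# the frequency difference is normalized by total frequency, so 1.0
# means the lemma occurs in only one aspect and is excluded a priori.
def normalized_difference(n_perf, n_imp):
    return abs(n_perf - n_imp) / (n_perf + n_imp)

def within_threshold(n_perf, n_imp, threshold):
    d = normalized_difference(n_perf, n_imp)
    return threshold <= d < 1.0
```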
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>We focused on the thresholds in the interval [.19, .59] on the normalised threshold scale, in order to ensure a sufficient number of default and non-default encodings for the training of the SVM classifier. The thresholds in the interval [.59, .99] provided too small a number of non-default aspect-coded verb forms. At the lowest threshold value, i.e. .19, the frequencies of default and non-default coded verbs differ only slightly, and both groups are almost equally distributed. In table 2, the ranges of accuracy values within the interval [.19, .59] for the seven languages in focus are given (left accuracy values for threshold .19, right values for threshold .59):</p>
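The classification setup (SVM with RBF kernel over the two verbal features, 80/20 train/test split) can be sketched with scikit-learn; the feature values below are synthetic stand-ins for the real (SI, IC) pairs, not the corpus data:

```python
# Sketch of the classification setup: SVC with RBF kernel, 80/20 split.
# The two feature columns stand in for (SI, IC); the data is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3.0, 0.5, (100, 2)),   # default-coded forms (label 0)
               rng.normal(6.0, 0.5, (100, 2))])  # non-default forms (label 1)
y = np.array([0] * 100 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
accuracy = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)
```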
      <p>
        It comes to light that the accuracy is almost independent of the threshold and thus of the frequency distribution: even with an almost equal distribution of default and non-default aspect frequencies, that is, with threshold .19, almost perfect accuracy values are achieved. In order to estimate UID, we used (3). More precisely, we utilised the global information density UID_GLOBAL, which is the variance within the information values [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]:
id_i is the information density of SI and IC of a single verb form, and µ is the mean of id:
$\mathrm{UID}_{\mathrm{GLOBAL}} = -\frac{1}{N}\sum_{i=1}^{N} (id_i - \mu)^2$ (3)
      </p>
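Equation (3), the negative variance of the per-form information values, can be sketched directly; values are at most zero by construction, and a uniform information stream yields a value near zero:

```python
# Sketch of equation (3): global UID as the negative variance of the
# per-form information values id_i around their mean mu.
def uid_global(id_values):
    mu = sum(id_values) / len(id_values)
    return -sum((x - mu) ** 2 for x in id_values) / len(id_values)
```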
      <p>Applying (3) to our test set of languages, an identical pattern comes to light in all languages: the variance of information within IC and SI is small, and the majority of variance values is close to zero (note that UID_GLOBAL values are negative by definition). As an illustration, the UID_GLOBAL values of Polish, Slovenian and Latvian are given in figure 2:</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>A classification with high accuracy of default/non-default coding of verbs could be achieved with an SVM classifier and the features SI and IC. As Shannon’s source coding theorem predicts, we found an interaction of aspectual coding and information: our study provides evidence that non-default coded verb forms are more informative than default forms. Almost identical accuracy was achieved with all tested threshold values, and we take this finding as an indication that the amount of information in IC and SI is, on average, constant.</p>
      <p>
        With regard to the second aim, our study disclosed that UIDh holds within the features IC and SI. The variation within the two features tends to be close to zero in all languages in our test set, and our prediction turns out to be correct: both features convey a uniform stream of information throughout the forms of the seven languages in focus. This ensures that information does not become, in the words of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], "dangerously high". The question arises whether UID can be consciously regulated in SI and IC, i.e. whether it is a conscious linguistic behavior. If, for example, a speaker plans to use an unexpected and therefore informative word form, he or she could at the same time decide to use that form in expected contexts which cause little surprisal. Whether the regulation of SI and IC is a conscious linguistic behavior is a question that requires future work in the form of psycholinguistic experiments. A practical application of this study is POS tagging in languages with a fuzzy distinction between word classes, such as Tagalog. This is based on our hypothesis that default/non-default coding correlates with word classes, for instance with the noun/verb distinction. According to this hypothesis, the word class of the default form of a lemma could differ from the word class of a non-default form.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cohen Priva</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Using information content to predict phone deletion</article-title>
          .
          <source>In: Proceedings of the 27th West Coast Conference on Formal Linguistics</source>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>98</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Piantadosi</surname>
            ,
            <given-names>S. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tily</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>E:</given-names>
          </string-name>
          <article-title>Word lengths are optimized for efficient communication</article-title>
          .
          <source>PNAS</source>
          ,
          <volume>108</volume>
          (
          <issue>9</issue>
          ),
          <fpage>3526</fpage>
          -
          <lpage>3529</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Shannon</surname>
            ,
            <given-names>C. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weaver</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>A mathematical theory of communication</article-title>
          .
          <source>The Bell System Technical Journal</source>
          <volume>27</volume>
          (
          <year>1948</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Text categorization with Support Vector Machines: Learning with many relevant features</article-title>
          <year>1998</year>
          ). Retrieved from http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Haspelmath</surname>
            ,
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calude</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spagnol</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narrog</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bamyaci</surname>
          </string-name>
          , E.:
          <article-title>Coding causal noncausal verb alternations: A form-frequency correspondence explanation</article-title>
          .
          <source>Journal of Linguistics</source>
          ,
          <volume>50</volume>
          (
          <issue>3</issue>
          ),
          <fpage>587</fpage>
          -
          <lpage>625</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G. K.</given-names>
          </string-name>
          :
          <article-title>Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology</article-title>
          . Addison-Wesley Press (
          <year>1949</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Genzel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charniak</surname>
          </string-name>
          , E.:
          <article-title>Entropy rate constancy in text</article-title>
          .
          <source>In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          , Philadelphia,
          <year>July 2002</year>
          , pp.
          <fpage>199</fpage>
          -
          <lpage>206</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Aylett</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The Smooth Signal Redundancy Hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech</article-title>
          .
          <source>Language and Speech</source>
          ,
          <volume>47</volume>
          (
          <issue>1</issue>
          ),
          <fpage>31</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaeger</surname>
            ,
            <given-names>T. F.</given-names>
          </string-name>
          :
          <article-title>Speakers optimize information density through syntactic reduction</article-title>
          .
          <source>In: Proceedings of the 20th Conference on Neural Information Processing Systems (NIPS)</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Jaeger</surname>
            ,
            <given-names>T. F.</given-names>
          </string-name>
          :
          <article-title>Redundancy and reduction: Speakers manage syntactic information density</article-title>
          .
          <source>Cognitive Psychology</source>
          ,
          <volume>61</volume>
          (
          <issue>1</issue>
          ),
          <fpage>23</fpage>
          -
          <lpage>62</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hale</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A probabilistic Earley parser as a psycholinguistic model</article-title>
          .
          <source>In: Proceedings of NAACL</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Levy</surname>
          </string-name>
          , R.:
          <article-title>Memory and Surprisal in Human Sentence Comprehension</article-title>
          . In: van Gompel,
          <string-name>
            <surname>R</surname>
          </string-name>
          . (ed.)
          <source>Sentence Processing</source>
          , pp.
          <fpage>78</fpage>
          -
          <lpage>114</lpage>
          . Psychology Press, Hove (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Collins,
          <string-name>
            <surname>M. X.</surname>
          </string-name>
          :
          <article-title>Information density and dependency length as complementary cognitive models</article-title>
          .
          <source>Journal of Psycholinguistic Research</source>
          ,
          <volume>43</volume>
          (
          <issue>5</issue>
          ),
          <fpage>651</fpage>
          -
          <lpage>681</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Levchina</surname>
          </string-name>
          , N.:
          <article-title>Communicative efficiency and syntactic predictability: A crosslinguistic study based on the universal dependencies corpora</article-title>
          .
          <source>In: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kyogoku</surname>
          </string-name>
          . Y.,
          <string-name>
            <surname>Kölbl</surname>
          </string-name>
          . M.:
          <article-title>Interaction of Information Content and Frequency as predictors of verbs' lengths</article-title>
          . In: Abramowicz ,
          <string-name>
            <given-names>W.</given-names>
            ,
            <surname>Corchuelo</surname>
          </string-name>
          ,
          <string-name>
            <surname>R</surname>
          </string-name>
          . (eds.)
          <source>Business Information System. 22nd International Conference, BIS</source>
          <year>2019</year>
          , Seville, Spain, June 26-28,
          <year>2019</year>
          , Proceedings,
          <source>Part I (Lecture Notes in Business Information Processing 353)</source>
          , pp.
          <fpage>271</fpage>
          -
          <lpage>282</lpage>
          . Springer (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Levy</surname>
          </string-name>
          . R.:
          <article-title>Expectation-based syntactic comprehension</article-title>
          .
          <source>Cognition</source>
          ,
          <volume>106</volume>
          :
          <fpage>1126</fpage>
          -
          <lpage>1177</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Horch</surname>
          </string-name>
          , E., Reich, I.:
          <article-title>On “Article Omission” in German and the “Uniform Information Density Hypothesis”</article-title>
          .
          <source>In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS</source>
          <year>2016</year>
          ), pp.
          <fpage>125</fpage>
          -
          <lpage>127</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>