Predicting default and non-default aspectual coding: Impact and density of information features*†

Michael Richter and Tariq Yousef
Leipzig University, Natural Language Processing Group
mprrichter@gmail.com, tariq@informatik.uni-leipzig.de

Abstract. This paper presents a study on the automatic classification of default and non-default codings of aspect-marked verbs in six Slavic languages and one Baltic language. A Support Vector Machine (SVM) served as classifier, with Shannon Information (SI) and Average Information Content (IC) as verbal features. High classification accuracy was achieved in all languages. In addition, we found indications for the validity of the Uniform Information Density principle within SI and IC.

Keywords: Verb aspect, coding, information content.

1 Introduction

The first aim of the present study is to test whether default and non-default coding of aspect-marked verbs in the six Slavic languages Bulgarian, Old Church Slavonic, Polish, Slovak, Slovenian and Ukrainian and, in addition, the Baltic language Latvian can be classified automatically by two verbal information features, namely (i) Average Information Content (henceforth ‘IC’; [1], [2]) and (ii) Shannon Information (henceforth ‘SI’; [3]). The aim and the choice of the two information features are motivated by Shannon’s source coding theorem [3] on the interaction of information, coding and length of signs within binary alphabets. We formulate the following research question: can Shannon’s theorem be transferred to natural languages, and does the coding of aspect-marked verbs interact with the information that they carry? As classifier for the binary classification task, that is, the classification of aspect-marked verbs into a default and a non-default class, we employed a Support Vector Machine (henceforth SVM, [4]).
The choice of the test set of languages is motivated by the overt marking of aspect on verbs in these languages. As data resource we used Universal Dependencies treebanks in CoNLL-U format (https://universaldependencies.org) because verbal aspect is encoded in these corpora, as exemplified for the Latvian verb pierādīt ‘prove’ in figure 1. The token pierādījuši ‘proven’ carries perfective aspect:

Fig. 1. Corpus entry of the Latvian verb pierādīt ‘prove’ with aspect information.

* Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), project number 357550571.
† This work, ‘Predicting default and non-default aspectual coding: Impact and density of information features’, is the extended version of an abstract under the same title presented at KONVENS 2019 and, at the time of the release of the proceedings of NL4AI, published in the Preliminary Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019): Kaleidoscope Abstracts, 275–277 (2019), by Michael Richter and Tariq Yousef (https://creativecommons.org/licenses/by-nc-sa/4.0/).
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

What do default and non-default coding mean? Our point of departure is that verbs have a dominant aspect category and that this category can be determined from frequency distributions: default forms occur more frequently than non-default forms. Take as an example the Polish verb spotkać ‘meet’. This verb form has the default aspect ‘perfective’, while the verb form spotykać carries imperfective aspect and is thus non-default coded. The Form Frequency Correspondence Principle (henceforth FFC, [5]) is based on this default/non-default dichotomy.
FFC says that default-coded words (in general) tend to be shorter than non-default-coded words and, according to Zipf’s principle of least effort [6], longer words carry more information than shorter ones (otherwise the greater length, that is, the higher effort, would be uneconomic).

The second aim of the study is to test whether the Uniform Information Density hypothesis (henceforth UIDh; [7], [8], [9], [10]) holds within the features IC and SI of the target verbs. This is a novel interpretation of UIDh. The hypothesis says that the amount of information within messages should be uniform cross-linguistically, and that the stream of information should contain neither extreme peaks nor extreme troughs, in order to facilitate language processing and comprehension. Our research question is: are there extreme information peaks and troughs within a single linguistic unit which might make the processing of that unit difficult?

According to UIDh, the variances in information density in the languages in the focus of this study should not be far apart. In its original form, UIDh is applied to discrete signs carrying individual information. We, however, apply UIDh to two different information values of a single sign. UIDh is formulated within the framework of Surprisal theory: the difficulty of processing a sign of natural language is proportional to its informativity in context [11], and signs must not be too informative in order to be processable. [12] states that surprisal is a measure of reranking cost: facing an unexpected word in the sentence, a (human) sentence processor has to revise his or her incremental expectations, that is, a “shift in the resource allocation (equivalently, in the conditional probability distribution over interpretations)” is required [12]. In this study we test the prediction that SI and IC have a uniform information density (UID), that is, the variances of the information values should not be high [13] and should tend towards zero.
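The notion of surprisal referred to above can be made concrete with a small sketch. It is not the authors' implementation: it assumes a toy token list, takes only the preceding word as context, and estimates the conditional probability by relative bigram frequency without smoothing. The function name `bigram_surprisal` is hypothetical.

```python
import math
from collections import Counter


def bigram_surprisal(tokens):
    """Return (previous_word, word, surprisal_in_bits) for every bigram.

    Assumption: P(word | previous_word) is estimated as
    count(previous_word, word) / count(previous_word), with no smoothing.
    """
    contexts = tokens[:-1]
    targets = tokens[1:]
    context_counts = Counter(contexts)
    bigram_counts = Counter(zip(contexts, targets))

    result = []
    for prev, word in zip(contexts, targets):
        p = bigram_counts[(prev, word)] / context_counts[prev]
        # Surprisal: the less probable a word is in its context,
        # the more bits of information it carries.
        result.append((prev, word, -math.log2(p)))
    return result


if __name__ == "__main__":
    toy_tokens = "a b a b a c".split()  # toy data, not taken from the corpora
    for prev, word, bits in bigram_surprisal(toy_tokens):
        print(f"{word!r} after {prev!r}: {bits:.3f} bits")
```

In this toy sequence, a word that always follows the same context receives a surprisal of zero bits, while the rare continuation receives the highest value, which is the kind of information peak that UIDh predicts should be avoided.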
2 Related work

Although the interaction of IC and coding has, to the best of our knowledge, not yet been studied for natural language, the interaction of IC and word length is the topic of several studies. [1] showed that IC is a strong predictor of phone deletion in English. [2] showed for ten Indo-European languages that IC, estimated from bigram, trigram and 4-gram contexts of the target words, is a better predictor of word length than frequency. [2] ascribe the attested correlation of word length and information content to the principle of UID: the amount of information over time must be constant, and it follows that longer word forms must be more informative than short ones. [14] investigated for Arabic, Chinese, English, Finnish, German, Hindi, Persian, Russian and Spanish whether the length of words can be better predicted by IC when it is estimated from syntactic dependents rather than from unstructured contexts of target words. Her finding was that words that convey more IC to their contexts tend to be longer. The study of [15] yielded a controversial result: for the 30 languages in focus, the lengths of aspect-coded verbs could be better predicted by unigrams than by syntactic contexts.

The validity of UIDh has so far been tested only for distinct linguistic units: in order to test UIDh, [12] and [16] examined the relation between surprisal and the processing difficulty of signs, operationalised by measuring reading times, and found a positive correlation: surprising words in sentences need more time to be read. [9] showed in their study on the omission of the relative pronoun ‘that’ in English relative clauses (RC) that, if ‘that’ is expected and thus carries little information, it tends to be omitted. However, in cases of unexpected and highly informative RCs, ‘that’ is not omitted: the use of the relativiser signals to the human processor that a relative clause follows, and thus reduces the amount of surprisal and information.
Using the example of article omission in German, [17] demonstrated that UID depends on whether information is determined by terminal symbols or by POS tags, and that POS tags provide a better basis for explaining article omission.

3 Method

3.1 Data

Data resources are the corpora ‘bg_btb-ud-train.csv’ (Bulgarian), ‘cu-ud-train.csv’ (Old Church Slavonic), ‘pl_lfg-ud-train.csv’ (Polish), ‘sk_snk-ud-train.csv’ (Slovak), ‘sl_ssj-ud-train.csv’ (Slovenian), ‘uk_iu-ud-train.csv’ (Ukrainian) and ‘lv_lvtb-ud-train.csv’ (Latvian) from the Universal Dependencies treebanks, version 2.3 (https://universaldependencies.org). All aspect-marked verbs were extracted. The number of resulting verb forms for each language is displayed in table 1.

Table 1. The number of verb forms in the test set of languages.

language               number of verb forms
Bulgarian              13,714
Latvian                17,046
Old Church Slavonic     9,575
Polish                 17,199
Slovak                 11,749
Slovenian              11,629
Ukrainian               9,789

3.2 Classifier and features

We employed a Support Vector Machine binary classifier with a radial basis function kernel [4], which uses IC and SI as features. The aim was to classify the data (aspect-marked verbs) into two categories, default (0) and non-default (1). We used 80% of the data set to train the model and the remaining 20% to assess the quality of the classifier. The estimation of IC is given in (1); it is the average amount of information that a verb form conveys within all of its contexts:

$$IC = E\left(-\log_2 P(W = w \mid C = c_i)\right) \tag{1}$$

IC is the expected value of the negative log of the conditional probability of a verb form w (marked with imperfective or with perfective aspect) given contexts C. As contexts, we took bigrams, i.e. lexical surprisal ([11], [16]), in both directions of the target verbs, since a study of [15] disclosed that target verbs convey the highest amount of information in this context window. In (2), the estimation of SI is given [3].
SI is the information of each individual verb form w in its contexts:

$$SI = -\log_2\left(P(W = w \mid \mathit{context})\right) \tag{2}$$

3.3 Default and non-default forms

For each verb, the default and non-default aspect was determined. We reduced aspect oppositions to the binary imperfective-perfective distinction and subsumed the habitual and progressive aspects under the imperfective and the resultative aspect under the perfective aspect, respectively. Verb forms in the prospective aspect were ignored, since their value is not clear with respect to the imperfective-perfective opposition. For every verb, we counted the number of occurrences in perfective and in imperfective aspect and took the difference between the two counts. The more frequent aspect form was taken as the default aspect of the respective verb lemma. The differences were normalised, and ten thresholds in the range [.09, 1] were set as differences between default and non-default. The threshold 1 was omitted a priori, since it captures verbs occurring in only one aspect form, that is, either solely perfective or solely imperfective.

4 Results

We focused on the thresholds in the interval [.19, .59] on the normalised threshold scale in order to ensure a sufficient number of default and non-default encodings for the training of the SVM classifier. The thresholds in the interval [.59, .99] yielded too small a number of non-default aspect-coded verb forms. At the lowest threshold value, i.e. .19, the frequencies of default and non-default coded verbs differ only slightly and both groups are almost equally distributed. Table 2 gives the range of accuracy values within the interval [.19, .59] for the seven languages in focus (left values for threshold .19, right values for threshold .59):

Table 2. Range of classification accuracy for the seven languages in our study.

language               accuracy (%)
Bulgarian              99.5 – 99.8
Old Church Slavonic    94.3 – 97.8
Polish                 99.7 – 99.9
Slovak                 99.5 – 99.6
Slovenian              100 – 100
Ukrainian              99.1 – 100
Latvian                98.3 – 99.5

The accuracy turns out to be almost independent of the threshold and thus of the frequency distribution: even with an almost equal distribution of default and non-default aspect frequencies, that is, with threshold .19, almost perfect accuracy values are achieved.

In order to estimate UID, we used (3). More precisely, we used the global information density UID_GLOBAL, which is the negated variance of the information values [13]; id_i is the information density of SI and IC of a single verb form, µ is the mean of id, and N is the number of information values:

$$UID_{GLOBAL} = -\frac{1}{N}\sum_{i=1}^{N}\left(id_i - \mu\right)^2 \tag{3}$$

Applying (3) to our test set of languages, an identical pattern emerges in all languages: the variance of information within IC and SI is small, and the majority of variance values are close to zero (note that UID_GLOBAL values are negative by definition). As illustration, UID_GLOBAL of Polish, Slovenian and Latvian is given in figure 2:

Fig. 2. UID_GLOBAL in Polish, Slovenian and Latvian.

5 Conclusion

A classification of default/non-default coding of verbs with high accuracy could be achieved with an SVM classifier and the features SI and IC. As Shannon’s source coding theorem predicts, we found an interaction of aspectual coding and information: our study provides evidence that non-default coded verb forms are more informative than default forms. Almost identical accuracy was achieved with all tested threshold values, and we take this finding as an indication that the amount of information of IC and SI is, on average, constant. With regard to the second aim, our study disclosed that UIDh holds within the features IC and SI. The variation within the two features tends to be close to zero in all languages in our test set, and our prediction turns out to be correct: both features convey a uniform stream of information throughout the forms of the seven languages in focus.
This ensures that information does not become, in the words of [9], "dangerously high". The question arises whether UID can be consciously regulated in SI and IC, i.e. whether it is a conscious linguistic behavior. If, for example, a speaker plans to use an unexpected and therefore informative word form, he or she could at the same time decide to use that form in expected contexts which cause little surprisal. Whether regulation of SI and IC is a conscious linguistic behavior is a question that requires future work in the form of psycholinguistic experiments.

A practical application of this study is POS tagging in languages with a fuzzy distinction between word classes, such as Tagalog. This is based on our hypothesis that default/non-default coding correlates with word classes, for instance with the noun/verb distinction. According to this hypothesis, the word class of the default form of a lemma could differ from the word class of a non-default form.

References

1. Cohen Priva, U.: Using information content to predict phone deletion. In: Proceedings of the 27th West Coast Conference on Formal Linguistics, pp. 90–98 (2008).
2. Piantadosi, S. T., Tily, H., Gibson, E.: Word lengths are optimized for efficient communication. PNAS, 108(9), 3526–3529 (2011).
3. Shannon, C. E., Weaver, W.: A mathematical theory of communication. The Bell System Technical Journal 27 (1948).
4. Joachims, T.: Text categorization with Support Vector Machines: Learning with many relevant features (1998). Retrieved from http://www.cs.cornell.edu/people/tj/publications/joachims_98a.pdf.
5. Haspelmath, M., Calude, A., Spagnol, M., Narrog, H., Bamyaci, E.: Coding causal-noncausal verb alternations: A form–frequency correspondence explanation. Journal of Linguistics, 50(3), 587–625 (2014).
6. Zipf, G. K.: Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Addison-Wesley Press (1949).
7. Genzel, D., Charniak, E.: Entropy rate constancy in text. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 199–206 (2002).
8. Aylett, M., Turk, A.: The Smooth Signal Redundancy Hypothesis: A functional explanation for relationships between redundancy, prosodic prominence, and duration in spontaneous speech. Language and Speech, 47(1), 31–56 (2004).
9. Levy, R., Jaeger, T. F.: Speakers optimize information density through syntactic reduction. In: Proceedings of the 20th Conference on Neural Information Processing Systems (NIPS) (2007).
10. Jaeger, T. F.: Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1), 23–62 (2010).
11. Hale, J.: A probabilistic Earley parser as a psycholinguistic model. In: Proceedings of NAACL, pp. 1–8 (2001).
12. Levy, R.: Memory and Surprisal in Human Sentence Comprehension. In: van Gompel, R. (ed.) Sentence Processing, pp. 78–114. Psychology Press, Hove (2013).
13. Collins, M. X.: Information density and dependency length as complementary cognitive models. Journal of Psycholinguistic Research, 43(5), 651–681 (2014).
14. Levshina, N.: Communicative efficiency and syntactic predictability: A crosslinguistic study based on the universal dependencies corpora. In: Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017) (2017).
15. Richter, M., Kyogoku, Y., Kölbl, M.: Interaction of Information Content and Frequency as predictors of verbs' lengths. In: Abramowicz, W., Corchuelo, R. (eds.) Business Information Systems. 22nd International Conference, BIS 2019, Seville, Spain, June 26–28, 2019, Proceedings, Part I (Lecture Notes in Business Information Processing 353), pp. 271–282. Springer (2019).
16. Levy, R.: Expectation-based syntactic comprehension. Cognition, 106, 1126–1177 (2008).
17. Horch, E., Reich, I.: On “Article Omission” in German and the “Uniform Information Density Hypothesis”. In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), pp. 125–127 (2016).
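As a supplementary illustration of Sections 3.3 and 4, the following sketch shows one way the default-aspect labelling and the global information density could be computed. It is a sketch under stated assumptions, not the authors' code: the paper does not specify how the frequency differences were normalised, so the sketch assumes |imperfective − perfective| / (imperfective + perfective); it further assumes that a lemma is kept when this value reaches the threshold and is below 1, and that UID_GLOBAL is the negated population variance, following the description of (3). The function names and the aspect labels "Imp" and "Perf" are hypothetical.

```python
from collections import Counter, defaultdict


def label_default_aspect(observations, threshold):
    """Map each lemma to (default_aspect, normalised_difference).

    observations: iterable of (lemma, aspect) pairs, aspect in {"Imp", "Perf"}.
    Assumption: the difference is normalised by the total count, and a lemma
    is kept only if threshold <= difference < 1.
    """
    counts = defaultdict(Counter)
    for lemma, aspect in observations:
        counts[lemma][aspect] += 1

    labels = {}
    for lemma, by_aspect in counts.items():
        imp, perf = by_aspect["Imp"], by_aspect["Perf"]
        diff = abs(imp - perf) / (imp + perf)
        # Skip ties (no dominant aspect) and lemmas attested in one aspect
        # only (difference of 1), as well as differences below the threshold.
        if imp == perf or diff >= 1.0 or diff < threshold:
            continue
        labels[lemma] = ("Imp" if imp > perf else "Perf", diff)
    return labels


def uid_global(info_values):
    """Negated population variance of a sequence of information values."""
    values = list(info_values)
    mu = sum(values) / len(values)
    return -sum((v - mu) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    # Toy observations with hypothetical lemma names, not corpus data.
    toy = ([("lemma_a", "Perf")] * 8 + [("lemma_a", "Imp")] * 2
           + [("lemma_b", "Imp")] * 3)
    print(label_default_aspect(toy, threshold=0.19))
    print(uid_global([1.0, 3.0]))
```

Values of `uid_global` closer to zero correspond to a more uniform distribution of information, which is the pattern reported for IC and SI above.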