Zipfian discrimination

Jim Blevins, University of Cambridge, jpb39@cam.ac.uk
Petar Milin, University of Novi Sad / Eberhard Karls Universität Tübingen, petar.milin@uni-tuebingen.de
Michael Ramscar, Eberhard Karls Universität Tübingen, michael.ramscar@uni-tuebingen.de

Copyright © by the paper’s authors. Copying permitted for private and academic purposes. In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, Pisa, March -April , , published at http://ceur-ws.org

This talk outlines how form variation can be modelled in terms of equilibria between two dominant communicative pressures. The pressure to discriminate forms of a language enhances differences between expressions. Unchecked, this pressure can in principle lead to suppletion of the kind reported in languages such as Yélî Dnye (Henderson ). However, in most languages, the pressure towards maximally discriminative expressions is countered by the need to extrapolate from sparse input. It has long been known that corpora provide only partial coverage of the forms of a language (inflectional and derivational). This talk presents evidence that the shortfall is far greater and far more systematic than previously appreciated, and that the coverage of form variation remains sparse in corpora of up to one billion words. The sampling reported in this talk suggests that the forms in a corpus or encountered by a speaker exhibit a Zipfian distribution at all sample sizes.

The interaction of these pressures also accounts for the role of lexical neighbourhoods. Since most paradigms will be only partially attested, the organization of paradigms into neighbourhoods provides an analogical base for extrapolation.

The role of discriminability

It is usually assumed that regularity in a linguistic system is desirable or normative and that suppletion and other irregularities represent deviations from the uniform patterns that systems (or their speakers) strive to maintain. From a discriminative perspective, the situation is exactly reversed. To the extent that patterns like suppletion enhance the discriminability of forms, they contribute to the communicative efficiency of a language. In a discriminative model, such as that of Ramscar et al. (), the only difference between overtly suppletive forms such as mouse/mice and more regular forms such as rat/rats is that the former serve to accelerate the rate at which a speaker’s representation of a specific form/meaning contrast becomes discriminated from the form classes that express similar contrasts. Thus all learning serves to increase the level of suppletion in form-meaning mappings.

Moreover, standard cases of ‘suppletion’ are merely extreme instances of discriminative contrasts that seem ubiquitous at the sub-phonemic level. In the domain of word formation, Davis et al. () found suggestive differences in duration and fundamental frequency between a word like captain and a morphologically unrelated onset word such as cap. Of more direct relevance are studies of inflectional formations. Baayen et al. () found that a sample of speakers produced Dutch nouns with a longer mean duration when they occurred as singulars than when they occurred as the stem of the corresponding plural. In a follow-up study, Kemps et al. () tested speakers’ sensitivity to prosodic differences, and concluded that “acoustic differences exist between uninflected and inflected forms and that listeners are sensitive to them” (Kemps et al. : ). Recent studies by Plag et al. () find similar contrasts between phonemically identical affixes in English.

The status of regularity

From a discriminative perspective, it is regularity that stands in need of explanation. Learning models offer a solution here as well. Unlike derivational processes, inflectional processes are traditionally assumed to be highly productive, defining uniform paradigms within a given class. Lemma size is thus not expected to vary, except where forms are unavailable due to paradigm ‘gaps’ or ‘defectiveness’. Yet corpus studies suggest that this expectation is an idealization. Many potentially available inflected forms are unattested in corpora. As corpora increase in size, they do not converge on uniformly populated paradigms. Instead, they reinforce previously attested forms and classes while introducing progressively fewer new units. As shown in Figure , the number of attested inflected noun variants decreases in all random samples, ranging from -million to -million hits, at which point the -million word StdeWaC corpus is essentially exhausted. As sample size increases, there is a marked attenuation in the steepness of the slope, though it never becomes completely flat. This trend is extracted and presented in Figure , which plots number of attested forms on the X-axis and slopes of six trends from Figure  on the Y-axis. From this relationship we can infer that even if the corpus size were increased to infinity, it would never contain all possible inflected forms of every German noun. As shown in Figure , the forms of a language obey Zipf’s law at all sample sizes. Speakers must be able to extrapolate from a partial – often sparse – sample of their language, and regular patterns subserve this need.

It takes a neighbourhood

In order for a collection of partial samples to allow the generation of unattested forms, the forms that speakers do know must be organized into systematic structures that collectively enable the scope of possible variations to be realized. These structures correspond to lexical neighbourhoods, whose effects have been investigated in a wide range of psycholinguistic studies (Baayen et al. ; Gahl et al. ). From the present perspective, neighbourhoods are not independent dimensions of lexical organization but, rather, constitute the creative engine of the morphological system, permitting the extrapolation of the full system from partial patterns. Interesting support for this perspective comes from the study reported in Milin et al. (). In this study, analogical extrapolation from a small set of nearest neighbors allowed a system to model the choice of masculine instrumental singular allomorph by Serbian speakers presented with nonce words. Regular paradigms thus enable language users to generate previously unencountered forms, not because they are the product of an explicit rule, or of any kind of explicit grammatical knowledge, but rather because they are implicit in the distribution of forms and semantics in the language as a system, much as suggested by Hockett (: ):

    in his analogizing … [t]he native user of the language … operates in terms of all sorts of internally stored paradigms, many of them doubtless only partial

References

Baayen, R. H., Feldman, L. B. & Schreuder, R. (). Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language , –.

Baayen, R. H., McQueen, J. M., Dijkstra, T. & Schreuder, R. (). Frequency effects in regular inflectional morphology: Revisiting Dutch plurals. In Baayen, R. H. & Schreuder, R. (eds.), Morphological Structure in Language Processing, Berlin: Mouton de Gruyter, –.

Davis, M., Marslen-Wilson, W. D. & Gaskell, M. (). Leading up the lexical garden-path: Segmentation and ambiguity in spoken word recognition. Journal of Experimental Psychology: Human Perception & Performance , –.

Gahl, S., Yao, Y. & Johnson, K. (). Why reduce? Phonological neighborhood density and phonetic reduction in spontaneous speech. Journal of Memory and Language (), –.

Henderson, J. E. (). Phonology and Grammar of Yele, Papua New Guinea. Pacific Linguistics B-, Canberra: Pacific Linguistics.

Hockett, C. F. (). The Yawelmani basic verb. Language , –.

Kemps, J. J. K., Rachèl, Ernestus, M., Schreuder, R. & Baayen, R. H. (). Prosodic cues for morphological complexity: The case of Dutch plural nouns. Memory & Cognition (), –.

Milin, P., Keuleers, E. & Filipović Đurdjević, D. (). Allomorphic responses in Serbian pseudo-nouns as a result of analogical learning. Acta Linguistica Hungarica , –.

Plag, I., Homan, J. & Kunter, G. (). Homophony and morphology: The acoustics of word-final S in English. Ms, Heinrich-Heine-Universität, Düsseldorf.

Ramscar, M., Dye, M. & McCauley, S. M. (). Error and expectation in language learning: The curious absence of mouses in adult speech. Language (), –.

[Figure : The paradigm non-filling pattern. X-axis: number of noun inflectional variants (1 to 4); Y-axis: log-count of nouns; one trend per sample size (1M, 3M, 6M, 9M, 12M, 15M).]

[Figure : Asymptoting slopes. X-axis: number of forms (1M, 3M, 6M, 9M, 12M, 15M); Y-axis: slope estimates for log-count of nouns.]

[Figure : Zipf plot for randomly sampled words. X-axis: m; Y-axis: E[Vm]. Sample sizes (and number of hapax legomena): 1M (1107), 3M (2305), 6M (3187), 9M (8035), 12M (8633), 15M (7365).]
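
The sampling argument made under "The status of regularity" can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not the authors' procedure and not their corpus data: the toy lexicon, the four variants per lemma, the Zipf exponent of 1, and the sample sizes are all hypothetical choices made for illustration. It draws tokens from a Zipfian distribution over inflected forms and tabulates, for nested samples of increasing size, how many lemmas have exactly k attested variants, along with the number of attested types and hapax legomena.

```python
import random
from collections import Counter


def zipf_weights(n_types, exponent=1.0):
    """Unnormalised Zipfian weights: the r-th ranked type gets 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_types + 1)]


def build_toy_lexicon(n_lemmas, n_variants, rng):
    """Return shuffled (lemma, variant) pairs; list position serves as frequency rank."""
    forms = [(lemma, variant)
             for lemma in range(n_lemmas)
             for variant in range(n_variants)]
    rng.shuffle(forms)
    return forms


def paradigm_filling(tokens, n_variants):
    """Map k to the number of lemmas with exactly k distinct variants attested."""
    attested = {}
    for lemma, variant in tokens:
        attested.setdefault(lemma, set()).add(variant)
    filling = Counter(len(variants) for variants in attested.values())
    return {k: filling.get(k, 0) for k in range(1, n_variants + 1)}


def hapax_count(tokens):
    """Number of form types that occur exactly once in the sample."""
    return sum(1 for count in Counter(tokens).values() if count == 1)


# Hypothetical settings (assumptions, not values from the talk).
N_LEMMAS, N_VARIANTS = 5000, 4
rng = random.Random(0)

forms = build_toy_lexicon(N_LEMMAS, N_VARIANTS, rng)
stream = rng.choices(forms, weights=zipf_weights(len(forms)), k=100_000)

results = {}
for size in (1_000, 10_000, 100_000):
    sample = stream[:size]  # nested samples: each one extends the previous
    results[size] = {
        "filling": paradigm_filling(sample, N_VARIANTS),
        "types": len(set(sample)),
        "hapaxes": hapax_count(sample),
    }
    print(size, results[size])
```

Using nested prefixes of one token stream means the set of attested forms can only grow with sample size, which makes it easy to see how slowly paradigms fill under a Zipfian distribution in this toy setting.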