=Paper= {{Paper |id=Vol-1347/paper06 |storemode=property |title=Zipfian discrimination |pdfUrl=https://ceur-ws.org/Vol-1347/paper06.pdf |volume=Vol-1347 |dblpUrl=https://dblp.org/rec/conf/networds/BlevinsMR15 }} ==Zipfian discrimination== https://ceur-ws.org/Vol-1347/paper06.pdf
                                           Zipfian discrimination
                         Jim Blevins                              Petar Milin
                   University of Cambridge                   University of Novi Sad
                       jpb39@cam.ac.uk                 Eberhard Karls Universität Tübingen
                                                        petar.milin@uni-tuebingen.de

                                             Michael Ramscar
                                     Eberhard Karls Universität Tübingen
                                   michael.ramscar@uni-tuebingen.de

This talk outlines how form variation can be modelled in terms of equilibria between two dominant communicative pressures. The pressure to discriminate forms of a language enhances differences between expressions. Unchecked, this pressure can in principle lead to suppletion of the kind reported in languages such as Yélî Dnye (Henderson ). However, in most languages, the pressure towards maximally discriminative expressions is countered by the need to extrapolate from sparse input. It has long been known that corpora provide only a partial coverage of the forms of a language (inflectional and derivational). This talk presents evidence that the shortfall is far greater and far more systematic than previously appreciated, and that the coverage of the form variation remains sparse in corpora of up to one billion words. The sampling reported in this talk suggests that the forms in a corpus or encountered by a speaker exhibit a Zipfian distribution at all sample sizes.

The interaction of these pressures also accounts for the role of lexical neighbourhoods. Since most paradigms will be only partially attested, the organization of paradigms into neighbourhoods provides an analogical base for extrapolation.

The status of regularity

It is usually assumed that regularity in a linguistic system is desirable or normative and that suppletion and other irregularities represent deviations from the uniform patterns that systems (or their speakers) strive to maintain. From a discriminative perspective, the situation is exactly reversed. To the extent that patterns like suppletion enhance the discriminability of forms, they contribute to the communicative efficiency of a language. In a discriminative model, such as that of Ramscar et al. (), the only difference between overtly suppletive forms such as mouse/mice and more regular forms such as rat/rats is that the former serve to accelerate the rate at which a speaker’s representation of a specific form/meaning contrast becomes discriminated from the form classes that express similar contrasts. Thus all learning serves to increase the level of suppletion in form-meaning mappings.

Moreover, standard cases of ‘suppletion’ are merely extreme instances of discriminative contrasts that seem ubiquitous at the sub-phonemic level. In the domain of word formation, Davis et al. () found suggestive differences in duration and fundamental frequency between a word like captain and a morphologically unrelated onset word such as cap. Of more direct relevance are studies of inflectional formations. Baayen et al. () found that a sample of speakers produced Dutch nouns with a longer mean duration when they occurred as singulars than when they occurred as the stem of the corresponding plural. In a follow-up study, Kemps et al. () tested speakers’ sensitivity to prosodic differences, and concluded that “acoustic differences exist between uninflected and inflected forms and that listeners are sensitive to them” (Kemps et al. : ). Recent studies by Plag et al. () find similar contrasts between phonemically identical affixes in English.

The role of discriminability

From a discriminative perspective, it is regularity that stands in need of explanation. Learning models offer a solution here as well. Unlike derivational processes, inflectional processes are traditionally assumed to be highly productive, defining uniform paradigms within a given class. Lemma size is thus not expected to vary, except where forms are unavailable due to paradigm ‘gaps’ or ‘defectiveness’. Yet corpus studies suggest that this expectation is an idealization. Many potentially available inflected forms are unattested in corpora. As corpora increase in size, they do not converge on uniformly populated paradigms. Instead, they reinforce previously attested forms and classes while introducing progressively fewer new units. As shown in

                  Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
 In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final
                            Conference, Pisa, March -April , , published at http://ceur-ws.org
Figure , the number of attested inflected noun               References
variants decreases in all random samples, ranging
                                                             Baayen, R. H., Feldman, L. B. & Schreuder, R.
from -million to -million hits, at which point
                                                               (). Morphological influences on the recog-
the -million word StdeWaC corpus is essen-
                                                               nition of monosyllabic monomorphemic words.
tially exhausted. As sample size increases, there is
                                                               Journal of Memory and Language , –.
a marked attenuation in the steepness of the slope
steepness, though it never becomes completely flat.           Baayen, R. H., McQueen, J. M., Dijkstra, T. &
This trend is extracted and presented in Figure ,             Schreuder, R. (). Frequency effects in reg-
which plots number of attested forms on the X-                 ular inflectional morphology: Revisiting Dutch
axis and slopes of six trends from Figure  on the             plurals. In Baayen, R. H. & Schreuder, R. (eds.),
Y-axis. From this relationship we can infer that               Morphological Structure in Language Processing,
even if the corpus size were increased to infinity,             Berlin: Mouton de Gruyter, –.
it would never contain all possible inflected forms           Davis, M., Marslen-Wilson, W. D. & Gaskell, M.
of every German noun. As shown in Figure , the               (). Leading up the lexical garden-path: Seg-
forms of a language obey Zipf ’s law at all sample            mentation and ambiguity in spoken word recog-
sizes. Speakers must be able to extrapolate from a            nition. Journal of Experimental Psychology: Hu-
partial – often sparse – sample of their language,            man Perception & Performance , –.
and regular patterns subserve this need.
                                                             Gahl, S., Yao, Y. & Johnson, K. (). Why re-
                                                              duce? Phonological neighborhood density and
                                                              phonetic reduction in spontaneous speech. Jour-
It takes a neighbourhood
                                                              nal of Memory and Language (), –.
In order for a collection of partial samples to al-          Henderson, J. E. (). Phonology and Grammar
low the generation of unattested forms, the forms             of Yele, Papua New Guinea. Pacific Linguistics B-
that speakers do know must be organized into sys-             , Camberra: Pacific Linguistics.
tematic structures that collectively enable the scope        Hockett, C. F. (). The Yawelmani basic verb.
of possible variations to be realized. These struc-           Language , –.
tures correspond to lexical neigbourhoods, whose
                                                             Kemps, J. J. K., Rachèl, Ernestus, M., Schreuder, R.
effects have been investigated in a wide range of
                                                               & Baayen, R. H. (). Prosodic cues for mor-
psycholinguistic studies (Baayen et al. ; Gahl
                                                               phological complexity: The case of Dutch plural
et al. ). From the present perspective, neigh-
                                                               nouns. Memory & Cognition (), –.
bourhoods are not independent dimensions of lex-
ical organization but, rather, constitute the cre-           Milin, P., Keuleers, E. & Filipović Đurdjević,
ative engine of the morphological system, permit-             D. (). Allomorphic responses in Serbian
ting the extrapolation of the full system from par-           pseudo-nouns as a result of analogical learning.
tial patterns. Interesting support for this perspec-          Acta Linguistica Hungarica , –.
tive comes from the study reported in Milin et al.           Plag, I., Homan, J. & Kunter, G. (). Ho-
(). In this study, analogical extrapolation from           mophony and morphology: The acoustics of
a small set of nearest neighbors allowed a system to           word-final S in English. Ms, Heinrich-Heine-
model the choice of masculine instrumental singu-              Universität, Düsseldorf.
lar allomorph by Serbian speakers presented with             Ramscar, M., Dye, M. & McCauley, S. M. ().
nonce words. Regular paradigms thus enable lan-                Error and expectation in language learning: The
guage users to generate previously unencountered               curious absence of mouses in adult speech. Lan-
forms, not because they are the product of an ex-              guage (), –.
plicit rule, or of any kind of explicit grammatical
knowledge, but rather they are implicit in the dis-
tribution of forms and semantics in the language as
a system, much as suggested by Hockett (: ).


     in his analogizing … [t]he native user
     of the language … operates in terms of
     all sorts of internally stored paradigms,
     many of them doubtless only partial


[Line plot. Y-axis: log-count of nouns (ticks from 5.0 to 12.5). X-axis: number of noun inflectional variants (1 to 4). One line per sample size: 1M, 3M, 6M, 9M, 12M, 15M.]

Figure : The paradigm non-filling pattern



[Plot. Y-axis: slope estimates for log-count of nouns (ticks from −3.0 to −1.5). X-axis: number of forms (1M, 3M, 6M, 9M, 12M, 15M).]

Figure : Asymptoting slopes




[Zipf plot. Y-axis: E[Vm] (ticks from 2M to 8M). X-axis: m (1 to 12, continuing). Sample sizes (and number of hapax legomena): 1M (1107), 3M (2305), 6M (3187), 9M (8035), 12M (8633), 15M (7365).]

Figure : Zipf plot for randomly sampled words



