Of crowds and corpora: A marriage of measures

Emmanuel Keuleers, Paweł Mandera, Michaël Stevens, & Marc Brysbaert
Department of Experimental Psychology
Ghent University
Henri Dunantlaan 2, 9000 Gent, Belgium
{emmanuel.keuleers, pawel.mandera, michael.stevens, marc.brysbaert}@ugent.be

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org

Abstract

We discuss the relationship between a word's corpus frequency and its prevalence (the proportion of people who know the word) and show that they are complementary measures. We show that adding word prevalence as a predictor of lexical decision reaction time in the Dutch Lexicon Project increases explained variance by more than 10%. In addition, we show that, for the same dataset, word prevalence is the best independent predictor of word processing time.

1 Introduction

Word frequency is one of the most important measures in the cognitive study of word processing, both theoretically and methodologically. Its contribution to explaining behavioural measures such as reaction time is so large that researchers take great care in collecting large and reliable corpora and in applying the best possible word frequency estimates in their research.

1.1 Where the corpus is weak the crowd is strong

A drawback of frequency counts is that, regardless of corpus size, lower counts are unreliable. As an example, consider asking a random sample of 100 people whether they know each of the word types that occur just once in a large corpus. Although the frequency of all these types is equal, the number of judges knowing each word will vary from zero to one hundred. As the judges are language users, words known to many of them may be considered to occur more often in language than words known to fewer of them. Following this reasoning, an estimate of the number of language users who know a word, or word prevalence, may give a better indication of occurrence than corpus frequency counts.

1.2 Where the corpus is strong the crowd is weak

On the other hand, consider presenting the same random sample of people with words from the language's core vocabulary. Since these words will be known to all of the judges, prevalence will be singularly high and uninformative. In this case corpus counts should be a much better estimate of occurrence.

2 Testing the prevalence measure

To test the complementarity of prevalence and frequency as measures of occurrence, we used prevalence norms for Dutch collected through a lexical decision task presented as an online vocabulary test (Keuleers, Stevens, Mandera, & Brysbaert, in press). Each participant saw 100 stimuli (about 70 words and 30 nonwords) selected randomly from a list of 54,319 words and 21,734 nonwords. In the current analysis, we used the data of 190,771 participants who indicated that they were living in Belgium, giving us about 250 observations per word. We fitted a Rasch model (a mathematical model that simultaneously ranks participants by ability and test items by difficulty) to the data and took the resulting score for each word as an operationalization of its prevalence.

Table 1 shows that the correlation between prevalence and frequency was relatively low (.34), giving further evidence that prevalence is distinct from word frequency and from contextual diversity (a word's document count), which correlates very highly with word frequency.
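To make the Rasch model mentioned in this section more concrete, the following is a minimal sketch in Python. It is not the fitting procedure used for the prevalence norms, which is not described here. The function names, the toy response data, the gradient-ascent fitting, and the small ridge penalty that keeps estimates finite are all assumptions made for illustration. Under the Rasch model, the probability that a participant answers an item correctly depends only on the difference between the participant's ability and the item's difficulty, so a word with a lower estimated difficulty can be read as a word known to more people.

```python
import math


def rasch_probability(ability, difficulty):
    """Rasch model: probability of a correct response given ability and difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_rasch(responses, n_persons, n_items, steps=500, rate=0.05, ridge=0.01):
    """Estimate abilities and difficulties by gradient ascent (illustrative only).

    responses: list of (person_index, item_index, correct), correct in {0, 1}.
    Responses may be sparse: not every person needs to see every item.
    The ridge penalty is an assumption that keeps perfect scorers finite.
    """
    ability = [0.0] * n_persons
    difficulty = [0.0] * n_items
    for _ in range(steps):
        # Gradient of the penalized log-likelihood, starting with the penalty term.
        grad_ability = [-ridge * a for a in ability]
        grad_difficulty = [-ridge * d for d in difficulty]
        for person, item, correct in responses:
            residual = correct - rasch_probability(ability[person], difficulty[item])
            grad_ability[person] += residual
            grad_difficulty[item] -= residual
        ability = [a + rate * g for a, g in zip(ability, grad_ability)]
        difficulty = [d + rate * g for d, g in zip(difficulty, grad_difficulty)]
    return ability, difficulty


# Hypothetical toy data: 4 participants, 3 words (1 = word judged as known).
toy_matrix = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
responses = [
    (person, item, correct)
    for person, row in enumerate(toy_matrix)
    for item, correct in enumerate(row)
]

ability, difficulty = fit_rasch(responses, n_persons=4, n_items=3)

# Negated difficulty serves here as a prevalence-like score: higher = known to more people.
prevalence_like = [-d for d in difficulty]
print(prevalence_like)
```

In this toy example the first word is known to three of the four participants and the last word to only one, so the first word receives the lowest estimated difficulty. With real data, each participant responds to only a small subset of the items, which is why the sketch accepts a sparse list of responses rather than a full matrix.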
Figure 1 shows the complementary relation between the SUBTLEX-NL word frequencies (based on a 42-million-word corpus of film and television subtitles; see Keuleers, Brysbaert, & New, 2010) and the prevalence measure obtained from the online vocabulary test. Higher z-scores indicate more prevalent words. The dark lines in the bottom half of the plot indicate words with singularly low frequencies over a large range of prevalence. The elongated cluster at the right side of the plot shows words with nearly full prevalence over large frequency ranges.

Figure 1: The relationship between frequency and prevalence. Word frequency is displayed as Zipf-score (log frequency per billion words; Van Heuven et al., 2014).

In addition, we investigated the relationship between prevalence and other typical measures of word frequency. Table 1 gives an overview of these correlations.

                      Frequency  Prevalence  OLD20  Length
Frequency                  1.00        0.35  -0.34   -0.37
Prevalence                 0.35        1.00   0.00    0.07
OLD20                     -0.34        0.00   1.00    0.74
Length                    -0.37        0.07   0.74    1.00
Contextual Diversity       0.98        0.36  -0.34   -0.35

Table 1: Correlations between main predictors of Lexical Decision RT in the Dutch Lexicon Project.

Finally, we used the data from the 7,885 items in the Dutch Lexicon Project (Keuleers et al., 2010) for which both frequency and prevalence were available to examine the contributions of Dutch corpus word frequency (SUBTLEX-NL, Keuleers et al., 2010) and word prevalence to average reaction times.

In single variable analyses, log word frequency explained about 36.13% of the variance in reaction times and prevalence explained about 33.03% of the variance in reaction times.

The complementarity of the two measures was also clear when both were considered in the same analysis, where they jointly explained 51.37% of the variance in reaction times. The unique contributions to explained variance (eta-squared) were 27.39% for frequency and 23.87% for prevalence. In further analyses, we found that including the quadratic trend of word frequency and contextual diversity did not substantially alter this pattern of results.

3 Conclusion

The results show that, next to word frequency, prevalence is by far the most important independent contributor to visual word recognition times, suggesting that prevalence should be included in any analysis where word corpus frequency is considered to be relevant. However, several questions remain open. First, what is the influence of corpus size on the relation between corpus word frequency and prevalence and on the contribution of prevalence to lexical processing? Second, how well does prevalence perform on other tasks and in other languages? Finally, does the effect of prevalence on word processing truly lie in a better measurement of word occurrence or does it partly reflect an independent property associated with the learnability of a word?

Acknowledgments

The text of this abstract is an early summary of findings from a larger study reported in the Quarterly Journal of Experimental Psychology as Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment (Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M., in press).

References

Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … Treiman, R. (2007). The English lexicon project. Behavior Research Methods, 39(3), 445–459.

Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3), 643–650. doi:10.3758/BRM.42.3.643

Keuleers, E., Diependaele, K., & Brysbaert, M. (2010). Practice Effects in Large-Scale Visual Word Recognition Studies: A Lexical Decision Study on 14,000 Dutch Mono- and Disyllabic Words and Nonwords. Frontiers in Psychology, 1. doi:10.3389/fpsyg.2010.00174

Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2011). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287–304. doi:10.3758/s13428-011-0118-4

Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (in press). Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology. doi:10.1080/17470218.2015.1022560