=Paper= {{Paper |id=Vol-1347/paper02 |storemode=property |title=Of crowds and corpora: a marriage of measures |pdfUrl=https://ceur-ws.org/Vol-1347/paper02.pdf |volume=Vol-1347 |dblpUrl=https://dblp.org/rec/conf/networds/KeuleersMSB15 }} ==Of crowds and corpora: a marriage of measures== https://ceur-ws.org/Vol-1347/paper02.pdf
                   Of crowds and corpora: A marriage of measures
         Emmanuel Keuleers, Paweł Mandera, Michaël Stevens, & Marc Brysbaert
                        Department of Experimental Psychology
                                  Ghent University
                        Henri Dunantlaan 2, 9000 Gent, Belgium
                   {emmanuel.keuleers, pawel.mandera,
              michael.stevens, marc.brysbaert}@ugent.be




Abstract

We discuss the relationship between a word's corpus frequency and its prevalence (the proportion of people who know the word) and show that they are complementary measures. We show that adding word prevalence as a predictor of lexical decision reaction time in the Dutch Lexicon Project increases explained variance by more than 10%. In addition, we show that, for the same dataset, word prevalence is the best independent predictor of word processing time.

1 Introduction

Word frequency is one of the most important measures in the cognitive study of word processing, both theoretically and methodologically. Its contribution in explaining behavioural measures such as reaction time is so large that researchers take great care in collecting large and reliable corpora and in applying the best possible word frequency estimates in their research.

1.1 Where the corpus is weak the crowd is strong

A drawback of frequency counts is that, regardless of corpus size, lower counts are unreliable. As an example, consider asking a random sample of 100 people whether they know each of the word types that occur just once in a large corpus. Although the frequency of all these types is equal, the number of judges knowing each word will vary from zero to one hundred. As the judges are language users, words known to many of them may be considered to occur more often in language than words known to fewer of them. Following this reasoning, an estimate of the number of language users who know a word, or word prevalence, may give a better indication of occurrence than corpus frequency counts.

1.2 Where the corpus is strong the crowd is weak

On the other hand, consider presenting the same random sample of people with words from the language's core vocabulary. Since these words will be known to all of the judges, prevalence will be uniformly high and therefore uninformative. In this case corpus counts should be a much better estimate of occurrence.

2 Testing the prevalence measure

To test the complementarity of prevalence and frequency as measures of occurrence, we used prevalence norms for Dutch collected through a lexical decision task presented as an online vocabulary test (Keuleers, Stevens, Mandera, & Brysbaert, in press). Each participant saw 100 stimuli (about 70 words and 30 nonwords) selected randomly from a list of 54,319 words and 21,734 nonwords. In the current analysis, we used the data of 190,771 participants who indicated that they were living in Belgium, giving us about 250 observations per word. The score for a word obtained by fitting a Rasch model (a mathematical model that simultaneously ranks participants by ability and test items by difficulty) to the data was considered an operationalization of its prevalence.
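The Rasch model mentioned in Section 2 can be illustrated with a minimal sketch. The text does not describe how the model was actually fitted, so everything below is an assumption made for illustration: the function names, the toy response matrix, and the crude joint maximum-likelihood fit by gradient ascent are hypothetical, and taking the negated item difficulty as a prevalence-like score is only one plausible reading of "the score for a word".

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a 'yes, I know this word' response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_rasch(responses, n_iter=500, learning_rate=0.05):
    """Crude joint maximum-likelihood fit by gradient ascent (illustrative only).

    responses[p][i] is 1 (word known), 0 (word not known), or None when
    item i was not presented to participant p.
    """
    n_participants = len(responses)
    n_items = len(responses[0])
    ability = [0.0] * n_participants
    difficulty = [0.0] * n_items
    for _ in range(n_iter):
        grad_ability = [0.0] * n_participants
        grad_difficulty = [0.0] * n_items
        for p, row in enumerate(responses):
            for i, response in enumerate(row):
                if response is None:
                    continue  # skip items this participant never saw
                residual = response - rasch_probability(ability[p], difficulty[i])
                grad_ability[p] += residual      # d logL / d ability
                grad_difficulty[i] -= residual   # d logL / d difficulty
        ability = [a + learning_rate * g for a, g in zip(ability, grad_ability)]
        difficulty = [d + learning_rate * g for d, g in zip(difficulty, grad_difficulty)]
        # Fix the scale's origin: centre difficulties at zero and shift abilities
        # by the same amount, which leaves every predicted probability unchanged.
        shift = sum(difficulty) / n_items
        ability = [a - shift for a in ability]
        difficulty = [d - shift for d in difficulty]
    return ability, difficulty


# Hypothetical toy data: 5 participants, 3 words (known by 4, 3, and 1 of them).
toy_responses = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]
abilities, difficulties = fit_rasch(toy_responses)

# Assumption: a prevalence-like score is the negated difficulty, so that
# higher scores correspond to words known by more people.
prevalence_scores = [-d for d in difficulties]
```

In this toy example the word known by the most participants receives the highest score. A real analysis on sparse data of this size would more likely rely on an established item response theory package than on hand-written gradient ascent.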
           Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
 In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final
                           Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org

Figure 1: The relationship between frequency and prevalence. Word frequency is displayed as Zipf-score (log frequency per billion words; Van Heuven et al., 2014).

Figure 1 shows the complementary relation between the SUBTLEX-NL word frequencies (based on a 42-million-word corpus of film and television subtitles; see Keuleers, Brysbaert, & New, 2010) and the prevalence measure obtained from the online vocabulary test. Higher z-scores indicate more prevalent words. The dark lines in the bottom half of the plot indicate words with singularly low frequencies over a large range of prevalence. The elongated cluster on the right side of the plot shows words with nearly full prevalence over large frequency ranges.

In addition, we investigated the relationship between prevalence and other typical measures of word frequency. Table 1 gives an overview of these correlations.

                      Frequency  Prevalence  OLD20  Length
Frequency                  1.00        0.35  -0.34   -0.37
Prevalence                 0.35        1.00   0.00    0.07
OLD20                     -0.34        0.00   1.00    0.74
Length                    -0.37        0.07   0.74    1.00
Contextual Diversity       0.98        0.36  -0.34   -0.35

Table 1: Correlations between main predictors of Lexical Decision RT in the Dutch Lexicon Project

Table 1 shows that the correlation between prevalence and frequency was relatively low (.34), giving further evidence that prevalence is distinct from word frequency and from contextual diversity (a word's document count), which correlates very highly with word frequency.

Finally, we used the data from the 7,885 items in the Dutch Lexicon Project (Keuleers, Diependaele, & Brysbaert, 2010) for which both frequency and prevalence were available to examine the contributions of Dutch corpus word frequency (SUBTLEX-NL; Keuleers, Brysbaert, & New, 2010) and word prevalence to average reaction times. In single-variable analyses, log word frequency explained 36.13% of the variance in reaction times and prevalence explained 33.03%.

The complementarity of the two measures was also clear when both were entered in the same analysis, where they jointly explained 51.37% of the variance in reaction times. The unique contributions to explained variance (eta-squared) were 27.39% for frequency and 23.87% for prevalence. In further analyses, we found that including the quadratic trend of word frequency and contextual diversity did not substantially alter this pattern of results.

3 Conclusion

The results show that, next to word frequency, prevalence is by far the most important independent contributor to visual word recognition times, suggesting that prevalence should be included in any analysis where word corpus frequency is considered to be relevant. However, several questions remain open. First, what is the influence of corpus size on the relation between corpus word frequency and prevalence and on the contribution of prevalence to lexical processing? Second, how well does prevalence perform on other tasks and in other languages? Finally, does the effect of prevalence on word processing truly lie in a better measurement of word occurrence or does it partly reflect an independent property associated with the learnability of a word?
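The variance figures reported in Section 2 can be cross-checked with a few lines of arithmetic. The sketch below uses only the three percentages of explained variance given in the text. The helper names are hypothetical, and reading the reported "unique contributions (eta-squared)" as partial eta-squared is an inference from the numbers, not something the text states.

```python
# Proportions of variance in reaction times reported in Section 2.
r2_frequency = 0.3613   # log word frequency alone
r2_prevalence = 0.3303  # prevalence alone
r2_joint = 0.5137       # both predictors in the same analysis


def added_variance(r2_full, r2_reduced):
    """Extra variance explained when one predictor is added to the other."""
    return r2_full - r2_reduced


def partial_eta_squared(r2_full, r2_reduced):
    """Added variance as a share of the variance the reduced model left unexplained.

    Assumption: this is the statistic behind the reported 'eta-squared' values.
    """
    return (r2_full - r2_reduced) / (1.0 - r2_reduced)


# Adding prevalence to a frequency-only model, and vice versa.
gain_from_prevalence = added_variance(r2_joint, r2_frequency)
gain_from_frequency = added_variance(r2_joint, r2_prevalence)

partial_prevalence = partial_eta_squared(r2_joint, r2_frequency)
partial_frequency = partial_eta_squared(r2_joint, r2_prevalence)

print(f"gain from adding prevalence: {gain_from_prevalence:.4f}")
print(f"gain from adding frequency:  {gain_from_frequency:.4f}")
print(f"partial eta-squared, prevalence: {partial_prevalence:.4f}")
print(f"partial eta-squared, frequency:  {partial_frequency:.4f}")
```

Under these assumptions, adding prevalence to frequency raises explained variance by about 15 percentage points, in line with the "more than 10%" stated in the abstract, and the two partial values land within rounding distance of the reported 23.87% and 27.39%.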


Acknowledgments
The text of this abstract is an early summary of findings from a larger study reported in the Quarterly Journal of Experimental Psychology as Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment (Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M., in press).

References
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … Treiman, R. (2007).
  The English lexicon project. Behavior Research Methods, 39(3), 445–459.
Keuleers, E., Brysbaert, M., & New, B. (2010).
  SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles.
  Behavior Research Methods, 42(3), 643–650. doi:10.3758/BRM.42.3.643
Keuleers, E., Diependaele, K., & Brysbaert, M. (2010).
  Practice Effects in Large-Scale Visual Word Recognition Studies: A Lexical Decision Study on 14,000 Dutch Mono- and Disyllabic Words and Nonwords.
  Frontiers in Psychology, 1. doi:10.3389/fpsyg.2010.00174
Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2011).
  The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words.
  Behavior Research Methods, 44(1), 287–304. doi:10.3758/s13428-011-0118-4
Keuleers, E., Stevens, M., Mandera, P., & Brysbaert, M. (in press).
  Word knowledge in the crowd: Measuring vocabulary size and word prevalence in a massive online experiment.
  Quarterly Journal of Experimental Psychology. doi:10.1080/17470218.2015.1022560



