ZipfExplorer: A Tool for the Comparison of Shared Lexis

Steven Coats [0000-0002-7295-3893]
English, University of Oulu, 90014 Oulu, Finland
steven.coats@oulu.fi

Abstract. Word frequency statistics and lexical diversity measures can provide insights into discourse differences between texts. The ZipfExplorer, a tool and online app for the interactive visualization and comparison of word frequencies in two texts, shows side-by-side rank-frequency profiles and interactive tables of shared lexis, enabling keyword analysis and shedding light on discourse differences. Four lexical diversity measures (type-token ratio, Gini coefficient, power-law alpha parameter, and Shannon entropy) are calculated for the shared word types. Word frequency information is provided for a selection of mainly literary texts, and users can upload their own files. This paper provides an overview of the visualization of word frequency distributions, describes the functionality of the ZipfExplorer tool and demonstrates some of its features, and briefly discusses the lexical diversity measures calculated by the tool.

Keywords: Word Frequencies, Visualization, Lexical Diversity, Zipf.

1 Introduction

Word frequencies are a fundamental starting point for many analytical procedures in corpus-based linguistic, literary, or cultural analysis and for natural language processing tasks.1 The study of word frequency distributions and their statistical properties continues to be an active topic of research in computational linguistics [2, 3, 4, 5, 6, 7, 8], and in recent years, the analysis of word frequencies has been facilitated by the availability of large corpora and other data sets, by open access to data via platforms such as CLARIN, GitHub, or the Center for Open Science, and by dedicated libraries of scripting functions in popular programming languages such as R or Python [9, 10, 11, 12]. The representation of word frequencies in an interactive visualization format, however, has not generally been a primary focus, despite the fact that interactive visualizations can facilitate exploratory data analysis, enhance pedagogy, and complement the textual presentation of research [13, 14].

In linguistics as well as in literary or cultural studies, the comparison of word frequencies in two texts, or between a selected text and a reference corpus, is a primary method for gaining insight into differences in discourse content. The ZipfExplorer2 is an online tool for the interactive visualization of word frequencies in texts or corpora, named after Zipf's Law [15, 16], the fact that for most longer natural language texts or corpora, the frequency of a given word type is approximately inversely proportional to its rank in a sorted list of the frequencies of word types for the text. The ZipfExplorer provides an interactive means to explore the concept of "keyness" [17, 18], or the extent to which a lexical item occurs more often or less often than would be expected in comparison to a reference text.

1 This paper, an expanded version of [1], includes a more detailed discussion of the Zipf distribution and the lexical diversity measures calculated by the ZipfExplorer. In addition, some code changes have been made to enhance the usability of the tool.
2 https://zipfexplorer.herokuapp.com

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The tool shows word frequency distributions for the textual overlap of two texts, that is, the word types that they share. This aspect of texts may also be of theoretical interest in terms of its relationship to the concept of textual entailment, or recognizing, given two text fragments, whether the meaning of one text can be inferred (entailed) from the other [19], as well as to word error rate and derived measures of textual similarity used in speech recognition [20]. In addition, the tool, built using the Bokeh module in Python [21], calculates several lexical diversity measures (type-token ratio, Gini coefficient, power-law alpha parameter, and Shannon entropy). The code for the tool is publicly available.3

3 https://github.com/stcoats/zipf_explorer

2 Background

Among the first to systematically study lexical type frequencies was the early 20th-century American Germanist George Zipf, who noted that when the words of a text are ordered in decreasing frequency, the relationship between the frequency and the rank for a word of rank $r$ can be expressed as

$f_r \approx C r^{-1}$,

where $C$ represents a constant. A Zipfian rank-frequency profile, when plotted in double logarithmic space, is typically close to a straight line, but the shape of a frequency distribution for only those lexical types that are shared with a comparison text or reference corpus depends not only on the frequency information of the particular texts under consideration, but also on the degree of textual overlap between the two texts. Visualizations of shared lexis, in addition to highlighting discourse similarities and differences between texts through the examination of particular word types, can also give insight into the interplay between frequencies, derived lexical diversity measures, and the shape of discrete frequency distributions in general.

Following Zipf [15, pp. 45–48; 16, p. 25], word frequency distributions are typically displayed in double logarithmic space, with frequency on the y-axis and frequency rank on the x-axis, as in the top right quadrant of Figure 1, which shows four visualizations of the word frequency distribution for Charles Dickens' 1859 novel A Tale of Two Cities. Each circle on the plot corresponds to a distinct word type. The most frequent type, at the top left of the plot, is the word type "the", occurring 8,058 times in the text, followed by "and", "of", and other common words.

Fig. 1. Four representations of word frequency information for A Tale of Two Cities.

The plot in the top left represents the same information in linear space, whereas the lower left plot is the so-called degree distribution (sometimes also referred to as the frequency spectrum): here, the word frequency counts themselves have been binned, so that the top left circle is the proportion of all word types that occur once in the novel (the hapax legomena). Hapax comprise 47% of the word types in the novel; words that occur twice (dis legomena), 16%; and so on. While the information contained in the Zipf rank-frequency plot and the degree distribution plot is equivalent, the latter plot is more difficult to interpret in terms of discourse, as points on the plot do not correspond to individual word types.
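Both of these representations can be derived directly from raw token counts. The following minimal Python sketch (not the tool's own code; the naive tokenization regex and the filename are simplifying assumptions) computes the data behind the rank-frequency plot and the degree distribution of Figure 1:

```python
import re
from collections import Counter

def zipf_profile(text):
    """Return (rank, frequency) pairs sorted by decreasing frequency:
    the data behind a Zipf rank-frequency plot."""
    tokens = re.findall(r"[a-z']+", text.lower())  # naive tokenization
    counts = Counter(tokens)
    return list(enumerate(sorted(counts.values(), reverse=True), start=1))

def frequency_spectrum(text):
    """Bin the frequency counts themselves: the proportion of types that
    occur once (hapax legomena), twice (dis legomena), and so on."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    spectrum = Counter(counts.values())  # frequency -> number of types
    n_types = len(counts)
    return {freq: n / n_types for freq, n in sorted(spectrum.items())}

# Hypothetical filename for a Project Gutenberg plain-text edition:
with open("tale_of_two_cities.txt", encoding="utf-8") as f:
    text = f.read()

print(zipf_profile(text)[:3])  # rank 1 would be "the" for the Dickens novel
print(f"hapax share: {frequency_spectrum(text)[1]:.0%}")  # ~47% of types
```

Plotting either function's output on doubly logarithmic axes yields the corresponding panels of Figure 1; exact values will depend on the tokenization used.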
In the bottom right of Figure 1, the complementary cumulative distribution function is depicted: the cumulative proportion of types with a frequency equal to or greater than a given frequency. Thus, 100% of types in the novel occur at least once, 53% at least twice, 37% at least three times, and so on. The complementary cumulative distribution visualization is the reflection of the Zipf double-logarithmic profile across a line extending from the bottom left to the top right of the subplot. Because the upper two plots in Figure 1 are intuitively easier to understand, the ZipfExplorer uses them to visualize rank-frequency information, rather than the degree distribution or the complementary cumulative distribution.

3 Tool Functionality

In Figure 2, the default linear-scale view for the shared vocabulary types in Mary Shelley's Frankenstein and H. G. Wells' War of the Worlds is depicted: each subplot shows the rank-frequency profile for the text selected via the dropdown menus to the right of the plots. Points on the plots show word relative frequency (per 10,000 words) on the y-axis and type rank in an ordered list of the frequencies of all words in the shared lexis on the x-axis. Values for the lexical diversity measures type-token ratio, Gini coefficient, alpha exponent of the best-fit power-law distribution, and Shannon entropy are shown above the plots. Hovering over a word type will show its rank, frequency, relative frequency, and the log-likelihood measure [22, 23] and associated p-value compared to the shared lexis of the comparison text.

Fig. 2. Default tool view.

Words can be highlighted with a hover tool and selected with a box-drawing tool (in the toolset above the right-hand subplot). Selected words are highlighted in the sortable tables below the plots; clicking on a word in one of the tables highlights it in the plots. The tables show frequency rank, the word form, frequency, relative frequency, difference in relative frequency compared to the other text, and the log-likelihood value: higher log-likelihood values correspond to larger frequency differences between the texts.

The default texts available for comparison are selectable via a drop-down menu to the right of the plots. In addition, users can upload their own texts for comparison with the upload buttons. A 'Remove most frequent words' drop-down list removes 0, 10, 20, 50, 100, or 200 of the most frequent words in English, based on the Project Gutenberg English Corpus from Sketch Engine [24]. As many of the most frequent words are determiners, prepositions, conjunctions, or other function words that bear relatively little semantic information, removing frequent words can help to highlight content and discourse differences between the texts. Below the remove-words drop-down menu, the total number of types and tokens in the original texts is shown, along with the percentage of types that are shared by the two texts. To examine the word frequency distribution of a single text, rather than the distributions of the shared lexis in two texts, the same text can be selected for both plot windows.

The source texts are a selection of mainly literary texts from Project Gutenberg, a corpus of inaugural addresses of U.S. presidents from NLTK [25], the Brown Corpus and its subsections [26], and the Freiburg-Brown Corpus of American English [27].

3.1 Sorting

The columns in the tables below the subplots can be sorted. They show original word order in the left-hand text, word form, rank in the frequency table, relative frequency, difference in relative frequency compared to the other text, and log-likelihood score. Sorting can show items that are much more relatively frequent in one text than in the other; the log-likelihood score quantifies these differences, as in the sketch below.
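As a rough illustration of how such a keyness statistic can be computed, the following function implements the log-likelihood (G2) measure in the manner of Dunning [22] and Rayson and Garside [23]. The counts in the usage example are invented for illustration, not taken from the actual texts:

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) for one word type: its observed frequencies in
    texts A and B compared against the frequencies expected if the type
    were evenly distributed across both texts [22, 23]."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    g2 = 0.0
    if freq_a > 0:
        g2 += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        g2 += freq_b * math.log(freq_b / expected_b)
    return 2 * g2

# Invented counts for a pronoun in two texts of 75,000 and 60,000 tokens;
# the G2 value can be compared against the chi-squared distribution with
# one degree of freedom to obtain a p-value.
print(round(log_likelihood(1700, 75000, 170, 60000), 1))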
In Figure 3, the personal pronouns 'my', 'you', and 'I' are more frequent in Frankenstein, a text with a first-person point of view, than in the third-person War of the Worlds.

Fig. 3. My, you, and I in Frankenstein and War of the Worlds.

The types 'up', 'out', and 'there' (Fig. 4) are relatively more frequent in War of the Worlds; when considered along with other prepositions, place adverbials, and location names, it becomes clear that spatial organization plays a greater role as a narrative element in War of the Worlds than in Frankenstein.

Fig. 4. Up, out, and there in Frankenstein and War of the Worlds.

3.2 Hapax Types

Hapax can also shed light on discourse differences. Highlighting the hapax types in Frankenstein (in the left-hand rank-frequency profile of Fig. 5) shows their ranks and relative frequencies in War of the Worlds: although many are also hapax in the other text, or are found mainly in the tail of the frequency distribution for Frankenstein, the types 'smoke' and 'red' are much higher in the profile of War of the Worlds, a frequency difference that reflects the discourse content of the latter novel.

Fig. 5. Distribution of types that are hapax in Frankenstein.

3.3 Stopword Removal

Using the drop-down menu to the right of the subplots, 10, 20, 50, 100, or 200 of the most frequent types in the Gutenberg corpus can be removed from the visualizations and tables. Because these types, which are mostly function words such as determiners, pronouns, and relativizers or common verbs, structure texts in important ways but contribute relatively little to discourse content, removing them may serve to highlight discourse differences between two texts.

In terms of the distribution shape and the derived lexical diversity statistics, the removal of common words has the effect of increasing the relative frequency of the remaining words. In effect, removing stopwords tends to change the shape of the Zipf profile in double-logarithmic space to a more curvilinear form: when function words are no longer considered, word frequencies deviate substantially from a power-law distribution. As can be expected, removal of common words tends to increase the lexical diversity of the texts for the remaining shared types, which tend to be more uniformly distributed in terms of their relative frequencies.

4 Lexical Diversity

The ZipfExplorer displays four lexical diversity measures: the type-token ratio, the Gini coefficient, the exponent $\alpha$ for the best fit of a power-law distribution, and the Shannon entropy $H$ [3, 4, 28]. These measures, while related, can be used to highlight different aspects of lexical diversity; a sketch of their calculation follows at the end of this section. The type-token ratio,

$\mathrm{TTR} = \frac{\text{number of distinct types}}{\text{number of tokens}}$,

has a range in the interval (0,1], with smaller values indicating less lexical diversity. The Gini coefficient, which for $n$ word types with relative frequencies $x_i$ can be calculated as

$G = \frac{2\sum_{i=1}^{n} i x_i}{n \sum_{i=1}^{n} x_i} - \frac{n+1}{n}$,

ranges from 0 (all types equally frequent, i.e. maximal diversity) to 1 (minimal diversity). The exponent $\alpha$ results from the best-fit line to the degree distribution function (lower left plot in Fig. 1) for the frequency information for the shared lexical types, calculated using the powerlaw package in Python [9] with the equation

$f(n) \propto n^{-\alpha}$.

The alpha parameter is related to the slope $z$ of the Zipf rank-frequency profile by $\alpha = 1 + \frac{1}{z}$ (equivalently, $z = \frac{1}{\alpha - 1}$) [29]. The parameter typically ranges in value between ~1.5 and 3, although higher or lower values are calculated for the shared vocabulary of extremely dissimilar or extremely short texts.4 Shannon entropy [30], calculated as

$H = -\sum_{i=1}^{n} x_i \log_2 x_i$,

has a maximum theoretical value of $\log_2(n)$ for data consisting of $n$ unique types.

4 In these cases, however, word frequencies are unlikely to be distributed according to a power law, and thus the measure is not necessarily a good diversity indicator; see Clauset, Shalizi, and Newman [4].
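The sketch below computes all four measures from the raw frequencies of the shared types, following the formulas above. It uses the powerlaw package [9], which the tool also uses for $\alpha$, although the fitting options shown here are assumptions rather than the tool's own settings:

```python
import numpy as np
import powerlaw  # [9]; pip install powerlaw

def diversity_measures(frequencies):
    """Type-token ratio, Gini coefficient, power-law alpha, and Shannon
    entropy for a list of raw frequencies of the shared word types."""
    freqs = np.sort(np.asarray(frequencies, dtype=float))  # Gini formula assumes ascending order
    n, total = len(freqs), freqs.sum()
    ttr = n / total                          # distinct types / tokens
    ranks = np.arange(1, n + 1)
    gini = 2 * (ranks * freqs).sum() / (n * total) - (n + 1) / n
    x = freqs / total                        # relative frequencies
    entropy = -(x * np.log2(x)).sum()        # maximum is log2(n)
    alpha = powerlaw.Fit(frequencies, discrete=True).power_law.alpha
    return ttr, gini, alpha, entropy

# Toy frequency list (invented); real inputs would be the counts of the
# word types shared by the two texts under comparison.
print(diversity_measures([8058, 4025, 390, 41, 12, 5, 2, 1, 1, 1]))
```

For a toy list this short, the fitted $\alpha$ is not meaningful (cf. footnote 4); the example only illustrates the calculations.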
The diversity statistics calculated by the tool provide evidence for the sensitivity of lexical diversity measures to sample size [2, 4]. For the textual overlap between two texts, or between a text and a corpus, the shorter text will likely exhibit lower Gini values and a higher type-token ratio, whereas the longer text will exhibit a smaller $\alpha$ exponent and a higher $H$ value. Removing frequent words will often increase values for the type-token ratio and the alpha parameter, and decrease the values for the Gini coefficient and the Shannon entropy, although this depends on the texts in question, their original frequencies, and the degree of textual overlap. For texts with a relatively large proportion of shared types, such as two novels by the same author, and with the removal of frequent function words, the lexical diversity measures may give insight into topical diversity in terms of narrative development. For texts that share relatively few types, the relationship between the measure values and the properties of the underlying original texts is less straightforward.

5 Conclusion

The ZipfExplorer enables the interactive exploration of word frequencies in the shared lexis of two comparison texts or corpora, potentially shedding light on discourse similarities and differences and on the properties of frequency distributions. The lexical diversity measures calculated by the tool (type-token ratio, Gini coefficient, alpha parameter of the power-law function, and Shannon entropy) vary according to text length and textual overlap and are also affected by the removal of common function words.

In a pedagogical context, the ZipfExplorer provides a hands-on way to make frequency information concrete. Given the increasing importance of artificial intelligence models not only in linguistics and other sciences, but ultimately in many working-life and administrative domains and in the contexts of daily life, the tool can serve as a starting point for understanding how linguistic frequency distributions underlie the large data sets used to train machine learning models.

The tool may also be useful for the comparison of various discrete distributions in computational studies of language or digital humanities, and for applied analysis in literary, historical, or cultural studies in which "distant reading" approaches are employed. Planned further development includes allowing the upload of different file formats, enabling text extraction from URLs, and enabling automatic annotation of part-of-speech tags or named entities whose frequency distributions may be of interest. It is also hoped that other researchers will use the code for the tool (or parts thereof), available at GitHub, in order to create new and exciting ways to visualize linguistic data such as word frequency information.
References

1. Coats, S.: Comparing word frequencies and lexical diversity with the ZipfExplorer tool. In: Reinsone, S., Skadiņa, I., Baklāne, A., Daugavietis, J. (eds.) Proceedings of the 5th Digital Humanities in the Nordic Countries Conference, Riga, Latvia, October 21–23, 2020, pp. 219–225. CEUR, Aachen, Germany (2020).
2. Baayen, R. H.: Word frequency distributions. Kluwer, Dordrecht (2001).
3. Bérubé, N., Sainte-Marie, M., Mongeon, P., Larivière, V.: Words by the tail: Assessing lexical diversity in scholarly titles using frequency-rank distribution tail fits. PLoS ONE 13(7) (2018).
4. Clauset, A., Shalizi, C. R., Newman, M. E. J.: Power-law distributions in empirical data. SIAM Review 51(4), 661–703 (2009).
5. Lü, L., Zhang, Z.-K., Zhou, T.: Zipf's law leads to Heaps' law: Analyzing their relation in finite-size systems. PLoS ONE 5(12), e14139 (2010). https://doi.org/10.1371/journal.pone.0014139
6. Montemurro, M. A.: Beyond the Zipf-Mandelbrot law in quantitative linguistics. Physica A: Statistical Mechanics and its Applications 300(3–4), 567–578 (2001).
7. Newman, M. E. J.: Power laws, Pareto distributions and Zipf's law. Contemporary Physics 46(5), 323–351 (2005).
8. Piantadosi, S. T.: Zipf's word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review 21(5), 1112–1130 (2014).
9. Alstott, J., Bullmore, E., Plenz, D.: powerlaw: A Python package for analysis of heavy-tailed distributions. PLoS ONE 9(1) (2014).
10. Baayen, R. H., Shafaei-Bajestan, E.: languageR: Analyzing linguistic data: A practical introduction to statistics (R package version 1.5.0). https://CRAN.R-project.org/package=languageR (2019).
11. Evert, S., Baroni, M.: zipfR: Word frequency distributions in R (R package version 0.6-10 of 2017-08-17). In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Posters and Demonstrations Sessions, pp. 29–32. ACL, Stroudsburg, PA (2007).
12. Gillespie, C. S.: Fitting heavy tailed distributions: The poweRlaw package. Journal of Statistical Software 64(2), 1–16. http://www.jstatsoft.org/v64/i02/ (2015).
13. Cleveland, W. S.: Visualizing data. Hobart Press, Summit, NJ (1993).
14. Wilkinson, L.: The grammar of graphics. Springer, New York (2005).
15. Zipf, G. K.: The psycho-biology of language. Routledge, London (1936).
16. Zipf, G. K.: Human behavior and the principle of least effort. Addison-Wesley, Cambridge, MA (1949).
17. Scott, M., Tribble, C.: Textual patterns. John Benjamins, Amsterdam (2006).
18. Stubbs, M.: Three concepts of keywords. In: Bondi, M., Scott, M. (eds.) Keyness in texts, pp. 21–42. John Benjamins, Amsterdam (2010).
19. Androutsopoulos, I., Malakasiotis, P.: A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research 38, 135–187 (2010).
20. Morris, A., Maier, V., Green, P.: From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. In: Proceedings of INTERSPEECH 2004 – ICSLP, 8th International Conference on Spoken Language Processing, pp. 2765–2768 (2004).
21. Bokeh Development Team: Bokeh: Python library for interactive visualization. http://www.bokeh.pydata.org, last accessed 2019/09/30.
22. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19, 61–74 (1993).
23. Rayson, P., Garside, R.: Comparing corpora using frequency profiling. In: WCC '00 Proceedings of the Workshop on Comparing Corpora, pp. 1–6. ACM, New York (2000).
24. Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., Rychlý, P., Suchomel, V.: The Sketch Engine: Ten years on. Lexicography 1, 7–36 (2014).
25. Bird, S., Loper, E., Klein, E.: Natural language processing with Python, updated for NLTK 3.0. O'Reilly, Newton, MA (2019).
26. Francis, W. N., Kučera, H.: A standard corpus of present-day edited American English, for use with digital computers. Brown University, Providence, RI (1979).
27. Hundt, M., Sand, A., Skandera, P.: Manual of information to accompany The Freiburg–Brown Corpus of American English ('Frown'). Department of English, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany (1999).
28. Kunegis, J., Preusse, J.: Fairness on the web: Alternatives to the power law. In: Proceedings of WebSci 2012, June 22–24, 2012, pp. 175–184. ACM, New York (2012).
29. Adamic, L.: Zipf, power-laws, and Pareto: A ranking tutorial. https://www.hpl.hp.com/research/idl/papers/ranking/ranking.html, last accessed 2020/12/04.
30. Shannon, C. E.: A mathematical theory of communication. Bell System Technical Journal 27, 379–423; 623–656 (1948).