Entropy in Legal Language

Roland Friedrich (ETH Zürich, Zürich, Switzerland, roland.friedrich@gess.ethz.ch), Mauro Luzzatto (ETH Zürich, Zürich, Switzerland, mauroluzzatto@hotmail.com), Elliott Ash (ETH Zürich, Zürich, Switzerland, ashe@ethz.ch)

ABSTRACT
We introduce a novel method to measure word ambiguity, i.e. local entropy, based on a neural language model. We use the measure to investigate entropy in the written text of opinions published by the U.S. Supreme Court (SCOTUS) and the German Bundesgerichtshof (BGH), representative courts of the common-law and civil-law court systems respectively. We compare the local (word) entropy measure with a global (document) entropy measure constructed with a compression algorithm. Our method uses an auxiliary corpus of parallel English and German to adjust for persistent differences in entropy due to the languages. Our results suggest that the BGH's texts are of lower entropy than the SCOTUS's. Investigation of low- and high-entropy features suggests that the entropy differential is driven by more frequent use of technical language in the German court.

KEYWORDS
neural language models, NLP, Word2Vec, entropy, civil law, common law, judiciary, comparative law

ACM Reference Format:
Roland Friedrich, Mauro Luzzatto, and Elliott Ash. 2020. Entropy in Legal Language. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020, San Diego, US. ACM, New York, NY, USA, 6 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
The world's legal systems feature two major traditions, which have spread to almost all countries. These systems are the "civil law", as the continuation and refinement of the Roman "jus civile", and the "common law", as it originated in England after the Norman conquest in 1066 [4]. To oversimplify somewhat, a broad distinction between the systems is that in civil law judges make decisions from codified rules, while in common law judges make decisions based on previous decisions.
In civil-law commentaries, cf. e.g. [22], it is argued that the common law lacks a strong principled foundation. On this view, the common law is not systematised and has no general "strategy", but is instead driven by "trial and error" on a case-by-case basis. On the other hand, the common law permits judges to adopt novel, pioneering and innovative ideas or doctrines more easily, and, as Posner [23] argued, it could be economically more efficient. Some evidence suggests that nations that followed the common-law system have had better growth prospects than civil-law countries [15], although whether this effect is causal is not well-established.

A proffered reason for the relative inefficiency of civil-law institutions is that they are too rigid and cannot adapt well to changing circumstances. Code-based decision-making requires complex legislation that is costly to maintain, decipher, apply, and revise. These points are anecdotal, and there is not much good empirical evidence about them. Addressing these issues empirically is difficult because one does not have both common-law and civil-law systems operating in the same country. They also tend to be in different languages: common-law countries tend to be English-speaking, while Latin-language and German-speaking countries tend to have civil law. Perhaps foremost, we lack good measures of the complexity of the law.

Our goal is to produce some new measures of legal complexity in a comparative framework. We draw on recent technologies in neural language modeling to produce a new measure of local entropy at the word level. We then map entropy levels across case texts in an English-speaking common-law court (the U.S. Supreme Court) and a German-speaking civil-law court (the German Bundesgerichtshof). The U.S. Supreme Court (SCOTUS) and the German Bundesgerichtshof (BGH) are the highest courts in their respective legal systems. They are also two of the most influential judiciaries in the broader system of international law. Within the common-law and civil-law traditions, the SCOTUS and BGH are perhaps the most influential high courts of the last century.

We investigate the legal writing style of both the U.S. Supreme Court (SCOTUS) and the Bundesgerichtshof (BGH) from an information-theoretic perspective, based on a neural language model. Concretely, we build our method on top of the Word2Vec model of Mikolov et al. [19], in order to measure empirically the entropy at the token level, i.e. the micro scale.

We ask whether the two legal systems which these courts represent can be discriminated solely on the basis of information-theoretic measures. We find that the BGH tends to have lower entropy than the SCOTUS, reflecting greater use of low-entropy technical language. Finally, in the case of the U.S. Supreme Court we further investigate the temporal evolution of the entropy at both the micro and the macro level, by recording universal compression rates.

2 RELATED WORK

2.1 Entropy in Language
Shannon [27], in his seminal paper "Prediction and Entropy of Printed English", initiated the information-theoretic study of natural languages. Similar to a theoretical-physics approach, Shannon applied the mathematical tools he had previously conceived to understand information. That paper has led to a rich literature on measuring the information content in written and spoken text.

In this literature, a common and useful assumption is that language is regular in the sense that the underlying stochastic data-generating process is both stationary and ergodic, cf. e.g. [9]. Kontoyiannis et al. [14] discuss various estimators for the Shannon entropy rate of a stationary ergodic process and apply them to English texts. Most notable is the Lempel–Ziv [28] algorithm, which consistently estimates the entropy lower bound for stationary ergodic processes.

A recent application of the Lempel-Ziv compression algorithm to compare languages is Montemurro and Zanette [20]. They quantify the contribution of word ordering across different linguistic families to see if different languages have different entropy properties. They find that the Kullback-Leibler divergence (difference in entropy) between shuffled and unshuffled texts is a structural constant across all languages considered.

A complementary paper comparing languages at the word level is Bentz et al. [2].
They undertake a series of computer experiments to measure the word entropy across more than 1000 languages. They use unigram entropies, which they estimate statistically, and find that word entropies follow a narrow unimodal distribution.

Degaetano-Ortlieb and Teich [5] study changes in language entropy over time in a technical setting. They investigate the linguistic development of scientific English by computationally analysing the Royal Society Corpus (RSC) and the Corpus of Late Modern English (CLMET). They consider n-gram language models (for n = 3) and track the temporal changes of the Kullback-Leibler divergence as a measure of local ambiguity. Their main finding is that Scientific English, as it emerged over time, resulted in an increasingly optimised code for written communication by specialists.

2.2 Quantitative Analysis of Law
Our paper adds to the emerging literature in computational legal studies. Exemplary of this literature is Carlson, Livermore and Rockmore [3], who study the writing style of the U.S. Supreme Court. Katz et al. [6] apply machine learning, combined with classical statistical methods, as a novel approach to predict the behaviour of the U.S. Supreme Court in a generalised, out-of-sample context.

Klingenstein, Hitchcock, and DeDeo [12] take an information-theory approach to legal cases. They present a large-scale quantitative analysis of transcripts of London's Old Bailey. They use the Jensen-Shannon divergence to show that trials for violent and nonviolent offenses become increasingly distinct, a divergence that reflects broader cultural shifts starting around 1800.

The use of neural text embeddings in law is illustrated by Ash and Chen [1]. That paper investigates the use of legal language and judicial reasoning in federal appellate courts, using tools from natural language processing (NLP) and dense vector representations. They show that the resulting vector space geometry contains information to distinguish court, time, and legal topics.

The closest paper to ours is Katz and Bommarito [11]. They experiment with a number of methods for measuring complexity in law, applied to U.S. federal statutes. They use measures of language entropy based on word probabilities, but do not use word embeddings.

3 DATA AND METHODS
The code used in this paper is available at: https://github.com/MauroLuzzatto/legal-entropy.

3.1 Data
Our analysis is based on the U.S. Supreme Court decisions from the years 1924 to 2013, and the decisions of the German Bundesgerichtshof (BGH) covering the years 2014 until 2019. We separated the BGH data into rulings of the Zivil- and Strafsenat (civil and criminal chambers).

Additionally, as a baseline, we use Koehn's [13] EuroParl parallel corpus in German and English, consisting of the proceedings of the European Parliament from 1996 to 2006.

Some summary tabulations on the scope of the corpora are reported in Table 1.

Table 1: Details of the corpora

Corpus                Tokens     Sentences
BGH Zivilsenat        30,166     410,612
BGH Strafsenat        11,313     110,645
U.S. Supreme Court    35,060     673,287
EuroParl German       73,439     1,967,341
EuroParl English      43,571     1,967,341

3.2 Pre-Processing
For our analysis we use Python, with spaCy [8] and NLTK [18] as our language processing tools. We apply the standard preprocessing steps in order to train the Word2Vec model in Gensim; for details cf. [24]. As an exception, we did not lemmatise or stem the tokens, and we kept capitalisation. This makes the English and German texts more comparable.

We also used the phraser function from Gensim to treat idiomatic bigrams, such as "New York", and trigrams, such as "New York City", as single tokens.

Deserving special mention is the determination of sentence boundaries, a challenging task in legal writing [26]. We found this to be especially difficult in the BGH civil case corpus, and less pronounced for the U.S. Supreme Court and the EuroParl data. A multitude of abbreviations, dates and, most importantly, statutes involve a "dot", leading to a significant number of erroneous sentence tokens when the standard NLTK sentence tokenizer is naively applied. Therefore, before using nltk.sent_tokenize we removed all "dots" which do not indicate a sentence boundary, by compiling a look-up table and using it in conjunction with regular expression operations (RegEx).
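The following is a minimal sketch of this dot-removal step, assuming a hypothetical excerpt of the look-up table (the actual table is corpus-specific and part of the repository linked above):

```python
import re
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

# Hypothetical excerpt of the look-up table: strings whose trailing dot
# does not mark a sentence boundary (abbreviations, statute citations, etc.).
NON_BOUNDARY_DOTS = ["Abs.", "Nr.", "Rn.", "vgl.", "No.", "Inc.", "Stat."]

def strip_non_boundary_dots(text: str) -> str:
    """Remove dots that do not indicate a sentence boundary."""
    for abbreviation in NON_BOUNDARY_DOTS:
        # e.g. "Abs." -> "Abs", so the tokenizer no longer splits there
        text = re.sub(re.escape(abbreviation), abbreviation.rstrip("."), text)
    return text

def split_sentences(text: str, language: str = "german") -> list:
    """Sentence-tokenize the cleaned text with NLTK."""
    return sent_tokenize(strip_non_boundary_dots(text), language=language)

print(split_sentences("Das Berufungsgericht verweist auf Abs. 2 Nr. 3. Das Urteil wird aufgehoben."))
```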
3.3 Measuring Local Entropy using a Neural Language Model
To train word embeddings we use Gensim's [24] Word2Vec implementation. Word2Vec is a popular word embedding algorithm which uses a neural language model to predict local word co-occurrence. A vector of predictive weights is learned, during the model training, for each word in the vocabulary. These weight vectors can be interpreted as the geometric location of the word in a semantic space, where words that are near each other in the space are semantically related.

There are two architectural versions of Word2Vec, CBOW and SkipGram. Simplified, in a CBOW model the neighbouring context words are embedded to predict a left-out target word. In a SkipGram model, the target word is embedded to predict whether a paired word is sampled from the context or randomly sampled from outside the context.

Once trained, the Word2Vec model gives a predicted probability distribution across words given a context. Out of the box, Gensim offers, for the CBOW model, a command which yields the probability of a word being the centre (target) word, given the specified context words. For the purposes of this project, we implemented the SkipGram version with hierarchical softmax. This model can be considered as the (neural) generalisation of the classical n-gram. It serves as our basis for determining the local entropies.
Footnote 1: For a detailed discussion of predicting a context word from a target word, see https://stackoverflow.com/questions/45102484/predict-middle-word-word2vec.

The window size is a hyperparameter. Larger windows capture more semantic relations, whereas smaller windows tend to convey syntactic information [10]. Our experiments showed that SkipGram with a small context (window) size, e.g. |c| = 2, gave better results than the default window size (|c| = 5).
Footnote 2: A recent experimental study of SkipGram models by Lison and Kutuzov [17] found that, for semantic similarity tasks, right-side contexts are more important than left-side contexts, at least for English, and that the average model performance was not significantly influenced by the removal of stop words.

For the discussion of the local entropy calculation and its implementation, cf. Appendix A. For the Kolmogorov-Smirnov test we used SciPy.
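As a reference point, the following is a minimal sketch of how a SkipGram model with hierarchical softmax and these hyperparameters (window |c| = 2, N = 300 dimensions, 30 epochs; see Appendix A.3) can be trained with Gensim; the toy corpus and the min_count cutoff are placeholders, and the parameter names follow Gensim 4.x (older versions use size/iter):

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, the preprocessed court sentences from Section 3.2.
sentences = [
    ["the", "court", "remands", "the", "case"],
    ["the", "court", "affirms", "the", "judgment"],
]

model = Word2Vec(
    sentences=sentences,
    sg=1,             # SkipGram architecture
    hs=1,             # hierarchical softmax
    negative=0,       # disable negative sampling
    window=2,         # small context window |c| = 2
    vector_size=300,  # N = 300 embedding dimensions
    epochs=30,        # 30 training epochs
    min_count=1,      # 1 only so the toy corpus is not filtered out
)
print(model.wv["court"].shape)  # (300,)
```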
3.4 Measuring Global Entropy using Lempel-Ziv Compression
The second entropy measure we compute uses the Lempel-Ziv algorithm for sequential data. First, we compress the raw text using the gzip compression module interface in Python, with the compression level set to its maximum value (= 9).

We define the compression ratio r_i of an individual text txt_i as

    r_i := |txt_i| / |gzip(txt_i)|,

where |·| denotes the size as measured in bits. The inverse ratio r_i^{-1} yields the size of the compressed file as a fraction of the original file. Note that r_i > 0 for all documents i, and equivalently for the entire corpus. When considering compression rates for individual texts and for the entire corpus, one should keep in mind the sub-additivity of the Shannon entropy.
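A minimal sketch of this inverse compression ratio, assuming sizes measured in bytes (which leaves the ratio unchanged relative to bits):

```python
import gzip

def inverse_compression_ratio(text: str) -> float:
    """r_i^{-1} = |gzip(txt_i)| / |txt_i|, with maximum compression level 9."""
    raw = text.encode("utf-8")
    compressed = gzip.compress(raw, compresslevel=9)
    return len(compressed) / len(raw)

# Lower values indicate more structure/predictability in the text.
opinion = "The judgment of the Court of Appeals is reversed. " * 100
print(round(inverse_compression_ratio(opinion), 3))
```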
4 RESULTS

4.1 Local Entropy of Words
Our first analysis compares the distributions of the word entropies across the different corpora. We would like to determine the differences in the distribution of the local entropy values of the language used by the BGH's Straf- and Zivilsenat and the U.S. Supreme Court. To this end, Figure 1 plots the respective empirical cumulative distribution functions ECDF_BGH-Z, ECDF_BGH-Str and ECDF_SC.

[Figure 1: Empirical cumulative distribution functions (ECDF) of the local entropy values for the BGH's Straf- and Zivilsenat and the U.S. Supreme Court, displaying the civil law-common law hysteresis.]

As can be seen in the figure, in the interval [0, 4] the distributions of the BGH's criminal chamber and the U.S. Supreme Court are similar, whereas for entropy values t ≥ 4 we find that ECDF_BGH-Str(t) > ECDF_SC(t), i.e. the Strafsenat's curve lies strictly above the U.S. Supreme Court's. Comparing the Zivilsenat to the U.S. Supreme Court, we find that the difference between the two ECDF curves is always strictly positive, i.e. ECDF_BGH-Z(t) − ECDF_SC(t) > 0 for every t ∈ [0, max(entropy(BGH-Z))].

4.2 Adjusting for English-German Language Differences
We use the EuroParl German corpus and its aligned English translation as a baseline for two reasons. First, we want to gauge the quality of our local entropy method. Second, we would like to disentangle language-specific effects, i.e. English vs. German, when comparing the U.S. Supreme Court to the BGH.

Figure 2 demonstrates how the method behaves across languages, using the parallel, sentence-aligned EuroParl German and English corpora. As predicted by theory for a good translation, our method yields two nearly identical probability distributions (Left Panel). As seen in the Right Panel, the empirical cumulative distribution functions of the local entropies are also very similar. It would be interesting to further study the influence of n-grams on the local entropy distribution of translations.

[Figure 2: Left Panel: Probability distributions of the local entropy values of the European Parliament's German proceedings (EuroParl de) and of its English translation (EuroParl en). Right Panel: Empirical cumulative distribution functions (ECDF) of the local entropy values for the BGH's Straf- and Zivilsenat, the U.S. Supreme Court, EuroParl German, and EuroParl English.]

We quantified the distance between the empirical distribution functions of the EuroParl English and German corpora via the two-sided Kolmogorov–Smirnov test [7]. The null hypothesis H_0 states that the two observed and stochastically independent samples are drawn from the same (continuous) distribution. We calculated the value of the ECDF in steps of 1/10 on the interval [0, 16], i.e. the range of the entropy values. The resulting D-statistic is 0.069 with a two-tailed p-value of 0.843; therefore we cannot reject H_0.

Second, the comparison with the baseline suggests that, as we hypothesised, the (one might even argue scientific) use of German and English, respectively, in the courts has significantly less local entropy than the more colloquial and non-technical use of the language in political speeches. This results in the strict local ambiguity order

    ECDF_BGH-Z ≺ ECDF_BGH-Str ≺ ECDF_SC ≺ ECDF_EP-de, with ECDF_EP-de ∼ ECDF_EP-en.
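A minimal sketch of the Kolmogorov–Smirnov comparison above, using placeholder samples in place of the per-token entropy values from the two EuroParl models:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder data: in the paper these are the local entropy values of the
# EuroParl German and English vocabularies, ranging over roughly [0, 16].
rng = np.random.default_rng(0)
entropies_de = rng.gamma(shape=4.0, scale=1.5, size=5000)
entropies_en = rng.gamma(shape=4.0, scale=1.5, size=5000)

# Two-sided, two-sample Kolmogorov-Smirnov test; H0: both samples are drawn
# from the same continuous distribution.
statistic, p_value = ks_2samp(entropies_de, entropies_en)
print(f"D = {statistic:.3f}, p = {p_value:.3f}")
```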
4.3 Global Entropy of Documents
Now we produce the more global measure of entropy using the compression-based measure. We estimated the macroscopic entropy of the different corpora by compressing the entire raw text file of each and then calculating the corresponding inverse compression ratios, as described above. A higher value means that the corpus has higher entropy per segment of text. Put differently, a lower value means that there is relatively more structure or predictability in the underlying text features.

Table 2 reports the compression ratios for each corpus. As before, the values for the EuroParl corpora are almost identical, and they have the highest entropy rate. This likely reflects the broader diversity of issues covered in EuroParl relative to the law. The U.S. Supreme Court corpus has a slightly lower entropy rate. Meanwhile, the BGH's Strafsenat and Zivilsenat corpora yield substantially lower values, with the BGH's civil chamber having the lowest ratio of 0.283.

Table 2: Inverse Compression Ratio Entropy, by Corpus. See Subsection 3.4 for method details.

Corpus                Inverse Compression Ratio
EuroParl German       0.323
EuroParl English      0.322
U.S. Supreme Court    0.316
BGH Strafsenat        0.300
BGH Zivilsenat        0.283

Next, we show how entropy varies over time in the SCOTUS data. Fig. 3 shows the inverse compression ratio entropy measures for the records of the U.S. Supreme Court in the last century. We can see that entropy has decreased since the 1950s, indicating an increase in the relative structure or predictability of the text.

[Figure 3: Per-document inverse gzip compression ratio of the U.S. Supreme Court for the period 1924 until 2013 (a higher value means higher entropy).]

This trend can be interpreted as a more formalised and standardised writing style. The shift could be due to the ongoing expansion of administrative (statutory) law in the U.S. system. Once statutes are extensively used, the need for efficient methods of referral emerges, e.g. [§§, articles, sections, lit., ...], leading to a cryptic, pseudocode-like style of writing. This code-like, technical style was already extensively used by the BGH and the French Court of Cassation.

4.4 Low-Entropy Words are Functional
To further substantiate the above ideas, we selected from each corpus (SCOTUS, BGH Zivil- and Strafsenat, EuroParl German and English) the tokens with the lowest local entropy values (≤ 1). Fig. 4 includes word clouds for the lowest-entropy words in our vocabulary.

[Figure 4: Word clouds for lowest-entropy words. Top left: EuroParl German. Top right: EuroParl English. Bottom left: BGH Zivilsenat. Bottom right: U.S. Supreme Court (SCOTUS).]

For the BGH (bottom left) one recognises key phrases from procedural law, such as 'zurückverweisen' (to remand a case). We see technical language for civil cases, such as 'Insolvenzverfahrens' (insolvency proceeding). For the SCOTUS, we see procedural, criminal and civil technical phrases such as 'beyond reasonable' and 'qualified immunity'. For the EuroParl data, the dominating lowest-entropy phrases are procedural and related to the Parliament's sessions, such as the German 'siehe_Protokoll', which corresponds to the English 'see_Minutes'.

The very low-entropy words serve as functional foundations that typify the respective environment and set the tone. These recurring phrases have a very precise meaning, as the human reader recognises and as quantitatively reflected in our neural model.

An in-depth analysis of the precise distribution of the local entropies along the different linguistic axes, and the broader syntactic and semantic categories, is left for a separate publication.
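A minimal sketch of the selection step behind Figure 4, assuming a dictionary that maps each token to its local entropy as computed in Appendix A (the values here are placeholders):

```python
# Placeholder mapping from token to local entropy H(w); in practice this is
# computed for the whole vocabulary of each trained Word2Vec model.
entropy_by_token = {
    "zurückverweisen": 0.4,
    "Insolvenzverfahrens": 0.8,
    "qualified_immunity": 0.7,
    "Gericht": 5.2,
}

# Tokens with local entropy <= 1 are the "functional", highly predictable
# phrases visualised in the word clouds of Figure 4.
low_entropy_tokens = {w: h for w, h in entropy_by_token.items() if h <= 1.0}
print(sorted(low_entropy_tokens, key=low_entropy_tokens.get))
```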
5 CONCLUSION
Our analysis has shown that the writing style of the civil law has lower relative entropy than that of the common law, at least in the important cases of the SCOTUS and the BGH. We have shown this for two measures: first, local ambiguity, i.e. word entropy, produced using a neural language model, and second, global entropy produced from a compression-ratio algorithm. Civil- and common-law writing styles are thus distinguishable on a purely information-theoretic basis.

The results are helpful from the perspectives of history and social science. The original German legal doctrine is very much rooted in jurisprudence and has been strongly influenced, especially since the second half of the 19th century, by the development of the natural sciences. This systematic approach is reflected in the writing style. Code-based legal writing requires, as argued above, efficient and standardised mechanisms of referencing, common to all scientific writing.

Our method innovates by using a neural language model, combined with data compression algorithms, in order to empirically determine both word and stylistic ambiguity, i.e. local and global entropy. This approach proves to be fruitful and could integrate naturally into future enhancements of (deeper) neural language models. In future work these could provide an even finer spatio-temporal resolution of how information is distributed across different linguistic scales and over time, ranging from the word to the corpus level.

In summary, our implementation and use of a local entropy measure based on a neural language model has led to striking results that contribute to an old debate on legal traditions. The contribution could be important from both a linguistic and a legal perspective. We foresee a broad range of further applications.

REFERENCES
[1] Elliott Ash and Daniel L. Chen. 2018. Mapping the Geometry of Law Using Document Embeddings.
[2] Christian Bentz, Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i-Cancho. 2017. The Entropy of Words—Learnability and Expressivity across More than 1000 Languages. Entropy 19, 6 (2017), 275. DOI: http://dx.doi.org/10.3390/e19060275
[3] Keith Carlson, Michael A. Livermore, and Daniel Rockmore. 2015-2016. A Quantitative Analysis of Writing Style on the U.S. Supreme Court. Washington University Law Review 93 (2015-2016), 1461.
[4] Joseph Dainow. 1966. The Civil Law and the Common Law: Some Points of Comparison. The American Journal of Comparative Law 15, 3 (1966), 419–435. http://www.jstor.org/stable/838275
[5] Stefania Degaetano-Ortlieb and Elke Teich. 2019. Toward an optimal code for communication: The case of scientific English. Corpus Linguistics and Linguistic Theory (2019). https://www.degruyter.com/view/journals/cllt/ahead-of-print/article-10.1515-cllt-2018-0088/article-10.1515-cllt-2018-0088.xml
[6] Daniel Martin Katz, Michael J. Bommarito II, and Josh Blackman. 2017. A general approach for predicting the behavior of the Supreme Court of the United States. PLoS ONE 12, 4 (2017). https://doi.org/10.1371/journal.pone.0174698
[7] J. L. Hodges. 1958. The significance probability of the Smirnov two-sample test. Ark. Mat. 3, 5 (1958), 469–486. DOI: http://dx.doi.org/10.1007/BF02589501
[8] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. (2017). To appear.
[9] D. Jurafsky and J. H. Martin. 2019. Speech and Language Processing (3rd ed., draft). https://web.stanford.edu/~jurafsky/slp3/
[10] U. Kamath, J. Liu, and J. Whitaker. 2019. Deep Learning for NLP and Speech Recognition. Springer International Publishing. https://books.google.ch/books?id=8cmcDwAAQBAJ
[11] Daniel Martin Katz and Michael James Bommarito. 2014. Measuring the complexity of the law: the United States Code. Artificial Intelligence and Law 22, 4 (2014), 337–374.
[12] Sara Klingenstein, Tim Hitchcock, and Simon DeDeo. 2014. The civilizing process in London's Old Bailey. Proceedings of the National Academy of Sciences 111, 26 (2014), 9419–9424. DOI: http://dx.doi.org/10.1073/pnas.1405984111
[13] Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: The Tenth Machine Translation Summit. AAMT, Phuket, Thailand, 79–86. http://mt-archive.info/MTS-2005-Koehn.pdf
[14] I. Kontoyiannis, P. H. Algoet, Y. M. Suhov, and A. J. Wyner. 1998. Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. IEEE Transactions on Information Theory 44, 3 (1998), 1319–1327.
[15] Rafael La Porta, Florencio Lopez-de-Silanes, and Andrei Shleifer. 2008. The economic consequences of legal origins. Journal of Economic Literature 46, 2 (2008), 285–332.
[16] Omer Levy and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, 302–308. DOI: http://dx.doi.org/10.3115/v1/P14-2050
[17] Pierre Lison and Andrey Kutuzov. 2017. Redefining Context Windows for Word Embedding Models: An Experimental Study. In Proceedings of the 21st Nordic Conference on Computational Linguistics. Association for Computational Linguistics, Gothenburg, Sweden, 284–288. https://www.aclweb.org/anthology/W17-0239
[18] Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Association for Computational Linguistics, Philadelphia.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
[20] M. A. Montemurro and D. H. Zanette. 2011. Universal Entropy of Word Ordering Across Linguistic Families. PLoS ONE 6, 5 (2011).
[21] Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Robert G. Cowell and Zoubin Ghahramani (Eds.). Society for Artificial Intelligence and Statistics, 246–252. http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf
[22] Marcel Alexander Niggli and Louis Frédéric Muskens. 2014. BSK StGB-Niggli/Muskens, Art. 11. In Schweizerische Strafprozessordnung/Jugendstrafprozessordnung (StPO/JStPO) (2nd ed.), Marianne Heer, Marcel Alexander Niggli, and Hans Wiprächtiger (Eds.). Vol. 1. Helbing & Lichtenhahn, 3501.
[23] R. A. Posner. 2003. Economic Analysis of Law. Aspen Publishers. https://books.google.ch/books?id=gyUkAQAAIAAJ
[24] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en
[25] Xin Rong. 2014. word2vec Parameter Learning Explained. arXiv:1411.2738 (2014). http://arxiv.org/abs/1411.2738
[26] George Sanchez. 2019. Sentence Boundary Detection in Legal Text. In Proceedings of the Natural Legal Language Processing Workshop 2019. Association for Computational Linguistics, Minneapolis, Minnesota, 31–38. DOI: http://dx.doi.org/10.18653/v1/W19-2204
[27] C. E. Shannon. 1951. Prediction and Entropy of Printed English. Bell System Technical Journal 30, 1 (1951), 50–64. DOI: http://dx.doi.org/10.1002/j.1538-7305.1951.tb01366.x
[28] J. Ziv and A. Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 3 (1977), 337–343.

A THEORY
Here we give a theoretical description of the steps underlying our approach.

A.1 Preprocessing
Let C be a non-empty set, the corpus. For n ∈ ℕ, consider the map

    π_n : C → V_n,

where V_n is the, possibly empty, set of n-grams associated to C, and the V_n satisfy V_k ∩ V_l = ∅ for l ≠ k. Usually, the set of unigrams V_1 is called the vocabulary of the corpus C. For a fixed ν ∈ ℕ, set

    𝒱_ν := ⋃_{n=1}^{ν} V_n,

which is the set of (two-sided) uni-, bi-, tri- up to ν-grams and which, for ν large enough, yields an approximation (or pairwise disjoint decomposition) of the corpus C that captures both syntactic and semantic information. Then 𝒱_ν is the (generalised) vocabulary up to order ν.
Footnote 3: More general, i.e. functional, neighbourhoods are of course possible, e.g. based on grammatical information, as considered by Levy and Goldberg [16].

The elements w ∈ 𝒱_ν, or 𝒱 if ν is fixed and clear from the context, are tokens or n-grams, which might be considered as n-order words. We denote by |𝒱| the size of 𝒱, i.e. the number of pairwise different tokens. The family of maps π_n, and hence the specific sets V_n, determine the preprocessing of the corpus data.
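A minimal sketch of how a generalised vocabulary up to trigrams (ν = 3) can be materialised with Gensim's phrase detection, as used in Section 3.2; the toy corpus and the min_count/threshold settings are illustrative only:

```python
from gensim.models.phrases import Phrases

# Toy corpus; in practice, the preprocessed court sentences from Section 3.2.
sentences = [
    ["the", "New", "York", "City", "court"],
    ["New", "York", "City", "officials"],
    ["the", "New", "York", "legislature"],
]

# A first pass joins frequent bigrams into single tokens (e.g. "New_York"); a
# second pass over the transformed corpus can then join bigram + unigram into
# trigrams (e.g. "New_York_City"). On a corpus this small, other frequent
# pairs will be joined as well; realistic settings use higher cutoffs.
bigram = Phrases(sentences, min_count=1, threshold=1.0)
trigram = Phrases(bigram[sentences], min_count=1, threshold=1.0)

# Token streams over the generalised vocabulary (uni-, bi- and trigrams).
corpus_ngrams = [trigram[bigram[sentence]] for sentence in sentences]
print(corpus_ngrams)
```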
A.2 Local Entropy from Word2Vec
The word2vec framework consists of a bundle of mathematical objects [19, 25]. First, it defines a dense Hilbert space representation,

    word2vec : 𝒱 → ℝ^N,  w ↦ h_w,

where N ∈ ℕ is the dimension of the coordinate space, which is a hyper-parameter of the model. Let 𝔓(𝒱) denote the set of discrete probability distributions on 𝒱. Then there exists a map

    f_w2v : 𝒱 → 𝔓(𝒱),  w ↦ μ_w,

which associates to every token w a probability distribution μ_w, namely the posterior (multinomial) distribution. The local entropy or ambiguity is the map

    H : 𝒱 → ℝ_+,  w ↦ H(μ_w),

which assigns to every token w the Shannon entropy of the corresponding probability distribution μ_w. The posterior distribution is given by a Boltzmann distribution (softmax).

It is calculated as follows. Let W be the |𝒱| × N input weight matrix from the input layer to the hidden layer and W̃ the N × |𝒱| weight matrix from the hidden layer to the output layer in the SkipGram model with hierarchical softmax. Every token w_i ∈ 𝒱 determines a pair of vectors (v_i, ṽ_i), the input vector v_i and the output vector ṽ_i, which are given by the i-th row of W and the i-th column of W̃, respectively. Let

    Z_i := Σ_{j=1}^{|𝒱|} e^{⟨ṽ_j | v_i⟩}    (1)

be the local partition function corresponding to the target w_i, with the sum taken over all tokens w_j ∈ 𝒱. (We use the bra-ket notation.)

For the SkipGram model with context c, the probability μ_{w_i}(w_o) of a token w_o being an actual c-context output word of w_i is given by

    p(w_o | w_i) := μ_{w_i}(w_o) := (1/Z_i) e^{⟨ṽ_o | v_i⟩}.    (2)

Therefore, the local entropy of the target w_i (with context c) is given by

    H(w_i) := H(μ_{w_i}) = − Σ_{j=1}^{|𝒱|} p(w_j | w_i) · log_2 p(w_j | w_i).    (3)

A.3 Gensim Implementation
We implemented our local entropy calculation for the SkipGram model in Gensim, with the following parameters: context window = 2, N = 300, and 30 training epochs with hierarchical softmax [21]. The output weight matrix W̃ and the input weight matrix W are stored by Gensim as syn1 (for hierarchical softmax) and syn0, respectively. Note that if negative sampling is used, the output weights are stored in syn1neg.
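A minimal sketch of Equations (1)–(3) on top of a trained model, assuming recent Gensim attribute names (wv.vectors for the input matrix formerly called syn0, and syn1 for the hierarchical-softmax output weights; note that in current Gensim the rows of syn1 index tree nodes, so treating them as per-token output vectors is a simplification of the computation described above):

```python
import numpy as np
from gensim.models import Word2Vec

def local_entropy(model: Word2Vec, token: str) -> float:
    """Shannon entropy H(mu_w) of the softmax distribution in Eqs. (1)-(3)."""
    v_i = model.wv[token]            # input vector v_i (a row of W)
    scores = model.syn1 @ v_i        # inner products <v~_j | v_i>
    scores -= scores.max()           # numerical stabilisation of the softmax
    p = np.exp(scores)
    p /= p.sum()                     # Eq. (2): divide by the partition function Z_i
    return float(-np.sum(p * np.log2(np.clip(p, 1e-12, None))))  # Eq. (3)

# Usage with a model trained as in Section 3.3 / Appendix A.3, e.g.:
# model = Word2Vec(sentences, sg=1, hs=1, negative=0, window=2, vector_size=300, epochs=30)
# print(local_entropy(model, "court"))
```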