Entropy in Legal Language

Roland Friedrich (ETH Zürich, Zürich, Switzerland, roland.friedrich@gess.ethz.ch), Mauro Luzzatto (ETH Zürich, Zürich, Switzerland, mauroluzzatto@hotmail.com), Elliott Ash (ETH Zürich, Zürich, Switzerland, ashe@ethz.ch)

ABSTRACT
We introduce a novel method to measure word ambiguity, i.e. local entropy, based on a neural language model. We use the measure to investigate entropy in the written text of opinions published by the U.S. Supreme Court (SCOTUS) and the German Bundesgerichtshof (BGH), representative courts of the common-law and civil-law court systems respectively. We compare the local (word) entropy measure with a global (document) entropy measure constructed with a compression algorithm. Our method uses an auxiliary corpus of parallel English and German to adjust for persistent differences in entropy due to the languages. Our results suggest that the BGH's texts are of lower entropy than the SCOTUS's. Investigation of low- and high-entropy features suggests that the entropy differential is driven by more frequent use of technical language in the German court.

KEYWORDS
neural language models, NLP, Word2Vec, entropy, civil law, common law, judiciary, comparative law

ACM Reference Format:
Roland Friedrich, Mauro Luzzatto, and Elliott Ash. 2020. Entropy in Legal Language. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020, San Diego, US. ACM, New York, NY, USA, 6 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
The world's legal systems feature two major traditions, which have spread to almost all countries. These systems are the "civil law", as the continuation and refinement of the Roman "jus civile", and the "common law", as it originated in England after the Norman conquest in 1066 [4]. To oversimplify somewhat, a broad distinction between the systems is that in civil law judges make decisions from codified rules, while in common law judges make decisions based on previous decisions.
In civil-law commentaries, cf. e.g. [22], it is argued that the common law lacks a strong principled foundation. On this view, the common law is not systematised and has no general "strategy", but is instead driven by "trial and error" on a case-by-case basis. On the other hand, the common law permits judges to adopt novel, pioneering and innovative ideas or doctrines more easily, and, as Posner [23] argued, it could be economically more efficient. Some evidence suggests that nations that followed the common-law system have had better growth prospects than civil-law countries [15], although whether this effect is causal is not well-established.

A proffered reason for the relative inefficiency of civil-law institutions is that they are too rigid and cannot adapt well to changing circumstances. Code-based decision-making requires complex legislation that is costly to maintain, decipher, apply, and revise. These points are anecdotal, and there is not much good empirical evidence about them. Addressing these issues empirically is difficult because one does not have both common-law and civil-law systems operating in the same country. They also tend to be in different languages: common-law countries tend to be English-speaking, while Latin-language and German-speaking countries tend to have civil law. Perhaps foremost, we lack good measures of the complexity of the law.

Our goal is to produce some new measures of legal complexity in a comparative framework. We draw on recent technologies in neural language modeling to produce a new measure of local entropy at the word level. We then map entropy levels across case texts in an English-speaking common-law court (the U.S. Supreme Court) and a German-speaking civil-law court (the German Bundesgerichtshof). The U.S. Supreme Court (SCOTUS) and the German Bundesgerichtshof (BGH) are the highest courts in their respective legal systems. They are also two of the most influential judiciaries in the broader system of international law. Within the common-law and civil-law traditions, the SCOTUS and BGH are perhaps the most influential high courts of the last century.

We investigate the legal writing style of both the U.S. Supreme Court (SCOTUS) and the Bundesgerichtshof (BGH) from an information-theoretic perspective, based on a neural language model. Concretely, we build our method on top of the Word2Vec model of Mikolov et al. [19], in order to measure empirically the entropy at the token level, i.e. the micro scale.

We ask whether the two legal systems which these courts represent can be discriminated solely on the basis of information-theoretic measures. We find that the BGH tends to have lower entropy than the SCOTUS, reflecting greater use of low-entropy technical language. Finally, in the case of the U.S. Supreme Court we further investigate the temporal evolution of the entropy at both the micro and the macro level, by recording universal compression rates.

2 RELATED WORK

2.1 Entropy in Language
Shannon [27], in his seminal paper "Prediction and Entropy of Printed English", initiated the information-theoretic study of natural languages. Similar to a theoretical-physics approach, Shannon applied the mathematical tools he had previously conceived to understand information. That paper has led to a rich literature on measuring the information content in written and spoken text.

In this literature, a common and useful assumption is that language is regular in the sense that the underlying stochastic data-generating process is both stationary and ergodic, cf. e.g. [9]. Kontoyiannis et al. [14] discuss various estimators for the Shannon entropy rate of a stationary ergodic process and apply them to English texts. Most notable is the Lempel–Ziv [28] algorithm, which consistently estimates the entropy lower bound for stationary ergodic processes.

A recent application of the Lempel-Ziv compression algorithm to compare languages is Montemurro and Zanette [20]. They quantify the contribution of word ordering across different linguistic families to see if different languages have different entropy properties. They find that the Kullback-Leibler divergence (difference in entropy) between shuffled and unshuffled texts is a structural constant across all languages considered.

A complementary paper comparing languages at the word level is Bentz et al. [2].
They undertake a series of computer experiments to measure the word entropy across more than 1000 languages. They use unigram entropies, which they estimate statistically, and find that word entropies follow a narrow unimodal distribution.

Degaetano-Ortlieb and Teich [5] study changes in language entropy over time in a technical setting. They investigate the linguistic development of scientific English by computationally analysing the Royal Society Corpus (RSC) and the Corpus of Late Modern English (CLMET). They consider n-gram language models (for n = 3) and track the temporal changes of the Kullback-Leibler divergence as a measure of local ambiguity. Their main finding is that Scientific English, as it emerged over time, resulted in an increasingly optimised code for written communication by specialists.

2.2 Quantitative Analysis of Law
Our paper adds to the emerging literature in computational legal studies. Exemplary of this literature is Carlson, Livermore and Rockmore [3], who study the writing style of the U.S. Supreme Court. Katz et al. [6] apply machine learning, combined with classical statistical methods, as a novel approach to predict the behaviour of the U.S. Supreme Court in a generalised, out-of-sample context.

Klingenstein, Hitchcock, and DeDeo [12] take an information-theory approach to legal cases. They present a large-scale quantitative analysis of transcripts of London's Old Bailey. They use the Jensen-Shannon divergence to show that trials for violent and nonviolent offenses become increasingly distinct, a divergence that reflects broader cultural shifts starting around 1800.

The use of neural text embeddings in law is illustrated by Ash and Chen [1]. That paper investigates the use of legal language and judicial reasoning in federal appellate courts, using tools from natural language processing (NLP) and dense vector representations. They show that the resulting vector space geometry contains information to distinguish court, time, and legal topics.

The closest paper to ours is Katz and Bommarito [11]. They experiment with a number of methods for measuring complexity in law, applied to U.S. federal statutes. They use measures of language entropy based on word probabilities, but do not use word embeddings.

3 DATA AND METHODS
The code used in this paper is available at: https://github.com/MauroLuzzatto/legal-entropy.

3.1 Data
Our analysis is based on the U.S. Supreme Court decisions from the years 1924 to 2013, and the decisions of the German Bundesgerichtshof (BGH) covering the years 2014 until 2019. We separated the BGH data into rulings of the Zivil- and Strafsenat (civil and criminal chambers).

Additionally, as a baseline, we use Koehn's [13] EuroParl parallel corpus in German and English, consisting of the proceedings of the European Parliament from 1996 to 2006.

Some summary tabulations on the scope of the corpora are reported in Table 1.

Table 1: Details of the corpora

Corpus                Tokens     Sentences
BGH Zivilsenat        30,166     410,612
BGH Strafsenat        11,313     110,645
U.S. Supreme Court    35,060     673,287
EuroParl German       73,439     1,967,341
EuroParl English      43,571     1,967,341

3.2 Pre-Processing
For our analysis we use Python, with spaCy [8] and NLTK [18] as our language processing tools. We apply the standard preprocessing steps in order to train the Word2Vec model in Gensim; for details cf. [24]. As an exception, we did not lemmatise or stem the tokens, and we kept capitalisation. This makes the English and German texts more comparable.

We also used the phraser function from Gensim to treat idiomatic bigrams, such as "New York", and trigrams, such as "New York City", as single tokens.

Deserving special mention is the determination of sentence boundaries, a challenging task in legal writing [26]. We found this to be especially difficult in the BGH civil case corpus, and less pronounced for the U.S. Supreme Court and the EuroParl data. A multitude of abbreviations, dates and, most importantly, statutes involve a "dot", leading to a significant number of erroneous sentence tokens when the standard NLTK sentence tokenizer is naively applied. Therefore, before using nltk.sent_tokenize we removed all "dots" which do not indicate a sentence boundary, by compiling a look-up table and using it in conjunction with regular expression operations (RegEx).
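The following is a minimal sketch of this dot-removal step, assuming a hypothetical excerpt of the look-up table (the actual table is corpus-specific and part of the repository linked above):

```python
import re
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")

# Hypothetical excerpt of the look-up table: strings whose trailing dot
# does not mark a sentence boundary (abbreviations, statute citations, etc.).
NON_BOUNDARY_DOTS = ["Abs.", "Nr.", "Rn.", "vgl.", "No.", "Inc.", "Stat."]

def strip_non_boundary_dots(text: str) -> str:
    """Remove dots that do not indicate a sentence boundary."""
    for abbreviation in NON_BOUNDARY_DOTS:
        # e.g. "Abs." -> "Abs", so the tokenizer no longer splits there
        text = re.sub(re.escape(abbreviation), abbreviation.rstrip("."), text)
    return text

def split_sentences(text: str, language: str = "german") -> list:
    """Sentence-tokenize the cleaned text with NLTK."""
    return sent_tokenize(strip_non_boundary_dots(text), language=language)

print(split_sentences("Das Berufungsgericht verweist auf Abs. 2 Nr. 3. Das Urteil wird aufgehoben."))
```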
3.3 Measuring Local Entropy using a Neural Language Model
To train word embeddings we use Gensim's [24] Word2Vec implementation. Word2Vec is a popular word embedding algorithm which uses a neural language model to predict local word co-occurrence. A vector of predictive weights is learned, during the model training, for each word in the vocabulary. These weight vectors can be interpreted as the geometric location of the word in a semantic space, where words that are near each other in the space are semantically related.

There are two architectural versions of Word2Vec, CBOW and SkipGram. Simplified, in a CBOW model the neighbouring context words are embedded to predict a left-out target word. In a SkipGram model, the target word is embedded to predict whether a paired word is sampled from the context or randomly sampled from outside the context.

Once trained, the Word2Vec model gives a predicted probability distribution across words given a context. Out of the box, Gensim offers, for the CBOW model, a command which yields the probability of a word being the centre (target) word, given the specified context words. For the purposes of this project, we implemented the SkipGram version with hierarchical softmax. This model can be considered as the (neural) generalisation of the classical n-gram. It serves as our basis for determining the local entropies.
Footnote 1: For a detailed discussion of predicting a context word from a target word, see https://stackoverflow.com/questions/45102484/predict-middle-word-word2vec.

The window size is a hyperparameter. Larger windows capture more semantic relations, whereas smaller windows tend to convey syntactic information [10]. Our experiments showed that SkipGram with a small context (window) size, e.g. |c| = 2, gave better results than the default window size (|c| = 5).
Footnote 2: A recent experimental study of SkipGram models by Lison and Kutuzov [17] found that, for semantic similarity tasks, right-side contexts are more important than left-side contexts, at least for English, and that the average model performance was not significantly influenced by the removal of stop words.

For the discussion of the local entropy calculation and its implementation, cf. Appendix A. For the Kolmogorov-Smirnov test we used SciPy.
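As a reference point, the following is a minimal sketch of how a SkipGram model with hierarchical softmax and these hyperparameters (window |c| = 2, N = 300 dimensions, 30 epochs; see Appendix A.3) can be trained with Gensim; the toy corpus and the min_count cutoff are placeholders, and the parameter names follow Gensim 4.x (older versions use size/iter):

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, the preprocessed court sentences from Section 3.2.
sentences = [
    ["the", "court", "remands", "the", "case"],
    ["the", "court", "affirms", "the", "judgment"],
]

model = Word2Vec(
    sentences=sentences,
    sg=1,             # SkipGram architecture
    hs=1,             # hierarchical softmax
    negative=0,       # disable negative sampling
    window=2,         # small context window |c| = 2
    vector_size=300,  # N = 300 embedding dimensions
    epochs=30,        # 30 training epochs
    min_count=1,      # 1 only so the toy corpus is not filtered out
)
print(model.wv["court"].shape)  # (300,)
```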
3.4 Measuring Global Entropy using Lempel-Ziv Compression
The second entropy measure we compute uses the Lempel-Ziv algorithm for sequential data. First, we compress the raw text using the gzip compression module interface in Python, with the compression level set to its maximum value (= 9).

We define the compression ratio r_i of an individual text txt_i as

    r_i := |txt_i| / |gzip(txt_i)|,

where |·| denotes the size as measured in bits. The inverse ratio r_i^{-1} yields the size of the compressed file as a fraction of the original file. Note that r_i > 0 for all documents i, and equivalently for the entire corpus. When considering compression rates for individual texts and for the entire corpus, one should keep in mind the sub-additivity of the Shannon entropy.
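A minimal sketch of this inverse compression ratio, assuming sizes measured in bytes (which leaves the ratio unchanged relative to bits):

```python
import gzip

def inverse_compression_ratio(text: str) -> float:
    """r_i^{-1} = |gzip(txt_i)| / |txt_i|, with maximum compression level 9."""
    raw = text.encode("utf-8")
    compressed = gzip.compress(raw, compresslevel=9)
    return len(compressed) / len(raw)

# Lower values indicate more structure/predictability in the text.
opinion = "The judgment of the Court of Appeals is reversed. " * 100
print(round(inverse_compression_ratio(opinion), 3))
```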
4 RESULTS

4.1 Local Entropy of Words
Our first analysis compares the distributions of the word entropies across the different corpora. We would like to determine the differences in the distribution of the local entropy values of the language used by the BGH's Straf- and Zivilsenat and the U.S. Supreme Court. To this end, Figure 1 plots the respective empirical cumulative distribution functions ECDF_BGH-Z, ECDF_BGH-Str and ECDF_SC.

[Figure 1: Empirical cumulative distribution functions (ECDF) of the local entropy values for the BGH's Straf- and Zivilsenat and the U.S. Supreme Court, displaying the civil law-common law hysteresis.]

As can be seen in the figure, in the interval [0, 4] the distributions of the BGH's criminal chamber and the U.S. Supreme Court are similar, whereas for entropy values t ≥ 4 we find that ECDF_BGH-Str(t) > ECDF_SC(t), i.e. the Strafsenat's curve lies strictly above the U.S. Supreme Court's. Comparing the Zivilsenat to the U.S. Supreme Court, we find that the difference between the two ECDF curves is always strictly positive, i.e. ECDF_BGH-Z(t) − ECDF_SC(t) > 0 for every t ∈ [0, max(entropy(BGH-Z))].

4.2 Adjusting for English-German Language Differences
We use the EuroParl German corpus and its aligned English translation as a baseline for two reasons. First, we want to gauge the quality of our local entropy method. Second, we would like to disentangle language-specific effects, i.e. English vs. German, when comparing the U.S. Supreme Court to the BGH.

Figure 2 demonstrates how the method behaves across languages, using the parallel, sentence-aligned EuroParl German and English corpora. As predicted by theory for a good translation, our method yields two nearly identical probability distributions (Left Panel). As seen in the Right Panel, the empirical cumulative distribution functions of the local entropies are also very similar. It would be interesting to further study the influence of n-grams on the local entropy distribution of translations.

[Figure 2: Left Panel: Probability distributions of the local entropy values of the European Parliament's German proceedings (EuroParl de) and of its English translation (EuroParl en). Right Panel: Empirical cumulative distribution functions (ECDF) of the local entropy values for the BGH's Straf- and Zivilsenat, the U.S. Supreme Court, EuroParl German, and EuroParl English.]

We quantified the distance between the empirical distribution functions of the EuroParl English and German corpora via the two-sided Kolmogorov–Smirnov test [7]. The null hypothesis H_0 states that the two observed and stochastically independent samples are drawn from the same (continuous) distribution. We calculated the value of the ECDF in steps of 1/10 on the interval [0, 16], i.e. the range of the entropy values. The resulting D-statistic is 0.069 with a two-tailed p-value of 0.843; therefore we cannot reject H_0.

Second, the comparison with the baseline suggests that, as we hypothesised, the (one might even argue scientific) use of German and English, respectively, in the courts has significantly less local entropy than the more colloquial and non-technical use of the language in political speeches. This results in the strict local ambiguity order

    ECDF_BGH-Z ≺ ECDF_BGH-Str ≺ ECDF_SC ≺ ECDF_EP-de, with ECDF_EP-de ∼ ECDF_EP-en.
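A minimal sketch of the Kolmogorov–Smirnov comparison above, using placeholder samples in place of the per-token entropy values from the two EuroParl models:

```python
import numpy as np
from scipy.stats import ks_2samp

# Placeholder data: in the paper these are the local entropy values of the
# EuroParl German and English vocabularies, ranging over roughly [0, 16].
rng = np.random.default_rng(0)
entropies_de = rng.gamma(shape=4.0, scale=1.5, size=5000)
entropies_en = rng.gamma(shape=4.0, scale=1.5, size=5000)

# Two-sided, two-sample Kolmogorov-Smirnov test; H0: both samples are drawn
# from the same continuous distribution.
statistic, p_value = ks_2samp(entropies_de, entropies_en)
print(f"D = {statistic:.3f}, p = {p_value:.3f}")
```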
4.3 Global Entropy of Documents
Now we produce the more global measure of entropy using the compression-based measure. We estimated the macroscopic entropy of the different corpora by compressing the entire raw text file of each and then calculating the corresponding inverse compression ratios, as described above. A higher value means that the corpus has higher entropy per segment of text. Put differently, a lower value means that there is relatively more structure or predictability in the underlying text features.

Table 2 reports the compression ratios for each corpus. As before, the values for the EuroParl corpora are almost identical, and they have the highest entropy rate. This likely reflects the broader diversity of issues covered in EuroParl relative to the law. The U.S. Supreme Court corpus has a slightly lower entropy rate. Meanwhile, the BGH's Strafsenat and Zivilsenat corpora yield substantially lower values, with the BGH's civil chamber having the lowest ratio of 0.283.

Table 2: Inverse Compression Ratio Entropy, by Corpus. See Subsection 3.4 for method details.

Corpus                Inverse Compression Ratio
EuroParl German       0.323
EuroParl English      0.322
U.S. Supreme Court    0.316
BGH Strafsenat        0.300
BGH Zivilsenat        0.283

Next, we show how entropy varies over time in the SCOTUS data. Fig. 3 shows the inverse compression ratio entropy measures for the records of the U.S. Supreme Court in the last century. We can see that entropy has decreased since the 1950s, indicating an increase in the relative structure or predictability of the text.

[Figure 3: Per-document inverse gzip compression ratio of the U.S. Supreme Court for the period 1924 until 2013 (a higher value means higher entropy).]

This trend can be interpreted as a more formalised and standardised writing style. The shift could be due to the ongoing expansion of administrative (statutory) law in the U.S. system. Once statutes are extensively used, the need for efficient methods of referral emerges, e.g. [§§, articles, sections, lit., ...], leading to a cryptic, pseudocode-like style of writing. This code-like, technical style was already extensively used by the BGH and the French Court of Cassation.

4.4 Low-Entropy Words are Functional
To further substantiate the above ideas, we selected from each corpus (SCOTUS, BGH Zivil- and Strafsenat, EuroParl German and English) the tokens with the lowest local entropy values (≤ 1). Fig. 4 includes word clouds for the lowest-entropy words in our vocabulary.

[Figure 4: Word clouds for lowest-entropy words. Top left: EuroParl German. Top right: EuroParl English. Bottom left: BGH Zivilsenat. Bottom right: U.S. Supreme Court (SCOTUS).]

For the BGH (bottom left) one recognises key phrases from procedural law, such as 'zurückverweisen' (to remand a case). We see technical language for civil cases, such as 'Insolvenzverfahrens' (insolvency proceeding). For the SCOTUS, we see procedural, criminal and civil technical phrases such as 'beyond reasonable' and 'qualified immunity'. For the EuroParl data, the dominating lowest-entropy phrases are procedural and related to the Parliament's sessions, such as the German 'siehe_Protokoll', which corresponds to the English 'see_Minutes'.

The very low-entropy words serve as functional foundations that typify the respective environment and set the tone. These recurring phrases have a very precise meaning, as the human reader recognises and as quantitatively reflected in our neural model.

An in-depth analysis of the precise distribution of the local entropies along the different linguistic axes, and the broader syntactic and semantic categories, is left for a separate publication.
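A minimal sketch of the selection step behind Figure 4, assuming a dictionary that maps each token to its local entropy as computed in Appendix A (the values here are placeholders):

```python
# Placeholder mapping from token to local entropy H(w); in practice this is
# computed for the whole vocabulary of each trained Word2Vec model.
entropy_by_token = {
    "zurückverweisen": 0.4,
    "Insolvenzverfahrens": 0.8,
    "qualified_immunity": 0.7,
    "Gericht": 5.2,
}

# Tokens with local entropy <= 1 are the "functional", highly predictable
# phrases visualised in the word clouds of Figure 4.
low_entropy_tokens = {w: h for w, h in entropy_by_token.items() if h <= 1.0}
print(sorted(low_entropy_tokens, key=low_entropy_tokens.get))
```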
5 CONCLUSION
Our analysis has shown that the writing style of the civil law has lower relative entropy than that of the common law, at least in the important cases of the SCOTUS and the BGH. We have shown this for two measures: first, local ambiguity, i.e. word entropy, produced using a neural language model, and second, global entropy produced from a compression-ratio algorithm. Civil- and common-law writing styles are thus distinguishable on a purely information-theoretic basis.

The results are helpful from the perspectives of history and social science. The original German legal doctrine is very much rooted in jurisprudence and has been strongly influenced, especially since the second half of the 19th century, by the development of the natural sciences. This systematic approach is reflected in the writing style. Code-based legal writing requires, as argued above, efficient and standardised mechanisms of referencing, common to all scientific writing.

Our method innovates by using a neural language model, combined with data compression algorithms, in order to empirically determine both word and stylistic ambiguity, i.e. local and global entropy. This approach proves to be fruitful and could integrate naturally into future enhancements of (deeper) neural language models. In future work these could provide an even finer spatio-temporal resolution of how information is distributed across different linguistic scales and over time, ranging from the word to the corpus level.

In summary, our implementation and use of a local entropy measure based on a neural language model has led to striking results that contribute to an old debate on legal traditions. The contribution could be important from both a linguistic and a legal perspective. We foresee a broad range of further applications.

REFERENCES
[1] Elliott Ash and Daniel L. Chen. 2018. Mapping the Geometry of Law Using Document Embeddings.
[2] Christian Bentz, Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i-Cancho. 2017. The Entropy of Words—Learnability and Expressivity across More than 1000 Languages. Entropy 19, 6 (2017), 275. DOI: http://dx.doi.org/10.3390/e19060275
[3] Keith Carlson, Michael A. Livermore, and Daniel Rockmore. 2015-2016. A Quantitative Analysis of Writing Style on the U.S. Supreme Court. Washington University Law Review 93 (2015-2016), 1461.
[4] Joseph Dainow. 1966. The Civil Law and the Common Law: Some Points of Comparison. The American Journal of Comparative Law 15, 3 (1966), 419–435. http://www.jstor.org/stable/838275
[5] Stefania Degaetano-Ortlieb and Elke Teich. 2019. Toward an optimal code for communication: The case of scientific English. Corpus Linguistics and Linguistic Theory (2019). https://www.degruyter.com/view/journals/cllt/ahead-of-print/article-10.1515-cllt-2018-0088/article-10.1515-cllt-2018-0088.xml
[6] Daniel Martin Katz, Michael J. Bommarito II, and Josh Blackman. 2017. A general approach for predicting the behavior of the Supreme Court of the United States. PLoS ONE 12, 4 (2017). https://doi.org/10.1371/journal.pone.0174698
[7] J. L. Hodges. 1958. The significance probability of the Smirnov two-sample test. Ark. Mat. 3, 5 (1958), 469–486. DOI: http://dx.doi.org/10.1007/BF02589501
[8] Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. (2017). To appear.
[9] D. Jurafsky and J. H. Martin. 2019. Speech and Language Processing (3rd ed., draft). https://web.stanford.edu/~jurafsky/slp3/
[10] U. Kamath, J. Liu, and J. Whitaker. 2019. Deep Learning for NLP and Speech Recognition. Springer International Publishing. https://books.google.ch/books?id=8cmcDwAAQBAJ
[11] Daniel Martin Katz and Michael James Bommarito. 2014. Measuring the complexity of the law: the United States Code. Artificial Intelligence and Law 22, 4 (2014), 337–374.
[12] Sara Klingenstein, Tim Hitchcock, and Simon DeDeo. 2014. The civilizing process in London's Old Bailey. Proceedings of the National Academy of Sciences 111, 26 (2014), 9419–9424. DOI: http://dx.doi.org/10.1073/pnas.1405984111
[13] Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: The Tenth Machine Translation Summit. AAMT, Phuket, Thailand, 79–86. http://mt-archive.info/MTS-2005-Koehn.pdf
[14] I. Kontoyiannis, P. H. Algoet, Y. M. Suhov, and A. J. Wyner. 1998. Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. IEEE Transactions on Information Theory 44, 3 (1998), 1319–1327.
[15] Rafael La Porta, Florencio Lopez-de-Silanes, and Andrei Shleifer. 2008. The economic consequences of legal origins. Journal of Economic Literature 46, 2 (2008), 285–332.
[16] Omer Levy and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, 302–308. DOI: http://dx.doi.org/10.3115/v1/P14-2050
[17] Pierre Lison and Andrey Kutuzov. 2017. Redefining Context Windows for Word Embedding Models: An Experimental Study. In Proceedings of the 21st Nordic Conference on Computational Linguistics. Association for Computational Linguistics, Gothenburg, Sweden, 284–288. https://www.aclweb.org/anthology/W17-0239
[18] Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics. Association for Computational Linguistics, Philadelphia.
[19] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
[20] M. A. Montemurro and D. H. Zanette. 2011. Universal Entropy of Word Ordering Across Linguistic Families. PLoS ONE 6, 5 (2011).
[21] Frederic Morin and Yoshua Bengio. 2005. Hierarchical Probabilistic Neural Network Language Model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Robert G. Cowell and Zoubin Ghahramani (Eds.). Society for Artificial Intelligence and Statistics, 246–252. http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf
[22] Marcel Alexander Niggli and Louis Frédéric Muskens. 2014. BSK StGB-Niggli/Muskens, Art. 11. In Schweizerische Strafprozessordnung/Jugendstrafprozessordnung (StPO/JStPO) (2nd ed.), Marianne Heer, Marcel Alexander Niggli, and Hans Wiprächtiger (Eds.). Vol. 1. Helbing & Lichtenhahn, 3501.
[23] R. A. Posner. 2003. Economic Analysis of Law. Aspen Publishers. https://books.google.ch/books?id=gyUkAQAAIAAJ
[24] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en
[25] Xin Rong. 2014. word2vec Parameter Learning Explained. arXiv:1411.2738 (2014). http://arxiv.org/abs/1411.2738
[26] George Sanchez. 2019. Sentence Boundary Detection in Legal Text. In Proceedings of the Natural Legal Language Processing Workshop 2019. Association for Computational Linguistics, Minneapolis, Minnesota, 31–38. DOI: http://dx.doi.org/10.18653/v1/W19-2204
[27] C. E. Shannon. 1951. Prediction and Entropy of Printed English. Bell System Technical Journal 30, 1 (1951), 50–64. DOI: http://dx.doi.org/10.1002/j.1538-7305.1951.tb01366.x
[28] J. Ziv and A. Lempel. 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23, 3 (1977), 337–343.

A THEORY
Here we give a theoretical description of the steps underlying our approach.

A.1 Preprocessing
Let C be a non-empty set, the corpus. For n ∈ ℕ, consider the map

    π_n : C → V_n,

where V_n is the, possibly empty, set of n-grams associated to C, and the V_n satisfy V_k ∩ V_l = ∅ for l ≠ k. Usually, the set of unigrams V_1 is called the vocabulary of the corpus C. For a fixed ν ∈ ℕ, set

    𝒱_ν := ⋃_{n=1}^{ν} V_n,

which is the set of (two-sided) uni-, bi-, tri- up to ν-grams and which, for ν large enough, yields an approximation (or pairwise disjoint decomposition) of the corpus C that captures both syntactic and semantic information. Then 𝒱_ν is the (generalised) vocabulary up to order ν.
Footnote 3: More general, i.e. functional, neighbourhoods are of course possible, e.g. based on grammatical information, as considered by Levy and Goldberg [16].

The elements w ∈ 𝒱_ν, or 𝒱 if ν is fixed and clear from the context, are tokens or n-grams, which might be considered as n-order words. We denote by |𝒱| the size of 𝒱, i.e. the number of pairwise different tokens. The family of maps π_n, and hence the specific sets V_n, determine the preprocessing of the corpus data.
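A minimal sketch of how a generalised vocabulary up to trigrams (ν = 3) can be materialised with Gensim's phrase detection, as used in Section 3.2; the toy corpus and the min_count/threshold settings are illustrative only:

```python
from gensim.models.phrases import Phrases

# Toy corpus; in practice, the preprocessed court sentences from Section 3.2.
sentences = [
    ["the", "New", "York", "City", "court"],
    ["New", "York", "City", "officials"],
    ["the", "New", "York", "legislature"],
]

# A first pass joins frequent bigrams into single tokens (e.g. "New_York"); a
# second pass over the transformed corpus can then join bigram + unigram into
# trigrams (e.g. "New_York_City"). On a corpus this small, other frequent
# pairs will be joined as well; realistic settings use higher cutoffs.
bigram = Phrases(sentences, min_count=1, threshold=1.0)
trigram = Phrases(bigram[sentences], min_count=1, threshold=1.0)

# Token streams over the generalised vocabulary (uni-, bi- and trigrams).
corpus_ngrams = [trigram[bigram[sentence]] for sentence in sentences]
print(corpus_ngrams)
```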
A.2 Local Entropy from Word2Vec
The word2vec framework consists of a bundle of mathematical objects [19, 25]. First, it defines a dense Hilbert space representation,

    word2vec : 𝒱 → ℝ^N,  w ↦ h_w,

where N ∈ ℕ is the dimension of the coordinate space, which is a hyper-parameter of the model. Let 𝔓(𝒱) denote the set of discrete probability distributions on 𝒱. Then there exists a map

    f_w2v : 𝒱 → 𝔓(𝒱),  w ↦ μ_w,

which associates to every token w a probability distribution μ_w, namely the posterior (multinomial) distribution. The local entropy or ambiguity is the map

    H : 𝒱 → ℝ_+,  w ↦ H(μ_w),

which assigns to every token w the Shannon entropy of the corresponding probability distribution μ_w. The posterior distribution is given by a Boltzmann distribution (softmax).

It is calculated as follows. Let W be the |𝒱| × N input weight matrix from the input layer to the hidden layer and W̃ the N × |𝒱| weight matrix from the hidden layer to the output layer in the SkipGram model with hierarchical softmax. Every token w_i ∈ 𝒱 determines a pair of vectors (v_i, ṽ_i), the input vector v_i and the output vector ṽ_i, which are given by the i-th row of W and the i-th column of W̃, respectively. Let

    Z_i := Σ_{j=1}^{|𝒱|} e^{⟨ṽ_j | v_i⟩}    (1)

be the local partition function corresponding to the target w_i, with the sum taken over all tokens w_j ∈ 𝒱. (We use the bra-ket notation.)

For the SkipGram model with context c, the probability μ_{w_i}(w_o) of a token w_o being an actual c-context output word of w_i is given by

    p(w_o | w_i) := μ_{w_i}(w_o) := (1/Z_i) e^{⟨ṽ_o | v_i⟩}.    (2)

Therefore, the local entropy of the target w_i (with context c) is given by

    H(w_i) := H(μ_{w_i}) = − Σ_{j=1}^{|𝒱|} p(w_j | w_i) · log_2 p(w_j | w_i).    (3)

A.3 Gensim Implementation
We implemented our local entropy calculation for the SkipGram model in Gensim, with the following parameters: context window = 2, N = 300, and 30 training epochs with hierarchical softmax [21]. The output weight matrix W̃ and the input weight matrix W are stored by Gensim as syn1 (for hierarchical softmax) and syn0, respectively. Note that if negative sampling is used, the output weights are stored in syn1neg.
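A minimal sketch of Equations (1)–(3) on top of a trained model, assuming recent Gensim attribute names (wv.vectors for the input matrix formerly called syn0, and syn1 for the hierarchical-softmax output weights; note that in current Gensim the rows of syn1 index tree nodes, so treating them as per-token output vectors is a simplification of the computation described above):

```python
import numpy as np
from gensim.models import Word2Vec

def local_entropy(model: Word2Vec, token: str) -> float:
    """Shannon entropy H(mu_w) of the softmax distribution in Eqs. (1)-(3)."""
    v_i = model.wv[token]            # input vector v_i (a row of W)
    scores = model.syn1 @ v_i        # inner products <v~_j | v_i>
    scores -= scores.max()           # numerical stabilisation of the softmax
    p = np.exp(scores)
    p /= p.sum()                     # Eq. (2): divide by the partition function Z_i
    return float(-np.sum(p * np.log2(np.clip(p, 1e-12, None))))  # Eq. (3)

# Usage with a model trained as in Section 3.3 / Appendix A.3, e.g.:
# model = Word2Vec(sentences, sg=1, hs=1, negative=0, window=2, vector_size=300, epochs=30)
# print(local_entropy(model, "court"))
```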