<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Entropy in Legal Language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roland Friedrich</string-name>
          <email>roland.friedrich@gess.ethz.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Luzzatto</string-name>
          <email>mauroluzzatto@hotmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elliott Ash</string-name>
          <email>ashe@ethz.ch</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ETH Zürich</institution>
          ,
          <addr-line>Zürich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ETH Zürich</institution>
          ,
          <addr-line>Zürich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ETH Zürich</institution>
          ,
          <addr-line>Zürich</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>We introduce a novel method to measure word ambiguity, i.e. local entropy, based on a neural language model. We use the measure to investigate entropy in the written text of opinions published by the U.S. Supreme Court (SCOTUS) and the German Bundesgerichtshof (BGH), representative courts of the common-law and civil-law court systems respectively. We compare the local (word) entropy measure with a global (document) entropy measure constructed with a compression algorithm. Our method uses an auxiliary corpus of parallel English and German to adjust for persistent differences in entropy due to the languages. Our results suggest that the BGH's texts are of lower entropy than the SCOTUS's. Investigation of low- and high-entropy features suggests that the entropy differential is driven by more frequent use of technical language in the German court.</p>
      </abstract>
      <kwd-group>
        <kwd>neural language models</kwd>
        <kwd>NLP</kwd>
        <kwd>Word2Vec</kwd>
        <kwd>entropy</kwd>
        <kwd>civil law</kwd>
        <kwd>common law</kwd>
        <kwd>judiciary</kwd>
        <kwd>comparative law</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The world’s legal systems feature two major traditions which have
spread to almost all countries. These are the “civil law”,
the continuation and refinement of the Roman “jus civile”,
and the “common law”, which originated in England after the Norman
conquest in 1066 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To oversimplify somewhat, the broad distinction
between the systems is that in civil law judges decide on the basis of
codified rules, while in common law judges decide on the basis of
previous decisions.
      </p>
      <p>
        In civil law commentaries, cf. e.g. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], it is argued that common
law lacks a strong principled foundation. On this view, common law
is not systematised and has no general “strategy” but is rather
driven by “trial and error” on a case-by-case basis. On the other
hand, common law permits judges to adopt novel, pioneering
and innovative ideas or doctrines more easily, and, as Posner [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]
argued, it could be economically more efficient. Some evidence
suggests that nations that followed the common law system have
had better growth prospects than civil-law countries [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], although
whether this effect is causal is not well established.
      </p>
      <p>A proffered reason for the relative inefficiency of civil-law
institutions is that they are too rigid and cannot adapt well to changing
circumstances. Code-based decision-making requires complex
legislation that is costly to maintain, decipher, apply, and revise. These
points are anecdotal, and there is little good empirical evidence
about them. Addressing these issues empirically is difficult because
one does not have both common-law and civil-law systems operating
in the same country. They also tend to be in different languages:
common-law countries tend to be English-speaking, while
Latin-language and German-speaking countries tend to have civil law.
Perhaps foremost, we lack good measures of the complexity of the
law.</p>
      <p>Our goal is to produce some new measures of legal complexity in
a comparative framework. We draw on recent technologies in neural
language modeling to produce a new measure of local entropy at
the word level. We then map entropy levels across case texts in an
English-speaking common law court (the U.S. Supreme Court) and a
German-speaking civil law court (the German Bundesgerichtshof).</p>
      <p>The U.S. Supreme Court (SCOTUS) and German
Bundesgerichtshof (BGH) are the highest courts in the respective legal systems.
They are also two of the most influential judiciaries in the broader
system of international law. Within the common-law and civil-law
traditions, the SCOTUS and BGH are perhaps the most influential
high courts of the last century.</p>
      <p>
        We investigate the legal writing style of both the U.S. Supreme
Court (SCOTUS) and the Bundesgerichtshof (BGH) from an
information theoretic perspective, based on a neural language model.
Concretely, we build our method on the Word2Vec model of Mikolov et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
in order to measure empirically the entropy at the token
level, i.e. the micro scale.
      </p>
      <p>
        We ask whether the two legal systems which these courts
represent can be discriminated, solely based on information theoretic
measures. We find that the BGH tends to have lower entropy than
the SCOTUS, reflecting greater use of low-entropy technical
language. Finally, in the case of the U.S. Supreme Court we further
investigate the temporal evolution of the entropy both at the micro
and macro level, by recording universal compression rates.
      </p>
      <p>
        Shannon [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] in his seminal paper “Prediction and Entropy
of Printed English” initiated the information theoretic study of
natural languages. Similar to a theoretical physics approach,
Shannon applied the mathematical tools he had previously conceived to
understand information. That paper has led to a rich literature on
measuring the information content in written and spoken text.
      </p>
      <p>
        In this literature, a common and useful assumption is that
language is regular in the sense that the underlying stochastic data
generating process is both stationary and ergodic, cf. e.g. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Kontoyiannis et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] discuss various estimators for the Shannon
entropy rate of a stationary ergodic process, and apply them to
English texts. Most notable is the Lempel–Ziv [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] algorithm, which
consistently estimates the entropy lower bound for stationary
ergodic processes.
      </p>
      <p>
        A recent application of the Lempel-Ziv compression algorithm to
compare languages is Montemurro and Zanette [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. They quantify
the contribution of word ordering across different linguistic families
to see whether different languages have different entropy properties. They
find that the Kullback-Leibler divergence (difference in entropy)
between shuffled and unshuffled texts is a structural constant across
all languages considered.
      </p>
      <p>
        A complementary paper comparing languages at the word level
is Bentz et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. They undertake a series of computer experiments
to measure the word entropy across more than 1000 languages.
They use unigram entropies which they estimate statistically. They
find that word entropies follow a narrow unimodal distribution.
      </p>
      <p>
        Degaetano-Ortlieb and Teich [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is an application looking at
changes in language entropy over time in a technical setting. They
investigate the linguistic development of scientific English, by
analysing the Royal Society Corpus (RSC) and the Corpus of Late
Modern English (CLMET) computationally. They consider $n$-gram
language models (for $n = 3$) and track the temporal changes of the
Kullback-Leibler divergence, as a measure of local ambiguity. Their
main finding is that Scientific English, as it emerged over time,
resulted in an increasingly optimised code for written communication
by specialists.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Quantitative Analysis of Law</title>
      <p>
        Our paper adds to the emerging literature in computational legal
studies. Exemplary of this literature is Carlson, Livermore and
Rockmore [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], who study the writing style of the U.S. Supreme Court.
Katz et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] apply machine learning, combined with classical
statistical methods, as a novel approach to predict the behaviour of
the U.S. Supreme Court in a generalised, out-of-sample context.
      </p>
      <p>
        Klingenstein, Hitchcock, and DeDeo [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] take an
information-theoretic approach to legal cases. They present a large-scale
quantitative analysis of transcripts of London’s Old Bailey. They use
the Jensen-Shannon divergence to show that trials for violent and
nonviolent offenses become increasingly distinct. This divergence
reflects broader cultural shifts starting around 1800.
      </p>
      <p>
        The use of neural text embeddings in law is illustrated by Ash
and Chen [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. That paper investigates the use of legal language and
judicial reasoning in federal appellate courts, by using tools from
natural language processing (NLP) and dense vector
representations. They show that the resulting vector space geometry contains
information to distinguish court, time, and legal topics.
      </p>
      <p>
        The closest paper to ours is Katz and Bommarito [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. They
experiment with a number of methods for measuring complexity
in law, applied to U.S. federal statutes. They use measures of
language entropy based on word probabilities, but do not use word
embeddings.
      </p>
    </sec>
    <sec id="sec-3">
      <title>DATA AND METHODS</title>
      <p>The code used in this paper is available at:
https://github.com/MauroLuzzatto/legal-entropy.</p>
    </sec>
    <sec id="sec-4">
      <title>Data</title>
      <p>Our analysis is based on the U.S. Supreme Court decisions from the
years 1924 to 2013, and the decisions of the German
Bundesgerichtshof (BGH), covering the years 2014 until 2019. We separated the
BGH data into rulings of the Zivil- and Strafsenat (civil and criminal
chambers).</p>
      <p>
        Additionally, as a baseline, we use Koehn’s [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] EuroParl parallel
corpus in German and English, consisting of the proceedings of the
European Parliament from 1996 to 2006.
      </p>
      <p>Some summary tabulations on the scope of the corpus are
reported in Table 1.</p>
    </sec>
    <sec id="sec-5">
      <title>Pre-Processing</title>
      <p>
        For our analysis we use Python as well as spaCy [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and NLTK [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
as our language processing tools.
      </p>
      <p>
        We apply the standard preprocessing steps in order to train the
Word2Vec model in Gensim – for details cf. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. As an exception, we
did not lemmatise or stem the tokens, and we kept capitalisation.
This makes the English and German texts more comparable.
      </p>
      <p>We also used the phraser function from Gensim to treat idiomatic
bigrams, such as "New York", and trigrams, such as "New York City",
as single tokens.</p>
      <p>
        Deserving special mention is the determination of sentence
boundaries, a challenging task in legal writing [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. We found this to be
especially true for the BGH civil case corpus, and less so for the
U.S. Supreme Court and the EuroParl data. A multitude of
abbreviations, dates and, most importantly, statutes involve a “dot”, leading
to a significant number of erroneous sentence tokens when the
standard NLTK sentence tokenizer is naively applied. Therefore,
before using nltk.sent_tokenize we removed all “dots” which do
not indicate a sentence boundary, by compiling a look-up table and
using it in conjunction with regular expression operations
(RegEx).
      </p>
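      <p>The dot-removal step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the paper: the look-up table ABBREVIATIONS, the helper names protect_dots and naive_sentence_split, and the example sentence are all hypothetical, and a simple regular-expression splitter stands in for nltk.sent_tokenize so that the sketch runs with the standard library alone.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical look-up table: tokens whose dots do not end a sentence.
ABBREVIATIONS = ["Abs.", "Art.", "Nr.", "No.", "U.S.", "v."]


def protect_dots(text, abbreviations=ABBREVIATIONS):
    """Remove dots that do not mark a sentence boundary."""
    # Handle longer entries first so that they are not partially rewritten.
    for abbr in sorted(abbreviations, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(abbr)
        text = re.sub(pattern, abbr.replace(".", ""), text)
    # Drop dots between digits, as in dates such as 12.03.2015.
    # (Assumption: decimal numbers may be collapsed in the same way.)
    return re.sub(r"(?<=\d)\.(?=\d)", "", text)


def naive_sentence_split(text):
    """Stand-in for a sentence tokenizer: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = ("Der Senat verweist auf § 823 Abs. 1 BGB. "
        "Das Urteil vom 12.03.2015 bleibt bestehen.")
print(naive_sentence_split(text))                # splits wrongly after "Abs."
print(naive_sentence_split(protect_dots(text)))  # two sentences
```
]]></preformat>
      <p>In practice the look-up table would have to be compiled from the corpus itself, since the relevant abbreviations and citation patterns differ between the BGH, SCOTUS and EuroParl texts.</p>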
    </sec>
    <sec id="sec-7">
      <title>Measuring Local Entropy using a Neural Language Model</title>
      <p>
        To train word embeddings we use Gensim’s [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] Word2Vec
implementation. Word2Vec is a popular word embedding algorithm
which uses a neural language model to predict local word
co-occurrence. A vector of predictive weights is learned, during the
model training, for each word in the vocabulary. These weight
vectors can be interpreted as the geometric location of the word in a
semantic space, where words that are near each other in the space
are semantically related.
      </p>
      <p>There are two architectural versions of Word2Vec, CBOW and
SkipGram. Simplified, in a CBOW model the neighbouring context
words are embedded to predict a left-out target word. In a SkipGram
model, the target word is embedded to predict whether a paired
word is sampled from the context or randomly sampled from outside
the context.</p>
      <p>Once trained, the Word2Vec model gives a predicted probability
distribution across words given a context. Out of the box, Gensim
offers a command for the CBOW model which yields the
probability of a word being the centre (target) word, given the
specified context words. For the purposes of this project, we
implemented the SkipGram version with hierarchical softmax. This
model can be considered as the (neural) generalisation of the
classical $n$-gram. It serves as our basis for determining the local
entropies.<sup>1</sup></p>
      <p>
        The window size is a hyperparameter. Larger windows capture
more semantic relations whereas smaller windows tend to convey
syntactic information [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In our experiments, SkipGram
with a small context (window) size, e.g. $|c| = 2$, gave better results
than the default window size ($|c| = 5$).<sup>2</sup>
      </p>
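      <p>To make the role of the window size concrete, the following sketch enumerates the (target, context) pairs that a SkipGram model would be trained on for a given symmetric window. It is an illustration only: the function name skipgram_pairs and the toy sentence are hypothetical, and the sketch ignores refinements that library implementations may apply on top of this basic scheme.</p>
      <preformat><![CDATA[
```python
def skipgram_pairs(tokens, window=2):
    """Return all (target, context) pairs for a symmetric context window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["the", "court", "held", "that", "the_statute"]
print(len(skipgram_pairs(sentence, window=1)))  # only direct neighbours
print(len(skipgram_pairs(sentence, window=2)))  # wider, more topical context
```
]]></preformat>
      <p>A smaller window keeps only the immediate neighbours of each target, which is consistent with the observation above that small windows emphasise syntactic information.</p>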
      <p>For the discussion of the local entropy calculation and its
implementation, cf. Appendix A.</p>
      <p>For the Kolmogorov-Smirnov test we used SciPy.</p>
    </sec>
    <sec id="sec-9">
      <title>Measuring Global Entropy using Lempel-Ziv Compression</title>
      <p>The second entropy measure we compute uses the Lempel-Ziv
algorithm for sequential data. First, we compress the raw text using the
gzip compression module interface in Python, with the compression
level set to its maximum value (= 9).</p>
      <p>We define the compression ratio, $r$, of an individual text, $\mathrm{txt}$, as
$r := \frac{|\mathrm{txt}|}{|\mathrm{gzip}(\mathrm{txt})|}$, where $|\cdot|$ denotes the size as measured in bits. The
inverse ratio $r^{-1}$ yields the size of the compressed file as a fraction of
the original file. Note that $r &gt; 0$ for all documents and
equivalently for the entire corpus. When considering compression
rates for individual texts and the entire corpus, one should keep in
mind the sub-additivity of the Shannon entropy.</p>
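      <p>The compression-based measure can be sketched with Python's gzip module at compression level 9, as described above. The function names and the two toy texts are hypothetical, and UTF-8 encoding is an assumption; since numerator and denominator are measured in the same unit, counting bytes instead of bits leaves the ratio unchanged.</p>
      <preformat><![CDATA[
```python
import gzip
import random


def compression_ratio(text):
    """Size of the raw text divided by the size of its gzip-compressed form."""
    raw = text.encode("utf-8")  # assumption: texts are stored as UTF-8
    compressed = gzip.compress(raw, compresslevel=9)
    return len(raw) / len(compressed)


def inverse_compression_ratio(text):
    """Compressed size as a fraction of the original size."""
    return 1.0 / compression_ratio(text)


# A highly repetitive text versus a text of random letters (toy data).
repetitive = "the court held that " * 500
rng = random.Random(0)
noisy = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(10000))

print(inverse_compression_ratio(repetitive))  # small: very predictable
print(inverse_compression_ratio(noisy))       # larger: little structure
```
]]></preformat>
      <p>A lower inverse ratio indicates more structure or predictability in the text, which is how the corpus-level values are read in the results.</p>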
    </sec>
    <sec id="sec-10">
      <title>RESULTS</title>
    </sec>
    <sec id="sec-11">
      <title>Local Entropy of Words</title>
      <p>
        Our first analysis compares the distributions of the word
entropies across the different corpora. We would like to determine
the differences in the distribution of the local entropy values of
the language used by the BGH’s Straf- and Zivilsenat and the U.S.
Supreme Court. To this end, Figure 1 plots the respective empirical
cumulative distribution functions $\mathrm{ECDF}_{\text{BGH-Z}}$, $\mathrm{ECDF}_{\text{BGH-Str}}$ and
$\mathrm{ECDF}_{\text{SC}}$.
      </p>
      <p>
        <sup>1</sup> For a detailed discussion of predicting a context word from a target word, see
https://stackoverflow.com/questions/45102484/predict-middle-word-word2vec.
      </p>
      <p>
        <sup>2</sup> A recent experimental study of SkipGram models by Lison and Kutuzov [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] found
that for semantic similarity tasks right-side contexts are more important than
left-side contexts, at least for English, and that the average model performance was not
significantly influenced by the removal of stop words.
      </p>
      <p>
        As can be seen in the figure, in the interval $[0, 4]$ the
distributions of the BGH’s criminal chambers and the U.S. Supreme
Court are similar, whereas for entropy values $x \geq 4$ we find that
$\mathrm{ECDF}_{\text{BGH-Str}}(x) &gt; \mathrm{ECDF}_{\text{SC}}(x)$, i.e. the Strafsenat’s curve is strictly
above the U.S. Supreme Court’s.
      </p>
      <p>Comparing the Zivilsenat to the U.S. Supreme Court, we find that
the difference between the two ECDF curves
is always strictly positive, i.e. $\mathrm{ECDF}_{\text{BGH-Z}}(x) -
\mathrm{ECDF}_{\text{SC}}(x) &gt; 0$ for every $x \in [0, \max(\mathrm{entropy}(\text{BGH-Z}))]$.</p>
    </sec>
    <sec id="sec-13">
      <title>Adjusting for English-German Language Differences</title>
      <p>We use the EuroParl German corpus and its aligned English
translation as a baseline for two reasons. First, we want to gauge the
quality of our local entropy method. Second, we would like to
disentangle language-specific effects, i.e. English vs. German, when
comparing the U.S. Supreme Court to the BGH.</p>
      <p>Figure 2 demonstrates how the method behaves across languages
using the parallel, sentence aligned EuroParl German and English
corpora. As predicted by theory for a good translation, our method
yields two nearly identical probability distributions (Left Panel).</p>
      <p>As seen in the Right Panel, the empirical cumulative distribution
functions of the local entropies are also very similar. It would be
interesting to further study the influence of $n$-grams on the local
entropy distribution of translations.</p>
      <p>
        We quantified the distance between the empirical distribution
functions of the EuroParl English and German corpora via the
two-sided Kolmogorov–Smirnov test [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The null hypothesis $H_0$
states that two observed and stochastically independent samples
are drawn from the same (continuous) distribution. We calculated
the value of the ECDF in steps of 1/10 in the interval $[0, 16]$, i.e. the
range of the entropy values. The resulting $D$-statistic is 0.069
and the two-tailed $p$-value is 0.843; therefore we cannot reject $H_0$.
      </p>
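      <p>The distance underlying the two-sample Kolmogorov-Smirnov test is the largest vertical gap between two empirical distribution functions. The sketch below evaluates both ECDFs on the grid of step 1/10 over the interval from 0 to 16 mentioned above and returns that gap. It is a standard-library illustration with hypothetical function names and toy samples; it does not compute a p-value, for which SciPy's scipy.stats.ks_2samp could be used instead.</p>
      <preformat><![CDATA[
```python
from bisect import bisect_right

# Grid 0.0, 0.1, ..., 16.0 covering the range of the entropy values.
GRID = [k / 10 for k in range(0, 161)]


def ecdf_on_grid(sample, grid=GRID):
    """Fraction of observations less than or equal to each grid point."""
    ordered = sorted(sample)
    n = len(ordered)
    return [bisect_right(ordered, x) / n for x in grid]


def ks_statistic(sample_a, sample_b, grid=GRID):
    """Largest absolute difference between the two ECDFs on the grid."""
    ecdf_a = ecdf_on_grid(sample_a, grid)
    ecdf_b = ecdf_on_grid(sample_b, grid)
    return max(abs(a - b) for a, b in zip(ecdf_a, ecdf_b))


# Toy entropy values standing in for two corpora.
corpus_a = [1.0, 2.0, 3.0, 4.0]
corpus_b = [1.5, 2.5, 3.5, 4.5]
print(ks_statistic(corpus_a, corpus_b))
```
]]></preformat>
      <p>A small statistic, as reported for the two EuroParl corpora, means that the two curves never lie far apart anywhere on the grid.</p>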
      <p>Second, the comparison with the baseline suggests that, as we
hypothesised, the (one might even argue scientific) use of German
and English, respectively, in the courts has significantly less local
entropy than the more colloquial and non-technical use
of the language in political speeches. This results in the strict local
ambiguity order</p>
      <p>$\mathrm{ECDF}_{\text{BGH-Z}} \prec \mathrm{ECDF}_{\text{BGH-Str}} \prec \mathrm{ECDF}_{\text{SC}} \prec \mathrm{ECDF}_{\text{EP-de}}$,
with $\mathrm{ECDF}_{\text{EP-de}} \sim \mathrm{ECDF}_{\text{EP-en}}$.</p>
    </sec>
    <sec id="sec-14">
      <title>Global Entropy of Documents</title>
      <p>Now we produce the more global measure of entropy using the
compression-based measure. We estimated the macroscopic entropy
of the different corpora by compressing the entire raw text file for
each and then calculating the corresponding inverse compression
ratios, as described above. A higher value means that the corpus
has higher entropy per segment of text. Put differently, a lower
value means that there is relatively more structure or predictability
in the underlying text features.</p>
      <p>Table 2 reports the compression ratios for each corpus. As
before, the values for the EuroParl corpora are almost identical, and
they have the highest entropy rate. This likely reflects the broader
diversity of issues covered in EuroParl relative to the law. The U.S.
Supreme Court corpus has a slightly lower entropy rate. Meanwhile, the
BGH’s Strafsenat and Zivilsenat corpora yield substantially lower
values, with the BGH’s civil courts having the lowest ratio of 0.283.</p>
      <p>Next, we show how entropy varies over time in the SCOTUS
data. Fig. 3 shows the inverse compression ratio entropy measures
for the records of the U.S. Supreme Court in the last century. We
can see that entropy has decreased since the 1950s, indicating an
increase in the relative structure or predictability in the text.</p>
      <p>This trend can be interpreted as a more formalised and
standardised writing style. The shift could be due to the ongoing expansion</p>
      <p>To further substantiate the above ideas, we selected from each
corpus (SCOTUS, BGH Zivil- and Strafsenat, EuroParl German and
English) the tokens with the lowest local entropy values, i.e. $\leq 1$. Fig. 4
shows word clouds for these lowest-entropy words in our vocabulary.</p>
      <p>For the BGH (bottom left) one recognises key phrases from
procedural law such as, e.g. ‘zurückverweisen’ (to send back a
request). We see technical language for civil cases, such as
‘Insolvenzverfahrens’ (bankruptcy proceeding). For the SCOTUS, we see
procedural, criminal and civil technical phrases such as ‘beyond
reasonable’ and ‘qualified immunity’. For the EuroParl data, the
dominating lowest entropy phrases are procedural and related to
the Parliament’s sessions, such as, e.g. the German ‘siehe_Protokoll’
which corresponds to the English ‘see_Minutes’.</p>
      <p>The very low entropy words serve as functional foundations
that typify the respective environment and set the tone.
These recurring phrases have a very precise meaning, as the
human reader recognises, and as is quantitatively reflected in our
neural model.</p>
      <p>An in-depth analysis of the precise distribution of the local
entropies along the different linguistic axes, and the broader syntactic
and semantic categories, is left for a separate publication.</p>
    </sec>
    <sec id="sec-15">
      <title>CONCLUSION</title>
      <p>Our analysis has shown that the writing style in civil law has lower
relative entropy than the common law, at least in the important
cases of the SCOTUS and BGH. We have shown this for two
measures. First, local ambiguity, i.e. word entropy, produced using a
neural language model, and second, global entropy produced from a
compression ratio algorithm. Civil and common law writing styles
are distinguishable on a purely information-theoretic basis.</p>
      <p>The results are helpful from the perspectives of history and social
science. The original German legal doctrine is very much rooted
in jurisprudence and has been strongly influenced, especially after
the second half of the 19th century, by the development of natural
sciences. This systematic approach is reflected in the writing style.
Code-based legal writing requires, as argued above, efficient and
standardised mechanisms of referencing, common to all scientific
writing.</p>
      <p>Our method innovates by using a neural language model,
combined with data compression algorithms, in order to empirically
determine both word and stylistic ambiguity, i.e. local and global
entropy. This approach proves to be fruitful and could integrate
naturally into future enhancements of (deeper) neural language models.
In future work these could provide an even finer spatio-temporal
resolution of how information is distributed across different linguistic
scales and time, ranging from the word to the corpus level.</p>
      <p>In summary, our implementation and use of a local entropy
measure, based on a neural language model, has led to striking
results that contribute to an old debate on legal traditions. The
contribution could be important from both a linguistic and a
legal perspective. We foresee a broad range of further applications.</p>
    </sec>
    <sec id="sec-16">
      <title>THEORY</title>
      <p>Here we give a theoretical description of the steps underlying our
approach.</p>
    </sec>
    <sec id="sec-17">
      <title>Preprocessing</title>
      <p>Let $C$ be a non-empty set, the corpus. For $n \in \mathbb{N}$, consider the map
        <disp-formula><tex-math>g_n : C \to G_n,</tex-math></disp-formula>
where $G_n$ is the, possibly empty, set of $n$-grams (associated to $C$),
which satisfy $G_n \cap G_m = \emptyset$ for $n \neq m$. Usually, the set of unigrams, $G_1$,
is called the vocabulary of the corpus $C$.</p>
      <p>For a fixed $N \in \mathbb{N}$, set
        <disp-formula><tex-math>V_N := \bigcup_{n=1}^{N} G_n,</tex-math></disp-formula>
which is the set of (two-sided) uni-, bi-, tri- up to $N$-grams, and which,
for $N$ large enough, yields an approximation (or pairwise disjoint
decomposition) of the corpus $C$ that captures both syntactic and
semantic information.<sup>3</sup> Then $V_N$ is the (generalised) vocabulary up
to order $N$. The elements $w \in V_N$, or $V$ if $N$ is fixed and clear from
the context, are tokens or $n$-grams, which might be considered as
$n$-order words. We denote by $|V|$ the size of $V$, i.e. the number of
pairwise different tokens.</p>
      <p>The family of maps $g_n$, and hence the specific sets $G_n$, determine
the preprocessing of the corpus data.</p>
    </sec>
    <sec id="sec-18">
      <title>Local Entropy from Word2Vec</title>
      <p>
        The word2vec framework consists of a bundle of mathematical
objects [
        <xref ref-type="bibr" rid="ref19 ref25">19, 25</xref>
        ]. First, it defines a dense Hilbert space representation,
        <disp-formula><tex-math>\mathrm{word2vec} : V \to \mathbb{R}^d, \qquad w \mapsto h_w,</tex-math></disp-formula>
where $d \in \mathbb{N}$ is the dimension of the coordinate space, which is
a hyper-parameter of the model. Let $\mathcal{P}(V)$ denote the set of
discrete probability distributions on $V$. Then, there exists a map
        <disp-formula><tex-math>\mathrm{w2p} : V \to \mathcal{P}(V), \qquad w \mapsto p_w,</tex-math></disp-formula>
which associates to every token $w$ a probability distribution $p_w$,
namely the posterior (multinomial) distribution. The local entropy
or ambiguity is the map
        <disp-formula><tex-math>S : V \to \mathbb{R}_{+}, \qquad w \mapsto H(p_w),</tex-math></disp-formula>
which assigns to every token $w$ the Shannon entropy of the
corresponding probability distribution $p_w$. The posterior distribution is
given by a Boltzmann distribution (softmax).
      </p>
      <p>It is calculated as follows. Let $W$ be the $|V| \times d$ input weight
matrix from the input layer to the hidden layer and $\widetilde{W}$ the $d \times |V|$
weight matrix from the hidden layer to the output layer in the
SkipGram model with hierarchical softmax.</p>
      <p>
        Every token $w_i \in V$ determines a pair of vectors $(v_i, \tilde{v}_i)$, the
input vector $v_i$ and the output vector $\tilde{v}_i$, which are given by the $i$th
row of $W$ and the $i$th column of $\widetilde{W}$, respectively.
      </p>
      <p>
        <sup>3</sup> More general, i.e. functional, neighbourhoods are of course possible, e.g. based on
grammatical information, as considered by Levy and Goldberg [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
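      <p>Let
        <disp-formula><tex-math>Z_i := \sum_{j=1}^{|V|} e^{\langle \tilde{v}_j | v_i \rangle} \tag{1}</tex-math></disp-formula>
be the local partition function corresponding to the target $w_i$, with
the sum taken over all tokens $w_j \in V$. (We use the bra-ket
notation.)</p>
      <p>For the SkipGram model with context $c$, the probability $p_{w_i}(w_j)$
of a token $w_j$ being an actual $c$-context output word of $w_i$ is given
by
        <disp-formula><tex-math>p(w_j | w_i) := p_{w_i}(w_j) := \frac{1}{Z_i} \, e^{\langle \tilde{v}_j | v_i \rangle}. \tag{2}</tex-math></disp-formula></p>
      <p>Therefore, the local entropy of the target $w_i$ (with context $c$) is
given by
        <disp-formula><tex-math>S(w_i) := H(p_{w_i}) = - \sum_{j=1}^{|V|} p(w_j | w_i) \cdot \log_2 \bigl( p(w_j | w_i) \bigr). \tag{3}</tex-math></disp-formula></p>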
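      <p>The calculation described in this appendix (a softmax over the inner products of all output vectors with the input vector of the target, followed by the Shannon entropy in bits) can be sketched as follows. This is a plain full-softmax illustration with a hypothetical function name and toy two-dimensional vectors; it is not the Gensim-based implementation and does not reproduce the hierarchical softmax used for training.</p>
      <preformat><![CDATA[
```python
import math


def local_entropy(target_vec, output_vecs):
    """Shannon entropy (in bits) of the softmax distribution for one target.

    target_vec  : input vector of the target token
    output_vecs : one output vector per vocabulary token
    """
    # Inner product of every output vector with the target's input vector.
    scores = [sum(a * b for a, b in zip(out, target_vec)) for out in output_vecs]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    partition = sum(weights)  # the local partition function, rescaled
    probs = [w / partition for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Toy vocabulary of four tokens with two-dimensional output vectors.
outputs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(local_entropy([0.0, 0.0], outputs))   # uniform distribution: 2 bits
print(local_entropy([50.0, 0.0], outputs))  # sharply peaked: close to 0 bits
```
]]></preformat>
      <p>A token whose context distribution is concentrated on a few words, as for fixed technical phrases, receives a low local entropy, while a token followed by almost anything receives a value close to the logarithm of the vocabulary size.</p>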
    </sec>
    <sec id="sec-19">
      <title>Gensim Implementation</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Elliott</given-names>
            <surname>Ash</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daniel L.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Mapping the Geometry of Law Using Document Embeddings</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bentz</surname>
          </string-name>
          , Dimitrios Alikaniotis, Michael Cysouw, and Ramon Ferrer-i-Cancho
          .
          <year>2017</year>
          .
          <article-title>The Entropy of Words-Learnability and Expressivity across More than 1000 Languages</article-title>
          .
          <source>Entropy</source>
          <volume>19</volume>
          ,
          <issue>6</issue>
          (Jun
          <year>2017</year>
          ),
          <fpage>275</fpage>
          . DOI:http://dx.doi.org/10.3390/e19060275
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Keith</given-names>
            <surname>Carlson</surname>
          </string-name>
          , Michael A Livermore, and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Rockmore</surname>
          </string-name>
          .
          <year>2015-2016</year>
          .
          <article-title>A Quantitative Analysis of Writing Style on the U.S. Supreme Court</article-title>
          . Washington University Law Review
          <volume>93</volume>
          (
          <issue>2015-2016</issue>
          ),
          <fpage>1461</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Joseph</given-names>
            <surname>Dainow</surname>
          </string-name>
          .
          <year>1966</year>
          .
          <article-title>The Civil Law and the Common Law: Some Points of Comparison</article-title>
          .
          <source>The American Journal of Comparative Law</source>
          <volume>15</volume>
          ,
          <issue>3</issue>
          (
          <year>1966</year>
          ),
          <fpage>419</fpage>
          -
          <lpage>435</lpage>
          . http://www.jstor.org/stable/838275
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Stefania</given-names>
            <surname>Degaetano-Ortlieb</surname>
          </string-name>
          and
          <string-name>
            <given-names>Elke</given-names>
            <surname>Teich</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Toward an optimal code for communication: The case of scientific English</article-title>
          .
          <source>Corpus Linguistics and Linguistic Theory</source>
          <volume>0</volume>
          (
          <year>2019</year>
          ). https://www.degruyter.com/view/journals/cllt/ahead-of-print/article-10.1515-cllt-2018-0088/article-10.1515-cllt-2018-0088.xml
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
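          <string-name>
            <given-names>MJ</given-names>
            <surname>Bommarito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>DM</given-names>
            <surname>Katz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J</given-names>
            <surname>Blackman</surname>
          </string-name>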
          .
          <year>2017</year>
          .
          <article-title>A general approach for predicting the behavior of the Supreme Court of the United States</article-title>
          .
          <source>PLoS ONE</source>
          <volume>12</volume>
          ,
          <issue>4</issue>
          (
          <year>2017</year>
          ). https://doi.org/10.1371/journal.pone.0174698
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Hodges</surname>
          </string-name>
          .
          <year>1958</year>
          .
          <article-title>The significance probability of the Smirnov two-sample test</article-title>
          .
          <source>Ark. Mat.</source>
          <volume>3</volume>
          ,
          <issue>5</issue>
          (
          <month>01</month>
          <year>1958</year>
          ),
          <fpage>469</fpage>
          -
          <lpage>486</lpage>
          . DOI:http://dx.doi.org/10.1007/BF02589501
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Honnibal</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ines</given-names>
            <surname>Montani</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing</article-title>
          . (
          <year>2017</year>
          ). To appear.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Martin</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <source>Speech and Language Processing</source>
          (3 ed.). draft; https://web.stanford.edu/~jurafsky/slp3/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>U.</given-names>
            <surname>Kamath</surname>
          </string-name>
          , J. Liu, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Whitaker</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <source>Deep Learning for NLP and Speech Recognition</source>
          . Springer International Publishing. https://books.google.ch/books?id=8cmcDwAAQBAJ
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Daniel Martin</given-names>
            <surname>Katz</surname>
          </string-name>
          and Michael James Bommarito.
          <year>2014</year>
          .
          <article-title>Measuring the complexity of the law: the United States Code</article-title>
          .
          <source>Artificial Intelligence and Law</source>
          <volume>22</volume>
          ,
          <issue>4</issue>
          (
          <year>2014</year>
          ),
          <fpage>337</fpage>
          -
          <lpage>374</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Sara</given-names>
            <surname>Klingenstein</surname>
          </string-name>
          , Tim Hitchcock, and Simon DeDeo.
          <year>2014</year>
          .
          <article-title>The civilizing process in London's Old Bailey</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>111</volume>
          ,
          <issue>26</issue>
          (
          <year>2014</year>
          ),
          <fpage>9419</fpage>
          -
          <lpage>9424</lpage>
          . DOI:http://dx.doi.org/10.1073/pnas.1405984111
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Koehn</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Europarl: A Parallel Corpus for Statistical Machine Translation</article-title>
          . In
          <source>Conference Proceedings: the tenth Machine Translation Summit</source>
          , AAMT, Phuket, Thailand,
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          . http://mt-archive.info/MTS-2005-Koehn.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kontoyiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Algoet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Suhov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Wyner</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Nonparametric entropy estimation for stationary processes and random fields, with applications to English text</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>44</volume>
          ,
          <issue>3</issue>
          (
          <year>1998</year>
          ),
          <fpage>1319</fpage>
          -
          <lpage>1327</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Rafael</given-names>
            <surname>La Porta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Florencio</given-names>
            <surname>Lopez-de-Silanes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Andrei</given-names>
            <surname>Shleifer</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The economic consequences of legal origins</article-title>
          .
          <source>Journal of Economic Literature</source>
          <volume>46</volume>
          ,
          <issue>2</issue>
          (
          <year>2008</year>
          ),
          <fpage>285</fpage>
          -
          <lpage>332</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Dependency-Based Word Embeddings</article-title>
          .
          In
          <source>Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          . Association for Computational Linguistics, Baltimore, Maryland,
          <fpage>302</fpage>
          -
          <lpage>308</lpage>
          . DOI:http://dx.doi.org/10.3115/v1/P14-2050
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Lison</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrey</given-names>
            <surname>Kutuzov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Redefining Context Windows for Word Embedding Models: An Experimental Study</article-title>
          .
          In
          <source>Proceedings of the 21st Nordic Conference on Computational Linguistics</source>
          . Association for Computational Linguistics, Gothenburg, Sweden,
          <fpage>284</fpage>
          -
          <lpage>288</lpage>
          . https://www.aclweb.org/anthology/W17-0239
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Edward</given-names>
            <surname>Loper</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>NLTK: The Natural Language Toolkit</article-title>
          . In
          <source>Proceedings of the ACL Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics</source>
          . Philadelphia: Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg S Corrado, and
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          In
          <source>Advances in Neural Information Processing Systems</source>
          <volume>26</volume>
          , C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc.,
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          . http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.A.</given-names>
            <surname>Montemurro</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Zanette</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Universal Entropy of Word Ordering Across Linguistic Families</article-title>
          .
          <source>PLoS ONE</source>
          <volume>6</volume>
          ,
          <issue>5</issue>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Frederic</given-names>
            <surname>Morin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Hierarchical Probabilistic Neural Network Language Model</article-title>
          .
          In
          <source>Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics</source>
          , Robert G. Cowell and Zoubin Ghahramani (Eds.). Society for Artificial Intelligence and Statistics,
          <fpage>246</fpage>
          -
          <lpage>252</lpage>
          . http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Marcel Alexander</given-names>
            <surname>Niggli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Louis Frédéric</given-names>
            <surname>Muskens</surname>
          </string-name>
          .
          <year>2014</year>
          . BSK StGB-Niggli/Muskens, Art. 11. In
          <source>Schweizerische Strafprozessordnung/Jugendstrafprozessordnung (StPO/JStPO)</source>
          (2 ed.), Marianne Heer, Marcel Alexander Niggli, and Hans Wiprächtiger (Eds.). Vol. 1. Helbing &amp; Lichtenhahn,
          <fpage>3501</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.A.</given-names>
            <surname>Posner</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <source>Economic Analysis of Law</source>
          . Aspen Publishers. https://books.google.ch/books?id=gyUkAQAAIAAJ
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Radim</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          In
          <source>Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks</source>
          . ELRA, Valletta, Malta,
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Xin</given-names>
            <surname>Rong</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>word2vec Parameter Learning Explained</article-title>
          . (
          <year>2014</year>
          ). http://arxiv.org/abs/1411.2738 arXiv:1411.2738.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>George</given-names>
            <surname>Sanchez</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sentence Boundary Detection in Legal Text</article-title>
          .
          In
          <source>Proceedings of the Natural Legal Language Processing Workshop 2019</source>
          . Association for Computational Linguistics, Minneapolis, Minnesota,
          <fpage>31</fpage>
          -
          <lpage>38</lpage>
          . DOI:http://dx.doi.org/10.18653/v1/W19-2204
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          .
          <year>1951</year>
          .
          <article-title>Prediction and Entropy of Printed English</article-title>
          .
          <source>Bell System Technical Journal</source>
          <volume>30</volume>
          ,
          <issue>1</issue>
          (
          <year>1951</year>
          ),
          <fpage>50</fpage>
          -
          <lpage>64</lpage>
          . DOI:http://dx.doi.org/10.1002/j.1538-7305.1951.tb01366.x
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ziv</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lempel</surname>
          </string-name>
          .
          <year>1977</year>
          .
          <article-title>A universal algorithm for sequential data compression</article-title>
          .
          <source>IEEE Transactions on Information Theory</source>
          <volume>23</volume>
          ,
          <issue>3</issue>
          (
          <year>1977</year>
          ),
          <fpage>337</fpage>
          -
          <lpage>343</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>