<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MWCC: A Corpus of Malawi Criminal Cases</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Amelia V. Taylor University of Malawi, The Polytechnic and tNyasa Ltd</institution>
          ,
          <addr-line>Data Labs Blantyre</addr-line>
          ,
          <country country="MW">Malawi</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>We describe the creation of a corpus of criminal court judgments issued by the Malawian courts. We highlight opportunities and challenges in machine understanding of this text.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>This article presents the creation of a corpus of criminal case
judgments issued by appellate courts in Malawi and our experiments in
preparing this text to be used with machine learning algorithms.</p>
      <p>
        In Malawi, legal researchers face significant challenges in
accessing and searching for relevant information. The Malawi Judiciary
Development program, which ran from 2003 to 2008, found that
“an inadequate provision of fundamental legal resources, such as
books, case reports, statute books and gazettes, greatly constrains
the performance of the judiciary in its administration of justice”. In
2013, the Malawi Judiciary, with funding from the European Union,
introduced a case management system for use in the High Court and
the Director for Public Prosecution [
        <xref ref-type="bibr" rid="ref18 ref6">6, 18</xref>
        ]. This new system has
improved the case registration process but suffers from bottlenecks
in processes and document logging; few case documents and final
judgments are stored on the system and most of these contain no
meta-data [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>In the last few years, MalawiLII<fn id="fn1"><label>1</label><p>malawilii.org</p></fn> has provided online access to some
of the court judgments, laws and statutes in Malawi<fn id="fn2"><label>2</label><p>Albeit not complete or up to date, this is an improved successor to the listing of
judgments that was initially provided on the SDNP webpage, which listed judgments,
court documents, some of the laws and the Constitution of Malawi: http://www.sdnp.org.mw/index-archived.php</p></fn>. MalawiLII
does not support a system of citation that makes it possible to link
statutory law, case law and secondary law or to search by “legal
terms” and their specific interpretations.</p>
      <p>In view of these challenges, we started the development of an
automatic tool that (I) provides meta-data for criminal court
judgments on MalawiLII by demarcating their text into components
such as headers, introduction, body and conclusion and extracting
meta-data such as names of judges, dates and court of hearing; and
(II) provides a useful classification of the judgments for legal
research.</p>
      <p>In this paper we describe the creation of the corpus used in
these two tasks and our results regarding the first. The paper is
structured as follows. In Section 2 we review relevant literature.
In Section 3 we describe the steps we took in creating the Malawi
Criminal Cases Corpus (MWCC) and discuss adding markup to
the files. In Section 4 we describe the types of annotations of law
and case citations we added to the corpus. In Section 5, by means
of examples from our corpus we illustrate challenges in machine
understanding of legal text. We conclude in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2 LITERATURE REVIEW</title>
      <p>
        A corpus is ’a collection of examples of language in use that are
selected and compiled in a principled way’ [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. A list of corpora
containing legal text is given in [
        <xref ref-type="bibr" rid="ref23 ref26">23, 26</xref>
        ]. These vary in size and
genre coverage 3. A small number of the corpora listed specialise in
criminal judgments. However, these are not available or regularly
maintained and seem to have been developed to serve only a specific
research objective: the HOLJ House of Lords Judgments
Corpus is a small subset (188 texts) of the collection of
the House of Lords Judgments and was used for summarisation and
rhetorical structural annotation [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. The Corpus de Sentencias
Penales 2005 - 2010 was used to study ’legal phraseology’ [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
      <p>
        There are also clusters of research around some corpora, e.g.,
corpora of Italian legal text have been used in generating
dictionaries of legal terms [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], in analysing their usage [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and to assist
in translations [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Similarly, a corpus of Dutch legislation was
annotated using the Metalex XML scheme 4 and then enhanced
with meta-data regarding the document structure, external
descriptive data and law citations meta-data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Corpora of Greek Tax
Legislation [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ] and Greek Supreme Court decisions [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] were
enhanced with XML structural mark-up and annotations. These
projects did not use machine learning but made use of linguistic
features of the text, regular expressions or syntactic parsers and
grammars for data extraction.
      </p>
      <p>
        The process of machine understanding of legal text involves a
great deal of semantic enhancement, in order to make explicit or
machine understandable ’the flexibility, intuition and capabilities
of the conceptual structures of the human languages’ in readiness
for Web 4.0 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The reality is that most legal text that is being
made available online at present is in an unstructured form, i.e., it
has almost no meta-data and no annotations that would make it possible to
hyperlink it and make it ’machine understandable’.<fn id="fn3"><label>3</label><p>Some contain a wide range of types of text (e.g., academic journals, textbooks,
contracts, opinions, legislation), e.g., the American Law Corpus; some contain only law
reports but cover several types of cases from administrative to criminal cases, e.g., the
Corpus of US Supreme Court Opinions; some contain legal text of historical importance,
for example, the Old Bailey corpus is a historical corpus covering 197,745 texts over
1674 - 1913; some are multi-lingual like the ones covering European legislation, e.g.,
JRC-Acquis, Bononia Legal Corpus.</p></fn><fn id="fn4"><label>4</label><p>http://www.metalex.eu/</p></fn> Hence, current work is
still largely focussed on taking a collection of unstructured text and
adding markup and annotations. As we are dealing with written text,
there is already an inherent organisation within the text itself. So in
this sense, ’adding structure’ means to extract from or externalise
this organisation in the form of markup or annotations on the text.
We included a discussion on this terminology in Appendix A. The
degree of ’organisation’ in legal text varies a lot: text containing
legislation is said to be more organised than that of court judgments,
and notarial contracts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] are more organised than legislative texts,
and among legislations, those referring to tax and administrative
law [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are more structured.
      </p>
      <p>
        In the case of court judgments, there are differences in how
courts within the same country and/or courts in different countries
structure the text. However, all court judgments share common
elements. They all contain, usually in their introduction, information
such as the courts of hearing, dates and case numbering or docket
numbers, names of the judges and other legal parties involved.
They all follow a certain legal rhetoric in which facts are presented,
points of law are discussed and finally the judgment is concluded.
There are also common conventions that are used for citing laws
and other cases. Some of these regularities were used to develop
and test algorithms that employ machine learning techniques to
’understand’ legal text, e.g., resolution of names of legal parties
such as judges [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], resolution of citations to laws or other cases
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], extracting citations to laws [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], automatic summarization of
court judgments [
        <xref ref-type="bibr" rid="ref14 ref7">7, 14</xref>
        ].
      </p>
      <p>
        The trend seems to be that researchers collect their own data
and use that to develop or test algorithms; a particular data set may
never be used in another study. Noting this trend, [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] sets out a
best-practice guide for the collection and analysis of legal corpora
for linguistic analysis to ensure a certain degree of generality of the
research results found when using a custom corpus. Generalisation
issues may come from the impact that the genre of the text within a
corpus has, for example, on the task of assigning meaning to terms,
e.g., collocates of "breach" across different corpora may belong to
different definitions or meanings of the word.
      </p>
      <p>
        We think that there is a need to set out similar guidelines for the
use of legal corpora for machine learning purposes. For machine
learning algorithms, generalisation challenges can be even less
obvious because of the interplay between the impact of the language
models used and of the differences in size and type or ’genre’ of the
training data versus the test data. An experiment that measures text
similarity of legal documents [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] showed that a word2vec model
was better than a bag of words model and the size of the training
data compared to that of the corpus impacts the accuracy of the
similarity results. However, to explain these results the authors
spend very little time describing their data apart from describing it
as ’selected larceny cases’.
      </p>
      <p>Putting together a corpus takes significant time and involves a
diversity of linguistic and computing skills. In building the MWCC
corpus we tried to ensure a certain ’separation of the corpus design’
from research design so that other researchers can
use our corpus. We present the construction of a corpus of Malawi
criminal case judgments from a set of ’unstructured’ text files and
describe some of our experiments in adding markup and
annotations to the judgments. By means of examples from our corpus, we
reflect on the importance of cooperation between linguistic and
machine learning expertise in putting together legal corpora to
solve challenges in machine understanding of the legal text. For
reasons of space, we placed some important terminology
definitions in Appendix A.</p>
    </sec>
    <sec id="sec-3">
      <title>3 THE MALAWI CRIMINAL CASES CORPUS (MWCC)</title>
      <p>
        We followed the guidelines of [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] in creating the corpus.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3.1 The Target Domain for MWCC Corpus</title>
      <p>The data for the MWCC corpus are the criminal case judgments stored
as electronic doc and pdf files and as scanned images. These are obtained from
the High Court Library. The librarian scans the physical judgments
received from the High Court Registry, page by page, and stores
them as pdf files. The physical papers are then catalogued in folders
by year, and some of the scanned judgments are sent to law firms,
judges and other parties which subscribe to this service. These
are also uploaded to MalawiLII. The electronic scans have been
named by the High Court Librarian according to a convention:
[Case Name] [Case Type] [Case Number] [Case Year]. For example,
Lawrence Chibwana Vs The State Criminal Appeal No. 42 of 2010.pdf.
In some cases the name of the judge is also present in the title. In
some cases, the naming of files does not correspond to their content,
or names of parties have been misspelled.</p>
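      <p>The naming convention above lends itself to automatic cataloguing. The following is a minimal sketch, assuming that file names follow the pattern of the example; the regular expression, the list of case types and the function name are illustrative assumptions and not part of the original tooling. Names that do not fit the pattern are returned as None so that they can be reviewed by hand.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical pattern for the convention described in the text:
# [Case Name] [Case Type] No. [Case Number] of [Case Year].pdf
# The list of case types below is an assumption made for illustration.
FILENAME_RE = re.compile(
    r"^(?P<name>.+?)\s+"
    r"(?P<type>Criminal Appeal|Confirmation Case|Criminal Case)\s+"
    r"No\.?\s*(?P<number>\d+)\s+of\s+(?P<year>\d{4})"
    r"\.pdf$",
    re.IGNORECASE,
)


def parse_case_filename(filename):
    """Return a dict of catalogue fields, or None if the name does not fit."""
    match = FILENAME_RE.match(filename.strip())
    if match is None:
        return None  # irregular or misnamed files need manual review
    fields = match.groupdict()
    fields["number"] = int(fields["number"])
    fields["year"] = int(fields["year"])
    return fields


example = "Lawrence Chibwana Vs The State Criminal Appeal No. 42 of 2010.pdf"
print(parse_case_filename(example))
```
]]></preformat>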
      <p>
        The names of cases as retrieved from file names can be used to
create a case citator database or, if one exists, to cross-check them
against it. To our knowledge, the Malawi High Court Library
does not systematically maintain a case citator database. For legal
researchers, it is important to know which of these cases have
been reported in official law reports, as these receive a special
naming convention. Identifying a citation is only useful if it can be
’resolved’ and matched against an external knowledge source. A
manual search for prior cases typically involves formulating a query
(using party names, dates, docket numbers, and courts), retrieving
documents from a database of millions of opinions, and iterating the
process until the right cases are found. Challenges in case names
resolutions were discussed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] where the authors describe the
development of a tool that provides automated assistance to the
citators of Thomson Legal and Regulatory. In some cases, a
citation cannot be resolved if there is insufficient data in the context
or if the judge refers to case documents that are not available or
numbered (e.g., references to affidavit documents attached to the
case).
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.2 The Design and Collection of the MWCC Corpus</title>
    </sec>
    <sec id="sec-6">
      <p>We collected 682 criminal court judgments issued over 2010-2019
by the High Court and the Supreme Court of Malawi. These were
stored as scanned images of physical documents. The files were
roughly organised on disk according to the year in which they
were issued. The steps we took in the preparation of the text for
the MWCC corpus are: (I) File cataloguing: rename the files with
shorter names, remove special symbols, and maintain a mapping
for the naming. There was also a need to correct misspellings of
names of parties present in the title and to remove duplicate files.
In some instances several cases had been scanned together
and saved in a single file; these we had to split. (II) Image
adjustments: straighten, remove watermarks, remove imperfections
due to the scanning process; (III) Batch OCR: run page-by-page
OCR, obtaining the text corresponding to each line (word by word) in
the image and saving this in JSON files which also contain some text
formatting information, such as distances between lines and font
sizes; (IV) Text reconstruction and corpus creation: reconstruct
the text from the files obtained by OCR and create the corpus files
in text and XML format. We used OpenCV in Python to deal with
watermarks and markings on the text, we wrote a Python
batch program to split and merge back the images, we used the ocr.space
API 5 for the OCR on the images, and we used custom Python code
to process the JSON files returned by the OCR API.</p>
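      <p>Step (IV) can be illustrated with a small sketch that rebuilds text lines from word-level OCR output while keeping the layout information needed later. The sample response is made up, and the field names (ParsedResults, TextOverlay, Lines, Words, WordText, MinTop, MaxHeight) are assumed to follow the overlay layout returned by the OCR service; they should be checked against its documentation before use.</p>
      <preformat><![CDATA[
```python
import json

# A made-up OCR response for illustration only. The field names are an
# assumption about the overlay layout of the OCR service.
SAMPLE = json.dumps({
    "ParsedResults": [{
        "TextOverlay": {"Lines": [
            {"MinTop": 40, "MaxHeight": 12,
             "Words": [{"WordText": "IN"}, {"WordText": "THE"},
                       {"WordText": "HIGH"}, {"WordText": "COURT"}]},
            {"MinTop": 58, "MaxHeight": 12,
             "Words": [{"WordText": "OF"}, {"WordText": "MALAWI"}]},
        ]}
    }]
})


def lines_from_ocr(raw_json):
    """Rebuild text lines, keeping vertical position and height per line."""
    lines = []
    pages = json.loads(raw_json)["ParsedResults"]
    for page_no, page in enumerate(pages, start=1):
        for line in page["TextOverlay"]["Lines"]:
            text = " ".join(word["WordText"] for word in line["Words"])
            lines.append({"page": page_no, "top": line["MinTop"],
                          "height": line["MaxHeight"], "text": text})
    return lines


for entry in lines_from_ocr(SAMPLE):
    print(entry)
```
]]></preformat>
      <p>Keeping the top and height values alongside the text is what makes distance-based heuristics for paragraphs, headers and footnotes possible in later steps.</p>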
      <p>The image preparation stage could be improved by using
techniques for automatically detecting image features which, if known
in advance, can be useful for improving the quality of the OCR:
most judgments contain official stamps, some outside the text, some
on top of the text, and most contain signatures of the judges or official
clerks. These can be isolated or removed before the OCR.</p>
      <p>The trickiest part of the OCR process on these judgments was
the presence of headers, footers and footnotes. The headers usually
contained pagination and/or the name of the case contained in the
document. The header could not always be removed automatically
based on text features, such as font size or distance to the main
body of the text, as in many cases the font was the same and the
headers were so close to the main text of the judgment that they appeared
to be a normal part of the text. The footnotes also cannot be removed
automatically because they contain relevant legal information. The
footnote example in Figure 1 contains several case citations, e.g.,
[1994] MLR 288 (HC) at 307. This is an incomplete citation where
one part, the case name, is in the main judgment text and the case
citation is in the footnote. The ocr.space API extracts all textual
information including the footnotes but these are not distinguished
from the rest of the text. Heuristics based on structural information
such as indentation, differences in font sizes and distances from the
main text could be used to recognise footnotes with some success.<fn id="fn5"><label>5</label><p>https://ocr.space</p></fn></p>
      <p>Another challenge was the frequent use of quotations, where
a judge was discussing points relevant to the case at hand using
extracts from law or from relevant cases. Some quotations used
block quotes or other quotation marks. Others used indentation,
italics or syntactic clues, such as specific keywords that
indicate their presence. It may be beneficial to use extra processing
steps (e.g., using Tesseract 6) to identify the presence of quotes in
the text and to mark these as special parts of the text flow.</p>
      <p>The electronic files of the corpus are structured into folders, one
for each year. Each judgment has three files: a text
file for the introduction, a text file for the body with each paragraph
on one line, and a TEI XML file with markup and judgment paragraphs.
We also have a separate file that maps the name of each file in the
corpus to the name of the raw data file.</p>
    </sec>
    <sec id="sec-7">
      <title>3.3 MWCC Corpus Statistics</title>
      <p>
        We can describe our corpus according to the criteria in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] as a
full text (each text in the corpus is unabridged), synchronic (covers
the period 2010 - 2019 and hence there is not a ’noticeable’ change
over this period in the way language is used or any change in the
vocabulary used), terminological (our text contains both general
and specific legal terms), monolingual (but containing names of
people, organisation, geographical places that are typical of Malawi).
The corpus contains 1,572,956 tokens, 1,374,635 words (a word may
appear more than once), 63,574 sentences and 22,124 paragraphs
extracted from 682 documents. There are 29,238 unique words, with
a lexical variation of 2.1%. Table 1 shows a breakdown of cases per
top 10 judges and Table 2 shows the breakdown of cases per year
and shows sizes of yearly sub-corpora.
      </p>
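      <p>The lexical variation figure can be checked directly from the counts above. The short calculation below assumes that lexical variation denotes the ratio of unique words to all words (a type/token ratio); this reading is an assumption, as the text does not define the measure.</p>
      <preformat><![CDATA[
```python
# Counts as reported in the text.
tokens = 1_572_956
words = 1_374_635
unique_words = 29_238
documents = 682

# Assumed definition: unique words divided by all words, as a percentage.
lexical_variation = 100 * unique_words / words

print(f"lexical variation: {lexical_variation:.1f}%")
print(f"average tokens per document: {tokens / documents:.0f}")
```
]]></preformat>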
      <p>We used Sketchengine 7 to analyse the corpus in terms of part
of speech tags, word lists and collocations. Table 3 shows the main
part of speech frequencies for words that appear at least 5 times
and excluding non-words. These represent 80% of our corpus. The
percentage distributions are calculated on the whole corpus. Nouns,
verbs and prepositions appear quite frequently. We also notice
frequent use of adjectives; here are the top fifteen adjectives: criminal,
other, low, such, first, guilty, same, unreported, reasonable, maximum,
convict, public, present, appropriate, excessive. Several of these are
specific to legal language. The top ten most frequent nouns
are all specific to legal language: court, sentence, case, evidence,
offence, appellant, section, person, court, theft.</p>
      <p>Using Sketchengine we could also analyse the language used
in our corpus compared to other corpora. In particular, we can
look at corpora built for general English language use such as
the English Web corpus 2013, an English corpus made up of texts
collected from the Internet, containing 15 billion words. Compared
to this corpus, in ours we see a much heavier use of prepositions
and a lesser use of verbs compared to nouns. We can also find those
n-grams or multi-words which appear frequently in our corpus and
very rarely in the comparison corpus, such as criminal procedure,
hard labour, maximum sentence, theft simpliciter, first offender,
accused person, reasonable doubt. Such a comparison can be used to
extract features useful for a machine learning classification or topic
extraction analysis.<fn id="fn6"><label>6</label><p>https://github.com/tesseract-ocr/tesseract</p></fn><fn id="fn7"><label>7</label><p>https://sketchengine.eu</p></fn></p>
      <p>There is also language that is particular to certain judges, e.g.,
theft simpliciter is used mainly by judge D F MWAUNGULU.</p>
      <p>While word-lists and lists of keywords give us some useful
statistics about the composition of our corpus, they do not take into
account the context in which terms occur. When looking at a
specific sequence of tokens/words, the context surrounding a keyword
is important. Such an analysis is called concordance or collocation
analysis. Concordances are useful in finding out relevant
connections between words (modifiers of specific words) and also in revealing
multi-word units, e.g., detecting names of organisations, High
Court of Malawi, or names of legal functions such as court clerk,
attorney to state council, etc.</p>
      <p>A collocation is a sequence or a combination of words that occur
together more often than would be expected by chance. The
strength of collocation is measured by the LogDice score (the higher
the score, the stronger the collocation). Word collocations can help
understand the usage pattern of key legal terms, e.g., top modifiers
of murder as a verb are brutally, mercilessly, allegedly. These can
indicate the seriousness of the crime and/or the intention. The
collocates of crime are consequence, offender, alibi, criminal, circumstance
and the word ’criminal’ has the strongest collocations with
dangerous, hardened, unknown, hardcore, habitual. Collocates for key
legal terms can be used in topic extraction and the classification of
judgments.</p>
    </sec>
    <sec id="sec-8">
      <title>3.4 Adding Structural Markup to the MWCC Corpus</title>
    </sec>
    <sec id="sec-9">
      <p>We have two formats for the files of the corpus: (a) an all-text format
and (b) an XML TEI format<fn id="fn8"><label>8</label><p>https://tei-c.org/</p></fn>. All judgments contain a front cover
with information on the parties, the court of hearing, the dates
and number of the case, and the coram who heard the case (this includes
the judge, attorneys and other judicial clerks). It is possible to
automatically separate this part from the main body of the judgment.
In the text-only format of the corpus, we keep separate files for the
introduction (an example is given in Figure 2) and separate files
for the paragraphs of the body of the judgment, where each paragraph is
stored in one line of text.</p>
      <p>We based our separation of the introduction from the rest of the
body on an algorithm that (a) looks for the presence of specific
terms such as ORDER IN CONFIRMATION, RULING and (b) uses
formatting differences such as distances between lines of text used
in the introduction versus the rest of the text.</p>
      <p>Subcorpus of Introductions. We thus obtained a sub-corpus made
up of only the introduction parts of the judgments. Out of that, we
created a dictionary of legal keywords from all introductions (Table
4 of Appendix B), which was then used to extract the legal parties
involved in a case, such as the names of the parties, the judge, etc. This
external meta-data was then added into the XML version of the
corpus as meta-data for each judgment. An example of this
meta-data is given in Appendix B.</p>
      <p>
        While our approach did not involve machine learning, there is
scope to use our sub-corpus to test supervised learning approaches
to extract this information. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the authors did something similar
to us in the sense that they extracted formatting features which
were later used in a supervised algorithm for extracting headings
from pdf documents.
      </p>
    </sec>
    <sec id="sec-10">
      <title>3.5 Chunking and Proper Names Recognition</title>
      <p>Chunking poses many challenges. Some judgments are very long
and may contain long paragraphs. Table 2 gives an indication of the
maximum average length of judgments per year: ranging from 90 to
over 200 tokens. We debated whether to store the text line by line,
to split it into sentences using punctuation or to group the text in
the same logical paragraphs as they were in the original images. We
opted for the latter. We wanted to make sure we capture situations
in which entities of interest break across lines. For example, in some
case citations, one line may contain the names of the parties and
another line, the court and dates. We used a heuristic based on the
distances between lines to re-arrange the text to match the original
paragraphs. We did not use punctuation to split into sentences
because the text contained many ’entities’ or elements which make
use of full-stops, e.g., numbers, references to sections of law.</p>
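      <p>The distance-based heuristic for restoring paragraphs can be illustrated as follows. This is a minimal sketch under stated assumptions: lines are given in reading order for one page, a new paragraph starts when the gap to the previous line exceeds a multiple of the median gap, and the multiplier is a hypothetical tuning value. The sample lines are made up, and show a case citation that breaks across two lines being kept in one paragraph.</p>
      <preformat><![CDATA[
```python
from statistics import median


def group_paragraphs(lines, factor=1.5):
    """Join consecutive lines into paragraphs using vertical gaps.

    `lines` holds dicts with "text", "top" and "height" keys. A gap larger
    than `factor` times the median gap starts a new paragraph.
    """
    if not lines:
        return []
    gaps = [b["top"] - (a["top"] + a["height"])
            for a, b in zip(lines, lines[1:])]
    typical = median(gaps) if gaps else 0
    paragraphs, current = [], [lines[0]["text"]]
    for gap, line in zip(gaps, lines[1:]):
        if gap > factor * typical:
            paragraphs.append(" ".join(current))
            current = []
        current.append(line["text"])
    paragraphs.append(" ".join(current))
    return paragraphs


sample = [
    {"text": "Republic v Chizumila and others", "top": 100, "height": 12},
    {"text": "[1994] MLR 288 (HC) at 307 was cited.", "top": 116, "height": 12},
    {"text": "The appeal is dismissed.", "top": 150, "height": 12},
    {"text": "The sentence is confirmed.", "top": 166, "height": 12},
]
for paragraph in group_paragraphs(sample):
    print(paragraph)
```
]]></preformat>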
      <p>We used POS tagging to extract parts of our text which
were likely to contain references to laws and cases. The English
TreeTagger PoS tagset used by Sketchengine struggled with proper
nouns because legal text capitalises many words
for legal terms such as laws, e.g., Penal Code, legal parties, e.g.,
Appellant, or legal functions, e.g., Court Interpreter, references to
laws, e.g., "Section", or names of crimes, e.g., "Manslaughter". These
were usually tagged as nouns, but at times they were tagged as
proper nouns as in I/PP thus/RB convicted/VVD the/DT accused/VVN
of/IN the/DT offence/NN of/IN Manslaughter/NP contrary/NN to/IN
Section/NP 208/CD of/IN the/DT Penal/NP Code/NP, or even verbs as
in Whereas/IN MUSATOPE/NP CHAPOTERA/NP was/VBD charged/VVN
with/IN the/DT offence/NN of/IN murder/NN of/IN Yohane/NP
Makiyi/NP contrary/NN to/TO section/VV 209/CD of/IN the/DT
Penal/NP Code/NP. In this example, "section" is not capitalised, but it
is tagged as a verb possibly because of the presence of ‘to’ which
usually precedes the infinitive form of a verb. The shape NP-NP
is the most common for 2-grams in our text, and may correspond
for example to names of people or places, but also to legal terms
such as Appellant Andrew, Judge Mwase, legal bodies such as High
Court, or Detective Sergeant, or names of laws, e.g., Drugs Act. It is
therefore important to have a way of distinguishing these legal
terms from the rest of the text to enable more accurate tagging.
Using a list of relevant legal keywords and their use in context may
help with improving the POS tagging for legal text. We hope to
evaluate legal-specific POS tagging methods in future
research using the MWCC.</p>
      <p>
        The names of people, places and organisations which are
particular to Malawi are not easily recognised by existing language models.
The names used in Malawi are of Bantu origin [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] with European
influences, hence sometimes parts of names are recognised while
others are not. Names of people frequently appear in our text. We
will annotate our text with Bantu names of people and places. We
think that the MWCC can be used for building a training set of
typical Bantu names to be used with recent advances in BERT and
transformers. For example, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] used the BERT model to recognise
names of entities in Bulgarian, Czech and Polish and in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] BERT
was used to recognise Chinese names.
      </p>
    </sec>
    <sec id="sec-11">
      <title>4 ADDING ANNOTATIONS TO THE MWCC CORPUS</title>
    </sec>
    <sec id="sec-13">
      <title>4.1 Law Citations</title>
      <p>There are several types of reference to laws found in our text. For
example, there are references containing only the name of the law/statute:
The following offences involving dishonesty in the Penal Code are
based on circumstances.... or ...the Control of Goods Act derives its
procedure in criminal matters from the Criminal Procedures and
Evidence Code.</p>
      <p>There are references containing labels and names of the law, e.g.,
Section 11 (2) of the Supreme Court of Appeal Act or Section 283 of
the Penal Code.</p>
      <p>There are more complex types such as references by means of
anaphors spanning more than one line, sentence or paragraph:
Section 12 of the Act...
section of the same constitution ...
...in the Penal code...theft from a person (section 282(a)); theft from a
dwelling house (section 282 (b))..</p>
      <p>Appendix C gives a more comprehensive list. We annotated each
judgment with law citations: an example is given in Table 5 of
Appendix B.</p>
    </sec>
    <sec id="sec-14">
      <title>4.2 Case Citations</title>
      <p>Case citations may refer to cases published in official law reports or
to unpublished cases, each of these using different styles of citation.
A citation from the Malawi Law Report is:</p>
      <p>Republic v Chizumila and others [1994] MLR 288 (HC) at 307
where Republic v Chizumila and others are the parties involved
(also forming the case name), 1994 is the year of publication of the
Malawi Law Reports, 288 is the case number and 307 is the location.
Neutral citations were introduced in the UK in 2001 and are used
by MalawiLII. For example, on MalawiLII the case:
Dalikeni and Others v The Republic (MSCA Criminal Appeal Case
No. 6 of 2016)
is numbered as: Dalikeni and Others v The Republic [2019] MWSC
8 where MWSC stands for Malawi Supreme Court and this is the
eighth case registered on MalawiLII under this court. An example
of an unreported case is: Republic vs Mpinganjira Bagala HC/PR
confirmation case no. 24 of 2011 (unreported 11 July 2013) where HC/PR
stands for High Court Principal Registry.</p>
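      <p>The citation component of the three styles above is regular enough to be captured with patterns, even though the party names are not. The sketch below is illustrative only: the two regular expressions and the function name are assumptions, they cover just the example forms quoted above, and they make no attempt to recover the case name, which requires lookup or name recognition.</p>
      <preformat><![CDATA[
```python
import re

# Bracketed-year citations: law reports ("[1994] MLR 288 (HC) at 307")
# and neutral citations ("[2019] MWSC 8").
BRACKETED = re.compile(
    r"\[(?P<year>\d{4})\]\s+(?P<series>[A-Z]+)\s+(?P<number>\d+)"
    r"(?:\s+\((?P<court>[A-Z]+)\))?(?:\s+at\s+(?P<at>\d+))?"
)

# Unreported cases: "HC/PR confirmation case no. 24 of 2011".
UNREPORTED = re.compile(
    r"(?P<registry>[A-Z]+/[A-Z]+)\s+(?P<kind>[A-Za-z ]+?)\s+[Nn]o\.?\s*"
    r"(?P<number>\d+)\s+of\s+(?P<year>\d{4})"
)


def find_citations(text):
    """Return the citation components found in `text` as dictionaries."""
    found = []
    for m in BRACKETED.finditer(text):
        found.append({"style": "bracketed", **m.groupdict()})
    for m in UNREPORTED.finditer(text):
        found.append({"style": "unreported", **m.groupdict()})
    return found


examples = [
    "Republic v Chizumila and others [1994] MLR 288 (HC) at 307",
    "Dalikeni and Others v The Republic [2019] MWSC 8",
    "Republic vs Mpinganjira Bagala HC/PR confirmation case no. 24 of 2011 "
    "(unreported 11 July 2013)",
]
for example in examples:
    print(find_citations(example))
```
]]></preformat>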
      <p>
        The presence of names of people or organisations means that
grammar rules or regular expressions cannot work on their own,
but could be combined with lookup and some form of supervised
learning. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] used supervised statistical models to extract
standardised case citations of the type ’[1994] MLR 288’ from a selection
of 250 Pakistani court judgments. Their algorithms relied on
training data in which case citations were manually tagged using the
Inside-Outside-Beginning (IOB) notation. In a much larger project at
Thomson Legal and Regulatory [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], a ’citator’ database was available
(containing a list of all available names of cases) and the task was
to resolve the citations found into the citator. A Support Vector
Machine (SVM) was used to improve the accuracy of the entity (name
of cases) resolution. SVMs were also used for entity resolution in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]
to match names of judges/attorneys and names of legal firms from
text files with Westlaw records of attorney and legal firm files.
      </p>
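      <p>To make the tagging scheme concrete, the following sketch shows how a manually marked citation span would be turned into IOB tags for use as training data. The sentence, the span and the label name CITATION are made up for illustration and are not taken from the cited work.</p>
      <preformat><![CDATA[
```python
def iob_tags(tokens, spans, label="CITATION"):
    """Assign B-/I-/O tags given (start, end) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# A made-up sentence in which tokens 3 to 5 form the citation.
tokens = "as held in [1994] MLR 288 by the court".split()
tags = iob_tags(tokens, [(3, 6)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```
]]></preformat>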
      <p>We think that the extraction of case citations could, in some
cases, be done directly from the scanned images, as most judges use
italics or bold font when writing such citations. A supervised
algorithm that works on image data could then be practical. However,
as shown in Figure 3, the convention used in the documents of
our corpus is that only the case name is formatted differently, not
the citation component. Some citations are partial, such as
‘Kachere and Nseula’ shown in the image, and need to be resolved
in context. In the next section we describe our experiments in
extracting law citations.
</p>
    </sec>
    <sec id="sec-15">
      <title>5 EXPERIMENTS WITH SPACY</title>
      <p>Our corpus served as an excellent data set to test extracting law and
case citations and to generate test data for a supervised approach.
spaCy (https://spacy.io/) is a Python library that uses state-of-the-art
neural networks for tagging, parsing and entity recognition. The
Named Entity Recogniser (NER) in spaCy already has an entity type "LAW".
For the English language, spaCy provides three models of varying sizes,
small (sm), medium (md) and large (lg), trained using Convolutional
Neural Networks on the OntoNotes 5.0 data set. The accuracy of the
spaCy NER was reported to be over 80% for both precision and
recall.</p>
      <p>Our approach was as follows. We first used the standard spaCy
NER to extract LAW entities, then we added an Entity Ruler to
extract additional LAW entities. For example, the pattern in Figure 4
of Appendix D matches references to sections which use two-level
numbering, such as Section 4 (a) or s. 4 (2) or section 42(2) (f). We
used a Phrase Matcher based on a database of names of laws and
statutes in Malawi to extract LAWNAMES entities. We then merged
these entities (e.g., the reference part merged with the law name)
into larger ones and eliminated duplicates.</p>
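      <p>The steps above can be sketched with plain regular expressions. This is a simplified stand-in, not the spaCy Entity Ruler and Phrase Matcher configuration used in our experiments: the section pattern, the two-entry list of law names and the function name are assumptions made for this illustration.</p>
      <preformat>
```python
import re

# Hypothetical stand-in for the database of law names.
LAW_NAMES = ["Penal Code", "Supreme Court of Appeal Act"]

# A label (Section, section, s.), a number and up to two bracketed sub-levels.
SECTION_RE = re.compile(
    r"\b(?:[Ss]ections?|s\.)\s*\d+(?:\s*\([0-9A-Za-z]{1,3}\)){0,2}"
)

# A law name that directly follows a reference, e.g. " of the Penal Code".
LAW_RE = re.compile(
    r"\s+of\s+(?:the\s+)?(" + "|".join(re.escape(n) for n in LAW_NAMES) + ")"
)


def find_law_citations(text):
    """Extract section references, merge them with a following law name,
    and drop duplicates while keeping the order of first appearance."""
    citations = []
    for ref in SECTION_RE.finditer(text):
        law = LAW_RE.match(text, ref.end())
        end = law.end() if law else ref.end()
        citations.append(text[ref.start():end])
    return list(dict.fromkeys(citations))


sample = (
    "Section 283 of the Penal Code applies; see s. 4 (2) and "
    "section 42(2) (f). Section 283 of the Penal Code was cited twice."
)
found = find_law_citations(sample)
print(found)
```
      </preformat>
      <p>References that are not directly followed by a known law name are kept as incomplete citations, to be resolved later from their context.</p>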
      <p>Most of the citations that are recognised by the standard spaCy
NER are of the type: Section [number]. However, spaCy recognition
depends on a uniform use of spacing and punctuation such as full stops.
For example, if there are extra spaces, e.g., Section 214 (a) instead
of Section 214(a), the entity will not always be recognised.
Entities of the type Sections 339 and 340 are also not consistently
recognised.</p>
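      <p>One possible way to reduce this sensitivity to spacing is to normalise section references before running the recogniser. The sketch below is such a pre-processing step, written under our own assumptions; it was not part of the experiments reported here, and the pattern and function name are illustrative.</p>
      <preformat>
```python
import re

# A section label, a number and any bracketed sub-levels that follow it.
SECTION_RE = re.compile(
    r"\b(?:[Ss]ections?|s\.)\s*\d+(?:\s*\([0-9A-Za-z]{1,3}\))*"
)


def normalise_section_spacing(text):
    """Remove whitespace before bracketed sub-levels inside section references."""
    def tidy(match):
        return re.sub(r"\s+\(", "(", match.group(0))
    return SECTION_RE.sub(tidy, text)


print(normalise_section_spacing("Section 214 (a) of the Penal Code"))
```
      </preformat>
      <p>Brackets that are not part of a section reference, as in the Bail (Guidelines) Act, are left unchanged.</p>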
      <p>References to laws of England or laws that are typically found in
other countries, such as the Data Protection Act or the Official Secrets Act, are
recognised as these were present in the model. However, names of
laws more particular to Malawi were not always recognised. Table
6 of Appendix D shows examples of law citations extracted using
spaCy and a comparison between the lg and sm spaCy
models: some entities which were found using the small model, sm,
were lost when using lg, but overall the larger model did
result in a more accurate identification of the name of the law cited.</p>
      <p>Table 7 of Appendix D shows the citations we were able to
identify using first the standard spaCy NER and then an
enhanced method using both an Entity Ruler and a Phrase Matcher.
The Phrase Matcher was used to locate the complete names of the laws
referred to in the citations, which allowed us to extract names of laws
that are specific to Malawi. With this combination, we managed
to find almost all the citations within the text.
For example, for the judgments of year 2010, the spaCy NER
managed to extract 507 valid citations (some incomplete). Using the
enhanced process we extracted a total of 1,162 entities, which are citations
(e.g., Section 224 A) and names of laws (e.g., Penal Code). When
merged into full citations (e.g., Section 224 A of the Penal Code),
we obtained a total of 611 citations. For the whole corpus, spaCy
extracted 7,784 law citations out of a total of 18,929 obtained by the
enhanced method. Overall, we extracted 10,390 law citations from
our corpus. Thus, this process of extracting law citations worked
reasonably well and can be used in constructing a training set of
annotations for better results.</p>
      <p>The case and law citations are stored in separate TEI files:
each judgment file has an annotation file containing, for every annotation,
the paragraph, the exact position inside the paragraph, the text of the
annotation and its type. The position of the annotations within a paragraph can also
be used to resolve incomplete citations or anaphors. Some of the
citations are incomplete and do not include the name of the law.
For example, the reference section 235 (a) appears several times in
paragraphs 2 and 3, and some occurrences do not contain the name of
the law. The context of the judgment and the classification of the
laws can help in topic identification, e.g., section 235(a) of the
Penal Code covers issues of causing grievous harm.</p>
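      <p>To make the use of positions concrete, the following sketch resolves incomplete citations by carrying forward the law named in the most recent complete citation, in reading order. The record fields, the offsets and the resolution rule are assumptions made for this illustration; the sketch does not reproduce our TEI annotation format.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Annotation:
    paragraph: int              # paragraph number within the judgment
    start: int                  # character offset inside the paragraph
    text: str                   # annotated text, e.g. "section 235 (a)"
    law: Optional[str] = None   # law name, None if the citation is incomplete


def resolve_incomplete(annotations):
    """Return (text, law) pairs in reading order, filling a missing law
    with the one from the closest preceding complete citation."""
    resolved = []
    last_law = None
    for ann in sorted(annotations, key=lambda a: (a.paragraph, a.start)):
        if ann.law is not None:
            last_law = ann.law
        resolved.append((ann.text, ann.law or last_law))
    return resolved


annotations = [
    Annotation(3, 15, "section 235 (a)"),
    Annotation(2, 120, "section 235(a)", "Penal Code"),
    Annotation(2, 310, "section 235 (a)"),
]
resolved = resolve_incomplete(annotations)
for text, law in resolved:
    print(text, "|", law)
```
      </preformat>
      <p>An incomplete citation that appears before any complete one stays unresolved and keeps None as its law.</p>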
    </sec>
    <sec id="sec-16">
      <title>6 CONCLUSION</title>
      <p>We described the process of creating a corpus of criminal case
judgments issued by Malawi courts. We reflected on the challenges and
opportunities in semantically enhancing this text and on the need for
an intelligent pipeline that processes the text at all stages; some
of the semantic enhancement can be done on raw images, as we
discussed for case citations. We would like to use our annotations
and corpus for further training and classification.
</p>
    </sec>
    <sec id="sec-17">
      <title>A SOME DEFINITIONS</title>
      <p>Markup. Markup adds what is usually called external
information, meaning information about the text. Legal markup for court
judgments includes: case name, case number, court of hearing, date of case
registration, date of judgment, judge, legal parties such as appellant
and respondents, lawyers, court clerks.</p>
      <sec id="sec-17-1">
        <title>Simple Structural Annotation.</title>
        <p>The word structure is used to
mean a particular general arrangement that is present in most texts.
The simplest arrangement can be one in which the text is arranged
in paragraphs, or a text may be arranged in chapters or sections, or
even more generally, as having the three main parts of an introduction,
a body and a conclusion. These structural components follow a
tree-like hierarchy.</p>
        <p>Complex Structural Annotation. In this sense, structure is
dependent on the nature of the text. For example, a case judgment
typically has portions of text in which the facts of the case are
presented, followed by proceedings or the history of the case (e.g.,
previous rulings), a discussion of the relevant points of law and
a conclusion for the case. Structure may also mean rhetorical styles
which are used in some parts of the text.</p>
        <p>Legal Annotations. The annotation in this case refers to locating
specific pieces of text. These can be specific words or phrases.
Usually the pieces of interest appear next to each other in the text, but
sometimes they do not. In the case of legal text, one is interested in
(a) legal terminology; (b) citations to laws and statutes; (c) citations of
other cases.</p>
        <p>Legal Resolution. Annotations with case citations or law citations
need to be standardised so that documents can be hyperlinked.</p>
      </sec>
      <sec id="sec-17-2">
        <title>Legal Classification.</title>
        <p>This usually refers to a semantic
arrangement of the text into a predefined list of categories according to
pre-established criteria. For example, court judgments can be
classified according to a court taxonomy, e.g., civil cases versus
criminal cases versus commercial cases. Some classification criteria are
not linked to a taxonomy, e.g., one can classify court judgments
based on the type of crime they mostly deal with, say theft versus
homicide.</p>
        <p>Topic Extraction. Topic extraction attempts to discover the most
important or relevant keywords in documents. For example, one
would use this to check if the text at hand contains health advice or
a football match commentary. It is common to use topic extraction
in order to classify documents.</p>
        <p>(Un)Structured Legal Text. Legal text is by nature quite well
organised internally; however, by structured legal text we mean text
that contains some or all of the above. Unstructured legal text comprises
doc and pdf files or scanned images of such documents that, apart from being
stored electronically, do not contain any of the above.
</p>
        <p>B CORPUS FILES EXAMPLES</p>
      </sec>
    </sec>
    <sec id="sec-18">
      <title>C TYPES OF LAW CITATIONS</title>
      <list list-type="bullet">
        <list-item>
          <p>References containing only the name of the law/statute:</p>
          <p>The following offences involving dishonesty in the Penal Code are based on circumstances.... or ...the Control of Goods Act derives its procedure in criminal matters from the Criminal Procedures and Evidence Code...</p>
        </list-item>
        <list-item>
          <p>References containing labels and names of the law:</p>
          <p>Section 11 (2) of the Supreme Court of Appeal Act. or Section 283 of the Penal Code.</p>
        </list-item>
        <list-item>
          <p>References containing labels and abbreviations, or additional names by which a law is known (usually appearing in brackets):</p>
          <p>section 6 of the Control of Goods (Import and Export)</p>
          <p>section 4 (d) of Part II of the Schedule to Bail (Guidelines) Act</p>
          <p>s. 149 of CP&amp;EC</p>
          <p>section 17(d) and 42 of the Liquid Fuel and Gas (Production and Supply) Act</p>
        </list-item>
        <list-item>
          <p>References containing labels, names or abbreviations, and the year or date applicable to the law:</p>
          <p>review of section 15 of the Code: it is commonplace that the CP&amp;EC was amended in 2010</p>
          <p>section 340(3) of the Proceeds of Crime Act 2002 (POCA)</p>
        </list-item>
        <list-item>
          <p>References to laws pertaining to other countries (e.g., UK laws mentioned in Malawi court judgments):</p>
          <p>section 145 of the New Zealand Crimes Act of 1961</p>
          <p>Offences against the Person Act, 1861 as held in R v Dica [2004] 2 Cr. App. R. 28</p>
        </list-item>
        <list-item>
          <p>References by means of anaphors spanning more than one line, sentence or paragraph:</p>
          <p>Section 12 of the Act...</p>
          <p>section of the same constitution ...</p>
          <p>...in the Penal code...theft from a person (section 282(a)); theft from a dwelling house (section 282 (b))....</p>
        </list-item>
        <list-item>
          <p>References containing more than one label or number, e.g.:</p>
          <p>Section 2, 3 and 5 of ...</p>
        </list-item>
      </list>
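      <p>References of the last type can be split into one reference per number before they are resolved. The sketch below shows one way to do this under our own assumptions; the patterns and the function name are illustrative and were not part of the reported pipeline.</p>
      <preformat>
```python
import re

# Leading label: Section, Sections, section, sections or s.
LABEL_RE = re.compile(r"^\s*(?:[Ss]ections?|s\.)\s*")
# Numbers are separated by commas or by the word "and".
SEPARATOR_RE = re.compile(r"\s*(?:,|\band\b)\s*")


def expand_sections(reference):
    """Turn 'Section 2, 3 and 5' into one reference per section number."""
    numbers = LABEL_RE.sub("", reference)
    parts = [part for part in SEPARATOR_RE.split(numbers) if part]
    return [f"Section {part}" for part in parts]


print(expand_sections("Section 2, 3 and 5"))
print(expand_sections("section 17(d) and 42"))
```
      </preformat>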
    </sec>
    <sec id="sec-19">
      <title>D RESULTS OF THE SPACY EXPERIMENTS</title>
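      <table-wrap>
        <label>Table 7</label>
        <caption>
          <p>Number of law citations extracted with the standard spaCy NER and with the enhanced method (Entity Ruler and Phrase Matcher). The first row corresponds to the judgments of year 2010.</p>
        </caption>
        <table>
          <thead>
            <tr><th>spaCy NER</th><th>Enhanced method</th></tr>
          </thead>
          <tbody>
            <tr><td>507</td><td>1,162</td></tr>
            <tr><td>554</td><td>1,310</td></tr>
            <tr><td>153</td><td>400</td></tr>
            <tr><td>3,406</td><td>8,432</td></tr>
            <tr><td>621</td><td>1,640</td></tr>
            <tr><td>1,044</td><td>2,414</td></tr>
            <tr><td>469</td><td>1,055</td></tr>
            <tr><td>236</td><td>616</td></tr>
            <tr><td>597</td><td>1,374</td></tr>
            <tr><td>197</td><td>526</td></tr>
            <tr><td>Total: 7,784</td><td>Total: 18,929</td></tr>
          </tbody>
        </table>
      </table-wrap>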
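      <p>As a consistency check, the counts listed in this appendix can be summed and compared with the corpus totals quoted in Section 5. The sketch below assumes that the first ten values are the counts for the standard spaCy NER and the last ten the counts for the enhanced method; the variable and function names are ours.</p>
      <preformat>
```python
# Counts as listed in this appendix, in the order in which they appear.
SPACY_NER = [507, 554, 153, 3406, 621, 1044, 469, 236, 597, 197]
ENHANCED = [1162, 1310, 400, 8432, 1640, 2414, 1055, 616, 1374, 526]


def share(part, whole):
    """Percentage of `whole` accounted for by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


total_spacy = sum(SPACY_NER)
total_enhanced = sum(ENHANCED)
print(total_spacy, total_enhanced, share(total_spacy, total_enhanced))
```
      </preformat>
      <p>Under this assumption the two columns sum to 7,784 and 18,929, which agrees with the totals reported for the whole corpus.</p>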
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Mikhail</given-names>
            <surname>Arkhipov</surname>
          </string-name>
          , Maria Trofimova, Yuri Kuratov, and
          <string-name>
            <given-names>Alexey</given-names>
            <surname>Sorokin</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Tuning Multilingual Transformers for Language-Specific Named Entity Recognition. Association for Computational Linguistics (ACL</article-title>
          ),
          <fpage>89</fpage>
          -
          <lpage>93</lpage>
          . https://doi.org/10.18653/v1/w19-3712
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Atkins</surname>
          </string-name>
          , Sue and Clear, Jeremy and Ostler, Nicholas.
          <year>1992</year>
          .
          <article-title>Corpus design criteria</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>7</volume>
          ,
          <issue>1</issue>
          (
          <year>1992</year>
          ). https://doi.org/10.1093/llc/7.1.1
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Sahib</given-names>
            <surname>Singh</surname>
          </string-name>
          Budhiraja and
          <string-name>
            <given-names>Vijay</given-names>
            <surname>Mago</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>A supervised learning approach for heading detection</article-title>
          .
          <source>Expert Systems</source>
          (
          <year>2020</year>
          ). https://doi.org/10.1111/exsy.12520
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>María</surname>
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Buey</surname>
            , Angel Luis Garrido, Carlos Bobed, and
            <given-names>Sergio</given-names>
          </string-name>
          <string-name>
            <surname>Ilarri</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>The AIS project: Boosting information extraction from legal documents by using ontologies</article-title>
          .
          <source>In ICAART 2016 - Proceedings of the 8th International Conference on Agents and Artificial Intelligence</source>
          , Vol.
          <volume>2</volume>
          . https://doi.org/10.5220/0005757204380445
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Nuria</given-names>
            <surname>Casellas</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Semantic Enhancement of Legal Information</article-title>
          . . .
          <article-title>Are We Up for the Challenge? VoxPopuLII (</article-title>
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Winner</given-names>
            <surname>Dominic</surname>
          </string-name>
          <string-name>
            <surname>Chawinga</surname>
          </string-name>
          , Chaupe, Sellina Khumbo Kapondera, George Theodore Chipeta, Felix Majawa, and
          <string-name>
            <given-names>Chimango</given-names>
            <surname>Nyasulu</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Towards e-judicial services in Malawi: Implications for justice delivery</article-title>
          .
          <volume>86</volume>
          :
          <issue>e12121</issue>
          (
          <year>2020</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          . https://onlinelibrary.wiley.com/doi/epdf/10.1002/isd2.12121
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Min</given-names>
            <surname>Yuh</surname>
          </string-name>
          Day and Chao Yu Chen.
          <year>2018</year>
          .
          <article-title>Artificial intelligence for automatic text summarization</article-title>
          .
          <source>In Proceedings - 2018 IEEE 19th International Conference on Information Reuse</source>
          and
          <article-title>Integration for Data Science</article-title>
          ,
          <string-name>
            <surname>IRI</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Institute of Electrical and Electronics Engineers Inc</article-title>
          .,
          <fpage>478</fpage>
          -
          <lpage>484</lpage>
          . https://doi.org/10.1109/IRI.2018.00076
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Emile de Maat</surname>
          </string-name>
          , Radboud Winkels, and Tom van Engers.
          <year>2006</year>
          .
          <source>Automated Detection of Reference Structures in Law. In Legal Knowledge and Information Systems. Jurix 2006: The Nineteenth Annual Conference (Frontiers in Artificial Intelligence and Applications)</source>
          , Tom M van Engers (Ed.), Vol.
          <volume>152</volume>
          . IOS Press,
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          . http://www.leibnizcenter.org/docs/demaat/DeMaat-Jurix2006.pdf
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Dozier</surname>
          </string-name>
          , Ravikumar Kondadadi, Marc Light, Arun Vachher, Sriharsha Veeramachaneni, and
          <string-name>
            <given-names>Ramdev</given-names>
            <surname>Wudali</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Named entity recognition and resolution in legal text</article-title>
          .
          <source>In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , Vol.
          <volume>6036</volume>
          LNAI. https://doi.org/10.1007/978-3-642-12837-0_2
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.R.</given-names>
            <surname>Favretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Martelli</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Words from Bononia Legal Corpus</article-title>
          .
          <source>International Journal of Corpus Linguistics</source>
          <volume>6</volume>
          ,
          <issue>1</issue>
          (
          <year>2007</year>
          ). https://doi.org/10.1075/ijcl.6.3.03ros
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>John</surname>
            <given-names>Garofalakis</given-names>
          </string-name>
          , Konstantinos Plessas, Athanasios Plessas, and
          <string-name>
            <given-names>Panoraia</given-names>
            <surname>Spiliopoulou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Modelling Legal Documents for Their Exploitation as Open Data</article-title>
          .
          <source>In Lecture Notes in Business Information Processing</source>
          , Vol.
          <volume>353</volume>
          . https://doi.org/10.1007/978-3-030-20485-3_3
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Patrizia</surname>
            <given-names>GIAMPIERI.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>the Bolc for Legal Translations: a Trial Lesson</article-title>
          .
          <source>Comparative Legilinguistics</source>
          <volume>39</volume>
          (dec
          <year>2019</year>
          ),
          <fpage>21</fpage>
          -
          <lpage>46</lpage>
          . https://doi.org/10.14746/cl.2019.39.2
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>CHENG</surname>
            <given-names>GONG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>JIUYANG</surname>
            <given-names>TANG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SHENGWEI</surname>
            <given-names>ZHOU</given-names>
          </string-name>
          ,
          <string-name>
            <surname>ZEPENG</surname>
            <given-names>HAO</given-names>
          </string-name>
          , and
          <source>JUN WANG</source>
          .
          <year>2019</year>
          .
          <article-title>Chinese Named Entity Recognition with Bert</article-title>
          .
          <source>DEStech Transactions on Computer Science and Engineering cisnrc</source>
          (
          <year>2019</year>
          ). https://doi.org/10.12783/dtcse/cisnrc2019/33299
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Claire</surname>
            <given-names>Grover</given-names>
          </string-name>
          , Ben Hachey, and
          <string-name>
            <given-names>Ian</given-names>
            <surname>Hughson</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>The HOLJ Corpus</article-title>
          .
          <source>Supporting Summarisation of Legal Texts. COLING 2004 5th International Workshop on Linguistically Interpreted Corpora</source>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Ben</given-names>
            <surname>Hachey</surname>
          </string-name>
          and
          <string-name>
            <given-names>Claire</given-names>
            <surname>Grover</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>A rhetorical status classifier for legal text summarisation</article-title>
          .
          <source>In Proceedings of the ACL-2004 Text Summarization Branches Out Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Chu</given-names>
            <surname>Ren</surname>
          </string-name>
          Huang and
          <string-name>
            <surname>Yao</surname>
            <given-names>Yao</given-names>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Corpus Linguistics</article-title>
          .
          <source>In International Encyclopedia of the Social &amp; Behavioral Sciences: Second Edition</source>
          . Elsevier Inc.,
          <fpage>949</fpage>
          -
          <lpage>953</lpage>
          . https://doi.org/10.1016/B978-0-08-097086-8.52004-2
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Khalid</given-names>
            <surname>Al-Kofahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alex</given-names>
            <surname>Tyrrell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arun</given-names>
            <surname>Vachher</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Information extraction from case law and retrieval of prior cases</article-title>
          .
          <source>In Artificial Intelligence</source>
          , Vol.
          <volume>150</volume>
          . https://doi.org/10.1016/S0004-3702(03)00106-1
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Binart</given-names>
            <surname>Kachule</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amelia</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Understanding the Factors affecting the Utilisation of the Case Management System of the Malawi Judiciary</article-title>
          . Conference: EGPA 2018, EGPA study group XVIII on justice and court administration, Lausanne, Switzerland.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Marios</surname>
            <given-names>Koniaris</given-names>
          </string-name>
          , George Papastefanatos, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Solon: A holistic approach for modelling, managing and mining legal sources</article-title>
          .
          <source>Algorithms</source>
          <volume>11</volume>
          , 12 (dec
          <year>2018</year>
          ). https://doi.org/10.3390/a11120196
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Marios</surname>
            <given-names>Koniaris</given-names>
          </string-name>
          , George Papastefanatos, and
          <string-name>
            <given-names>Yannis</given-names>
            <surname>Vassiliou</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Towards automatic structuring and semantic indexing of legal documents</article-title>
          . In ACM International Conference Proceeding Series.
          <article-title>Association for Computing Machinery</article-title>
          . https://doi.org/10.1145/3003733.3003801
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Paola</given-names>
            <surname>Mariani</surname>
          </string-name>
          and
          <string-name>
            <given-names>Costanza</given-names>
            <surname>Badii</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Methods and techniques for building a digital historic-law dictionary</article-title>
          .
          <source>In Proceedings of the International Conference on Artificial Intelligence and Law</source>
          .
          <fpage>230</fpage>
          -
          <lpage>231</lpage>
          . https://doi.org/10.1145/1165485.1165523
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>James</surname>
            <given-names>C Phillips</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Jesse</given-names>
            <surname>Egbert</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Advancing Law and Corpus Linguistics: Importing Principles and Practices from Survey and Content-Analysis Methodologies to Improve Corpus Design and Analysis</article-title>
          .
          <source>Brigham Young University Law Review</source>
          <year>2017</year>
          ,
          <volume>6</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Gianluca</given-names>
            <surname>Pontrandolfo</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Legal Corpora: an overview</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Peter</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Raper</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Indigenous common names and toponyms in Southern Africa</article-title>
          .
          <source>Names</source>
          <volume>65</volume>
          ,
          <issue>4</issue>
          (
          <year>2017</year>
          ),
          <fpage>194</fpage>
          -
          <lpage>203</lpage>
          . https://doi.org/10.1080/00277738.2017.1369742
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Shahmin</surname>
            <given-names>Sharafat</given-names>
          </string-name>
          , Zara Nasar, and Syed Waqar Jafry.
          <year>2019</year>
          .
          <article-title>Data mining for smart legal systems</article-title>
          .
          <source>Computers &amp; Electrical Engineering</source>
          <volume>78</volume>
          (sep
          <year>2019</year>
          ),
          <fpage>328</fpage>
          -
          <lpage>342</lpage>
          . https://doi.org/10.1016/J.COMPELECENG.2019.07.017
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Friedemann</surname>
            <given-names>Vogel</given-names>
          </string-name>
          , Hanjo Hamann, and Isabelle Gauer.
          <year>2018</year>
          .
          <article-title>Computer-Assisted Legal Linguistics: Corpus Analysis as a New Tool for Legal Studies</article-title>
          .
          <source>Law and Social Inquiry</source>
          <volume>43</volume>
          ,
          <issue>4</issue>
          (
          <year>2018</year>
          ). https://doi.org/10.1111/lsi.12305
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>Chunyu</surname>
            <given-names>Xia</given-names>
          </string-name>
          , Tieke He,
          <string-name>
            <given-names>Wenlong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zemin</given-names>
            <surname>Qin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Zhipeng</given-names>
            <surname>Zou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Similarity Analysis of Law Documents Based on Word2vec</article-title>
          .
          <source>In Proceedings - Companion of the 19th IEEE International Conference on Software Quality, Reliability and Security</source>
          , QRS-C
          <year>2019</year>
          . https://doi.org/10.1109/QRS-C.2019.00072
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>