MWCC: A Corpus of Malawi Criminal Cases
                                                                         Amelia V. Taylor
                                                               ataylor@poly.ac.mw
                                         University of Malawi, The Polytechnic and tNyasa Ltd, Data Labs
                                                                 Blantyre, Malawi

ABSTRACT                                                                               (II) that provides a useful classification of the judgments for legal
We describe the creation of a corpus of criminal court judgments                       research.
issued by the Malawian courts. We highlight opportunities and                              In this paper we describe the creation of the corpus used in
challenges in machine understanding of this text.                                      these two tasks and our results regarding the first. The paper is
                                                                                       structured as follows. In Section 2 we review relevant literature.
KEYWORDS                                                                               In Section 3 we describe the steps we took in creating the Malawi
                                                                                       Criminal Cases Corpus (MWCC) and discuss adding markup to
Legal corpus, Entity recognition, Text annotation and markup
                                                                                       the files. In Section 4 we describe the types of annotations of law
ACM Reference Format:                                                                  and case citations we added to the corpus. In Section 5, by means
Amelia V. Taylor. 2020. MWCC: A Corpus of Malawi Criminal Cases. In                    of examples from our corpus we illustrate challenges in machine
Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop,             understanding of legal text. We conclude in Section 6.
24 August 2020, San Diego, US. ACM, New York, NY, USA, 9 pages.
                                                                                       2     LITERATURE REVIEW
1     INTRODUCTION
                                                                                       A corpus is ’a collection of examples of language in use that are
This article presents the creation of a corpus of criminal case judg-                  selected and compiled in a principled way’ [16]. A list of corpora
ments issued by appellate courts in Malawi and our experiments in                      containing legal text is given in [23, 26]. These vary in size and
preparing this text to be used with machine learning algorithms.                       genre coverage 3 . A small number of the corpora listed specialise in
   In Malawi, legal researchers face significant challenges in                         criminal judgments. However, these are not available or not regularly
accessing and searching for relevant information. The Malawi Judiciary                 maintained, and seem to have been developed with only a specific
Development program, which ran over the years 2003-2008, found that                    research objective in mind: the HOLJ House of Lords Judgments
“an inadequate provision of fundamental legal resources, such as                       Corpus is a small subset, containing 188 texts, of the collection of
books, case reports, statute books and gazettes, greatly constrains                    the House of Lords Judgments and was used for summarisation and
the performance of the judiciary in its administration of justice”. In                 rhetorical structural annotation [14, 15]. The Corpus de Sentencias
2013, the Malawi Judiciary, with funding from the European Union,                      Penales 2005 - 2010 was used to study ’legal phraseology’ [26].
introduced a case management system for use in the High Court and                         There are also clusters of research around some corpora, e.g.,
the Director for Public Prosecution [6, 18]. This new system has                       corpora of Italian legal text have been used in generating dictionar-
improved the case registration process but suffers from bottlenecks                    ies of legal terms [21], in analysing their usage [10], and to assist
in processes and document logging; few case documents and final                        in translations [12]. Similarly, a corpus of Dutch legislation was
judgments are stored on the system and most of these contain no                        annotated using the Metalex XML scheme 4 and then enhanced
meta-data [18].                                                                        with meta-data regarding the document structure, external
   In the last few years, MalawiLII 1 has provided online access to some               descriptive data and law citations meta-data [8]. Corpora of Greek Tax
of the court judgments, laws and statutes in Malawi 2 . MalawiLII                      Legislation [19, 20] and Greek Supreme Court decisions [11] were
does not support a system of citation that makes it possible to link                   enhanced with XML structural mark-up and annotations. These
statutory law, case law and secondary law or to search by “legal                       projects did not use machine learning but made use of linguistic
terms” and their specific interpretations.                                             features of the text, regular expressions or syntactic parsers and
   In view of these challenges, we started the development of an                       grammars for data extraction.
automatic tool that (I) provides meta-data for criminal court judg-                       The process of machine understanding of legal text involves a
ments on MalawiLII by demarcating their text into components                           great deal of semantic enhancement, in order to make explicit or
such as headers, introduction, body and conclusion and extracting                      machine understandable ’the flexibility, intuition and capabilities
meta-data such as names of judges, dates, court of hearing; and                        of the conceptual structures of the human languages’ in readiness
1 malawilii.org                                                                        for Web 4.0 [5]. The reality is that most legal text that is being
2 Albeit not complete or up to date, this is an improved successor to the listing of   made available online at present is in an unstructured form, i.e.,
judgments that was initially provided on the SDNP webpage, which listed judgments,
court documents, some of the laws and the Constitution of Malawi:                      3 Some contain a wide range of types of text, e.g., academic journals, textbooks,
http://www.sdnp.org.mw/index-archived.php
                                                                                       contracts, opinions and legislation (e.g., the American Law Corpus); some contain only law
                                                                                       reports but cover several types of cases, from administrative to criminal (e.g., the
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons   Corpus of US Supreme Court Opinions); some contain legal text of historical importance
License Attribution 4.0 International (CC BY 4.0).                                     (e.g., the Old Bailey corpus, a historical corpus covering 197,745 texts over
NLLP @ KDD 2020, August 24th, San Diego, US                                            1674 - 1913); and some are multi-lingual, like those covering European legislation (e.g.,
© 2020 Copyright held by the owner/author(s).                                          JRC-Acquis, Bononia Legal Corpus).
                                                                                       4 http://www.metalex.eu/


has almost no meta-data and no annotations that would make it possible    describe some of our experiments in adding markup and
to hyperlink it and make it ’machine understandable’. Hence, current work is   annotations to the judgments. By means of examples from our corpus, we
still largely focussed on taking a collection of unstructured text and    reflect on the importance of cooperation between linguistic and
adding markup and annotations. As we are dealing with written text,       machine learning expertise in putting together legal corpora to
there is already an inherent organisation within the text itself. So in   solve challenges in machine understanding of the legal text. For
this sense, ’adding structure’ means to extract from or externalise       reasons of space, we placed some important terminology
this organisation in the form of markup or annotations on the text.       definitions in Appendix A.
We included a discussion on this terminology in Appendix A. The
degree of ’organisation’ in legal text varies a lot: text containing      3     THE MALAWI CRIMINAL CASES CORPUS
legislation is said to be more organised than that of court judgments,          (MWCC)
and notarial contracts [4] are more organised than legislative texts,     We followed the guidelines of [22] in creating the corpus.
and among legislations, those referring to tax and administrative
law [8] are more structured.                                              3.1    The Target Domain for MWCC Corpus
    In the case of court judgments, there are differences in how
courts within the same country and/or courts in different countries       The data for the MWCC corpus are the criminal case judgments stored
structure the text. However, all court judgments share common             as electronic doc and pdf files and as scanned images. These are obtained from
elements. They all contain, usually in their introduction, information    the High Court Library. The librarian scans the physical judgments
such as the courts of hearing, dates and case numbering or docket         received from the High Court Registry, page by page, and stores
numbers, names of the judges and other legal parties involved.            them as pdf files. The physical papers are then catalogued in folders
They all follow a certain legal rhetoric in which facts are presented,    by year, and some of the scanned judgments are sent to law firms,
points of law are discussed and finally the judgment is concluded.        judges and other parties which subscribe to this service. These
There are also common conventions that are used for citing laws           are also uploaded to MalawiLII. The electronic scans have been
and other cases. Some of these regularities were used to develop          named by the High Court Librarian according to a convention:
and test algorithms that employ machine learning techniques to            [Case Name] [Case Type] [Case Number] [Case Year]. For example,
’understand’ legal text, e.g., resolution of names of legal parties       Lawrence Chibwana Vs The State Criminal Appeal No. 42 of 2010.pdf.
such as judges [9], resolution of citations to laws or other cases        In some cases the name of the judge is also present in the title. In
[17], extracting citations to laws [25], automatic summarization of       some cases, the naming of files does not correspond to their content,
court judgments [7, 14].                                                  or names of parties have been misspelled.
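
The file naming convention described in Section 3.1 lends itself to a simple pattern match. The following is a minimal sketch, assuming that file names follow the pattern [Case Name] [Case Type] No. [Case Number] of [Case Year]; the list of case types and the function name parse_case_filename are illustrative assumptions and not part of the MWCC tooling. Names that do not follow the convention (e.g., misspelled or mislabelled files) simply yield None and would need manual review.

```python
import re
from typing import Optional

# Illustrative assumption: the case types below are examples only and
# would need to be extended to match the actual file names.
CASE_TYPES = ["Criminal Appeal", "Confirmation Case", "Criminal Case"]

FILENAME_RE = re.compile(
    r"^(?P<name>.+?)\s+"                                   # [Case Name]
    r"(?P<type>" + "|".join(re.escape(t) for t in CASE_TYPES) + r")\s+"
    r"No\.?\s*(?P<number>\d+)\s+of\s+(?P<year>\d{4})"      # No. N of YYYY
    r"(?:\.pdf)?$",
    re.IGNORECASE,
)


def parse_case_filename(filename: str) -> Optional[dict]:
    """Split a file name into case name, type, number and year.

    Returns None when the name does not follow the convention.
    """
    match = FILENAME_RE.match(filename.strip())
    if match is None:
        return None
    return {
        "case_name": match.group("name"),
        "case_type": match.group("type"),
        "case_number": int(match.group("number")),
        "case_year": int(match.group("year")),
    }


if __name__ == "__main__":
    example = "Lawrence Chibwana Vs The State Criminal Appeal No. 42 of 2010.pdf"
    print(parse_case_filename(example))
```

The parsed fields could then seed a case citator table or be cross-checked against an existing one.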
    The trend seems to be that researchers collect their own data            The names of cases as retrieved from file names can be used to
and use that to develop or test algorithms; a particular data set may     create a case citator database or, if one exists, to cross-check them
never be used in another study. Noting this trend, [22] sets out a        against it. To our knowledge, the Malawi High Court Library
best-practice guide for the collection and analysis of legal corpora      does not systematically maintain a case citator database. For legal
for linguistic analysis to ensure a certain degree of generality of the   researchers, it is important to know which of these cases have
research results found when using a custom corpus. Generalisation         been reported in official law reports, as these receive a special
issues may come from the impact that the genre of the text within a       naming convention. Identifying a citation is only useful if it can be
corpus has, for example, on the task of assigning meaning to terms,       ’resolved’ and matched against an external knowledge source. A
e.g., collocates of "breach" across different corpora may belong to       manual search for prior cases typically involves formulating a query
different definitions/meanings of the word.                               (using party names, dates, docket numbers, and courts), retrieving
    We think that there is a need to set out similar guidelines for the   documents from a database of millions of opinions, and iterating the
use of legal corpora for machine learning purposes. For machine           process until the right cases are found. Challenges in case name
learning algorithms, generalisation challenges can be even less           resolution were discussed in [17], where the authors describe the
obvious because of the interplay between the impact of the language       development of a tool that provides automated assistance to the
models used and of the differences in size and type or ’genre’ of the     citators of Thomson Legal and Regulatory. In some cases, a
training data versus the test data. An experiment that measures text      citation cannot be resolved if there is insufficient data in the context
similarity of legal documents [27] showed that a word2vec model           or if the judge refers to case documents that are not available or
was better than a bag of words model and that the size of the training    numbered (e.g., references to affidavit documents attached to the
data compared to that of the corpus impacts the accuracy of the           case).
similarity results. However, to explain these results the authors
spend very little time describing their data apart from describing it
                                                                          3.2    The Design and Collection of the MWCC
as ’selected larceny cases’.                                                     Corpus
    Putting together a corpus takes significant time and involves a       We collected 682 criminal court judgments issued over 2010-2019
diversity of linguistic and computing skills. In building the MWCC        by the High Court and the Supreme Court of Malawi. These were
corpus we tried to ensure a certain ’separation of the corpus design’     stored as scanned images of physical documents. The files were
from research design in order to ensure that other researchers will       roughly organised on disk according to the year in which they
use our corpus. We present the construction of a corpus of Malawi         were issued. The steps we took in the preparation of the text for
criminal case judgments from a set of ’unstructured’ text files and       the MWCC corpus are: (I) File cataloguing: rename the files with
                                                                          shorter names, remove special symbols, and maintain a mapping


                                                                            Another challenge was the frequent use of quotations, where
      Figure 1: Example of footnotes in court judgments.                a judge was discussing points relevant to the case at hand using
                                                                        extracts from law or from relevant cases. Some quotations used
                                                                        block quotes or other quotation marks. Others used indentation,
                                                                        italics or syntactical clues by the use of specific keywords that
                                                                        indicate their presence. It may be beneficial to use extra processing
                                                                        steps (e.g., using Tesseract 6 ) to identify the presence of quotes in
                                                                        the text and to mark these as special parts of the text flow.
                                                                            The electronic files of the corpus are structured into folders, one
                                                                        for each year. Each judgment has three files: a text file for the
                                                                        introduction, a text file for the body with each paragraph
                                                                        on one line, and a TEI XML file with markup and judgment paragraphs.
                                                                        We also have a separate file that maps the name of each file in the
                                                                        corpus to the name of the raw data file.
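
The file cataloguing step and the mapping file described in this section can be sketched as follows. This is a minimal illustration, not the code used to build the MWCC: the naming scheme (year, running index, shortened slug), the function names sanitise, build_mapping and write_mapping, and the CSV column names are all assumptions made for the example.

```python
import csv
import io
import re
import unicodedata


def sanitise(stem: str, max_len: int = 40) -> str:
    """Remove special symbols from a raw file stem and shorten it."""
    ascii_stem = (
        unicodedata.normalize("NFKD", stem)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    slug = re.sub(r"[^A-Za-z0-9]+", "_", ascii_stem).strip("_").lower()
    return slug[:max_len].rstrip("_")


def build_mapping(raw_names, year):
    """Return (corpus_name, raw_name) pairs for one yearly folder.

    The running index keeps corpus names unique even when two raw
    names collapse to the same slug (hypothetical naming scheme).
    """
    mapping = []
    for index, raw in enumerate(sorted(raw_names), start=1):
        # Drop only known document extensions, so that dots inside the
        # case name (e.g. "No. 42") are left alone.
        stem = re.sub(r"\.(pdf|docx?)$", "", raw, flags=re.IGNORECASE)
        mapping.append((f"{year}_{index:03d}_{sanitise(stem)}", raw))
    return mapping


def write_mapping(mapping, handle):
    """Write the mapping as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["corpus_name", "raw_name"])
    writer.writerows(mapping)


if __name__ == "__main__":
    rows = build_mapping(["A & B.pdf"], 2010)
    buffer = io.StringIO()
    write_mapping(rows, buffer)
    print(buffer.getvalue())
```

Keeping the raw name in the mapping means any corpus file can be traced back to its original scan.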

                                                                        3.3     MWCC Corpus Statistics
for the naming. There was also a need to correct misspellings of        We can describe our corpus according to the criteria in [2] as a
names of parties present in the title and to remove duplicate files.    full text (each text in the corpus is unabridged), synchronic (covers
In some cases several judgments were scanned together                   the period 2010 - 2019 and hence there is not a ’noticeable’ change
and saved in a single file; these we had to split. (II) Image           over this period in the way language is used or any change in the
adjustments: straighten, remove watermarks, remove imperfections        vocabulary used), terminological (our text contains both general
due to the scanning process; (III) Batch OCR: Run page by page          and specific legal terms), monolingual (but containing names of
OCR obtaining text corresponding to each line (word by word) in         people, organisation, geographical places that are typical of Malawi).
the image, saving this in json files which also contain some text       The corpus contains 1,572,956 tokens, 1,374,635 words (a word may
formatting information, such as distances between lines, and font       appear more than once), 63,574 sentences and 22,124 paragraphs
sizes; (IV) Text Reconstruction and Corpus Creation: Reconstruct        extracted from 682 documents. There are 29,238 unique words, with
the text from the files obtained by OCR and create the corpus files     a lexical variation of 2.1%. Table 1 shows a breakdown of cases per
in text and XML format. We used OpenCV in Python to deal with           top 10 judges and Table 2 shows the breakdown of cases per year
watermarks and markings on the text, wrote a Python                     and shows sizes of yearly sub-corpora.
batch program to split and merge back the images, used the ocr.space       We used Sketchengine 7 to analyse the corpus in terms of part
API 5 for the OCR on the images, and then used custom Python code       of speech tags, word lists and collocations. Table 3 shows the main
to process the JSON files returned by the OCR API.                      part of speech frequencies for words that appear at least 5 times
   The image preparation stage could be improved by using               and excluding non-words. These represent 80% of our corpus. The
techniques for automatically detecting image features which, if known   percentage distributions are calculated on the whole corpus. Nouns,
in advance, can be useful for improving the quality of the OCR:         verbs and prepositions appear quite frequently. We also notice
most judgments contain official stamps, some outside the text and some  frequent use of adjectives; here are the top fifteen adjectives: criminal,
on top of the text, and most contain signatures of the judges or official   other, low, such, first, guilty, same, unreported, reasonable, maximum,
clerks. These can be isolated or removed before the OCR.                convict, public, present, appropriate, excessive. Several of these are
   The trickiest part of the OCR process on these judgments was         specific to legal language. The top ten most frequent nouns
the presence of headers, footers and footnotes. The headers usually     are all specific to legal language: court, sentence, case, evidence,
contained pagination and/or name of the case contained in the           offence, appellant, section, person, court, theft.
document. The header could not always be removed automatically             Using Sketchengine we could also analyse the language used
based on text features, such as font size or distance to the main       in our corpus compared to other corpora. In particular, we can
body of the text, as in many cases the font was the same and the        look at corpora built for general English language use, such as
headers were so close to the main text of the judgment as to appear     the English Web corpus 2013, an English corpus made up of texts
as a normal part of the text. The footnotes also cannot be removed      collected from the Internet, containing 15 billion words. Compared
automatically because they contain relevant legal information. The      to this corpus, in ours, we see a much heavier use of prepositions
footnote example in Figure 1 contains several case citations, e.g.,     and a lesser use of verbs compared to nouns. We can also find those
[1994] MLR 288 (HC) at 307. This is an incomplete citation where        n-grams or multi-words which appear frequently in our corpus and
one part, the case name, is in the main judgment text and the case      very rarely in the comparison corpus, such as criminal procedure,
citation is in the footnote. The ocr.space API extracts all textual     hard labour, maximum sentence, theft simpliciter, first offender, ac-
information including the footnotes but these are not distinguished     cused person, reasonable doubt. Such a comparison can be used to
from the rest of the text. Heuristics based on structural information   extract features useful for a machine learning classification or topic
such as indentation, differences in font sizes, distances from the      extraction analysis.
main text, could be used to recognise footnotes with some success.
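
The footnote heuristics mentioned above could take roughly the following form. This is a hedged sketch only: it assumes each OCR line has already been reduced to a record with the hypothetical keys 'text', 'top' (distance from the top of the page) and 'height' (a proxy for font size), and the two thresholds are illustrative values that would need tuning on real scans.

```python
from statistics import median


def flag_footnotes(lines, page_height, size_ratio=0.85, lower_fraction=0.7):
    """Mark lines that look like footnotes.

    A line is flagged when it is noticeably smaller than the typical
    (median) line on the page AND sits in the lower part of the page.
    Both thresholds are assumptions for illustration.
    """
    if not lines:
        return []
    body_height = median(line["height"] for line in lines)
    flagged = []
    for line in lines:
        is_small = line["height"] < size_ratio * body_height
        is_low = line["top"] > lower_fraction * page_height
        flagged.append({**line, "footnote": is_small and is_low})
    return flagged


if __name__ == "__main__":
    page = [
        {"text": "The appellant was convicted of theft.", "top": 100, "height": 20},
        {"text": "He appealed against the sentence.", "top": 200, "height": 20},
        {"text": "The court considered the record.", "top": 300, "height": 20},
        {"text": "1 [1994] MLR 288 (HC) at 307", "top": 900, "height": 14},
    ]
    for record in flag_footnotes(page, page_height=1000):
        print(record["footnote"], record["text"])
```

Flagged lines would be kept and marked rather than discarded, since footnotes carry case citations.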
                                                                        6 https://github.com/tesseract-ocr/tesseract
5 https://ocr.space                                                     7 https://sketchengine.eu
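
The lexical variation figure reported in Section 3.3 appears consistent with the ratio of unique words to total words. The short check below reproduces it from the counts given in the text; the function name lexical_variation is ours, and reading the figure as a type/token ratio is an assumption, since the exact formula is not stated.

```python
def lexical_variation(unique_words: int, total_words: int) -> float:
    """Unique words as a percentage of all words (type/token ratio)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * unique_words / total_words


if __name__ == "__main__":
    # Counts taken from Section 3.3: 29,238 unique words, 1,374,635 words.
    print(f"{lexical_variation(29_238, 1_374_635):.1f}%")
```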


Table 1: Malawi Criminal Cases in MWCC by top 10 judges
(out of a total of 35 judges) in order of number of judgments
issued.

            Judge Name                             No. Cases

            CHIRWA, J. M.                          106
            KAMANGA-NYAKAUNDA, D.                  65
            KAMWAMBE, M.L.                         71
            KALEMBERA, S.A.                        25
            MADISE, D.T.K.                         45
            MBVUNDULA, R.                          28
            MWAUNGULU, D.F.                        81
            NYERENDA, K.                           51
            SIKWESE, R. S.                         37
            Percentage of Total (627/682)          92%

         Figure 2: Introduction Part for Judgment 1 of 2013

                              JUDICIARY
                     IN THE HIGH COURT OF MALAWI
                          PRINCIPAL REGISTRY
                   CONFIRMATION CASE NO 689 OF 2013
      Being Criminal Case No. 719 of 2013 from the Second Grade
                Magistrate Court Sitting at Chikhwawa
                             THE REPUBLIC
                                Versus
                             MATEYU THOM
              THE HONOURABLE JUSTICE KENYATTA NYIRENDA
        Margaret Munthali, Senior State Advocate, for the State
                 Accused person, Absent/unrepresented
               Mrs. D. Mtegha, Official Court Interpreter
                        ORDER IN CONFIRMATION

         Table 2: Composition of the MWCC by year

 Year     No. Cases     Tokens        No. Parag.    Max. Avg. Parag. Len.
 2010     85            162,960       2,096         232
 2011     72            155,154       2,959         131
 2012     20            54,149        720           189
 2013     162           426,584       6,840         200
 2014     85            141,115       2,066         96
 2015     122           274,583       3,538         131
 2016     46            106,069       1,273         128
 2017     27            52,038        810           42
 2018     42            153,572       1,454         157
 2019     21            46,732        368           223

 Total    682           1,572,956     22,124        232

analysis. Concordances are useful in finding out relevant connections
between words (modifiers of specific words) and also to reveal
multi-word units, e.g., detecting names of organisations, High
Court of Malawi, or names of legal functions such as court clerk,
attorney to state council, etc.
   A collocation is a sequence or a combination of words that occur
together more often than would be expected by chance. The
strength of collocation is measured by the LogDice score (the higher
the score, the stronger the collocation). Word collocations can help
understand the usage pattern of key legal terms, e.g., top modifiers
of murder as a verb are brutally, mercilessly, allegedly. These can
indicate the seriousness of the crime and/or the intention. The
collocates of crime are consequence, offender, alibi, criminal, circumstance
and the word ’criminal’ has the strongest collocations with
dangerous, hardened, unknown, hardcore, habitual. Collocates for key
legal terms can be used in topic extraction and the classification of
judgments.
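
For readers unfamiliar with the LogDice score, the sketch below computes it from raw frequencies using the standard definition, logDice = 14 + log2(2 f_xy / (f_x + f_y)), where f_xy is the co-occurrence frequency of the two words and f_x, f_y are their individual frequencies. The frequencies in the usage example are invented for illustration and are not taken from the MWCC.

```python
import math


def log_dice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice association score for a word pair.

    f_xy: how often the two words co-occur;
    f_x, f_y: how often each word occurs on its own.
    The theoretical maximum is 14 (the words only ever occur together).
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(log_dice(f_xy=30, f_x=400, f_y=120), 2))
```

Because the score depends only on the frequencies of the two words and of the pair, and not on corpus size, scores can be compared across the yearly sub-corpora.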
Table 3: Main Parts-of-Speech (items with frequencies
higher than 5). These represent 80% of our corpus.

  Part of Speech     No. Items (Lemmas)        Freq.      Distribution

  Noun               3,959                     393,777    29%
  Verb               1,247                     231,126    17%
  Adjectives         953                       75,894     6%
  Adverbs            448                       58,793     4%
  Prepositions       81                        223,172    16%
  Conjunctions       13                        43,975     3%
  Pronouns           29                        53,546     4%
  Numerals           27                        7,702      1%

3.4    Adding Structural Markup to the MWCC Corpus
We have two formats for the files of the corpus: (a) an all text format
and (b) an XML TEI format 8 . All judgments contain a front cover
with information on the parties, the court of hearing, the dates
and number of the case, and the coram who heard the case (which includes
the judge, attorneys and other judicial clerks). It is possible to
automatically separate this part from the main body of the judgment.
In the text-only format of the corpus, we keep separate files for the
introduction (an example is given in Figure 2) and separate files
for the paragraphs of the body of the judgment, with each paragraph
stored on one line of text.
                                                                               We based our separation of the introduction from the rest of the
                                                                            body on an algorithm that (a) looks for the presence of specific
                                                                            terms such as ORDER IN CONFIRMATION, RULING and (b) uses
    There is also language that is particular to certain judges, e.g.,
                                                                            formatting differences such as distances between lines of text used
theft simpliciter is used mainly by judge D F MWAUNGULU.
                                                                            in the introduction versus the rest of the text.
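
Part (a) of the separation algorithm described above can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the judgment is already available as a list of text lines, the heading terms are matched as whole upper-case lines, and JUDGMENT is added to the list as our own guess. The formatting-based criterion (b) is not covered here.

```python
# ORDER IN CONFIRMATION and RULING are the terms named in the text;
# JUDGMENT is an assumed extra term, added for illustration only.
HEADING_TERMS = ("ORDER IN CONFIRMATION", "RULING", "JUDGMENT")


def split_introduction(lines):
    """Split a judgment into (introduction, body) at the first heading term.

    The heading line itself is kept as the last line of the introduction.
    If no heading term is found, the introduction is empty and all lines
    are returned as the body.
    """
    for index, line in enumerate(lines):
        if line.strip() in HEADING_TERMS:
            return lines[: index + 1], lines[index + 1 :]
    return [], list(lines)


if __name__ == "__main__":
    judgment = [
        "IN THE HIGH COURT OF MALAWI",
        "PRINCIPAL REGISTRY",
        "CONFIRMATION CASE NO 689 OF 2013",
        "THE REPUBLIC",
        "Versus",
        "MATEYU THOM",
        "ORDER IN CONFIRMATION",
        "First paragraph of the body.",
    ]
    introduction, body = split_introduction(judgment)
    print(len(introduction), len(body))
```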
    While word-lists and lists of keywords give us some useful sta-
                                                                               Subcorpus of Introductions We thus obtained a sub-corpus made
tistics about the composition of our corpus, they do not take into
                                                                            up of only introduction parts of the judgments. Out of that, we
account the context in which terms occur. When looking at a spe-
                                                                            created a dictionary of legal keywords from all introductions (Table
cific sequence of tokens/ words, the context surrounding a keyword
is important. Such an analysis is called concordance or collocation         8 https://tei-c.org/


4 Appendix B), which was then used to extract the legal parties          The names used in Malawi are of Bantu origin [24] with European
involved in a case, such as the names of the parties, the judge, etc. This   influences; hence sometimes parts of names are recognised while
external meta-data was then added into the XML version of the            others are not. Names of people frequently appear in our text. We
corpus as meta-data for each judgment. An example of this                will annotate our text with Bantu names of people and places. We
meta-data is given in Appendix B.                                        think that the MWCC can be used for building a training set of
   While our approach did not involve machine learning, there is         typical Bantu names to be used with recent advances in BERT and
scope to use our sub-corpus to test supervised learning approaches       transformers. For example, [1] used the BERT model to recognise
to extract this information. In [3], the authors did something similar   names of entities in Bulgarian, Czech and Polish and in [13] BERT
to us in the sense that they extracted formatting features which         was used to recognise Chinese names.
were later used in a supervised algorithm for extracting headings
from pdf documents.                                                      4  ADDING ANNOTATIONS TO THE MWCC
                                                                            CORPUS
3.5    Chunking and Proper Names Recognition                             4.1 Law Citations
Chunking poses many challenges. Some judgments are very long             There are several types of reference to laws found in our text. For
and may contain long paragraphs. Table 2 gives an indication of the      example, references containing only the name of the law/statue
maximum average length of judgments per year: ranging from 90 to         The following offences involving dishonesty in the Penal Code are
over 200 tokens. We debated whether to store the text line by line,      based on circumstances.... or ...the Control of Goods Act derives its
to split it into sentences using punctuation or to group the text in     procedure in criminal matters from the Criminal Procedures and
the same logical paragraphs as they were in the original images. We      Evidence Code.
opted for the latter. We wanted to make sure we capture situations           There are references containing labels and names of the law
in which entities of interest break across lines. For example, in some   Section 11 (2) of the Supreme Court of Appeal Act. or Section 283 of
case citations, one line may contain the names of the parties and        the Penal Code.
another line, the court and dates. We used a heuristic based on the          There are more complex types such as references by means of
distances between lines to re-arrange the text to match the original     anaphors spanning more than one line, or sentence, or paragraph.
paragraphs. We did not use punctuation to split into sentences           Section 12 of the Act...
because the text contained many ’entities’ or elements which make        section of the same constitution ...
use of full-stops, e.g., numbers, references to sections of law.         ...in the Penal code...theft from a person (section 282(a)); theft from a
    We used the POS tagging for extracting parts of our text which       dwelling house (section 282 (b))..
was likely to contain references to laws and cases. The English              Appendix C gives a more comprehensive list. We annotated each
TreeTagger PoS tagset used by Sketchengine struggled with proper         judgment with law citations: an example is given in Table 5 of
nouns because legal text makes use of capitalisation of many words       Appendix B.
for legal terms such as laws, e.g., Penal Code, legal parties, e.g.,
Appellant, or legal functions, e.g., Court Interpreter, references to    4.2    Case Citations
laws, e.g., "Section", or names or crimes, e.g., "Manslaughter". These   Case citations may refer to cases published in official law reports or
were usually tagged as nouns, but at times they were tagged as           to unpublished cases, each of these using different styles of citation.
proper nouns as in I/PP thus/RB convicted/VVD the/DT accused/VVN         A citation from the Malawi Law Report is:
of/IN the/DT offence/NN of/IN Manslaughter/NP contrary/NN to/IN              Republic v Chizumila and others [1994] MLR 288 (HC) at 307
Section/NP 208/CD of/IN the/DT Penal/NP Code/NP, or even verbs as
                                                                         where Republic v Chizumila and others are the parties involved
in Whereas/IN MUSATOPE/NP CHAPOTERA/NP was/VBD charged/
                                                                         (also forming the case name), 1994 is the year of publication of the
VVN with/IN the/DT offence/NN of/IN murder/NN of/IN Yohane/NP
                                                                         Malawi Law Reports, 288 is the case number and 307 is the location.
Makiyi/NP contrary/NN to/TO section/VV 209/CD of/IN the/DT Pe-
                                                                         Neutral citations were introduced in the UK in 2001 and are used
nal/NP Code/NP. In this example, "section" is not capitalised, but it
                                                                         by MalawiLII. For example, on MalawiLII the case:
is tagged as a verb possibly because of the presence of ‘to’ which
usually precedes the infinitive form of a verb. The shape NP-NP           Dalikeni and Others v The Republic (MSCA Criminal Appeal Case
is the most common for 2-grams in our text, and may correspond                                     No. 6 of 2016)
for example to names of people or places, but also to legal terms        is numbered as: Dalikeni and Others v The Republic [2019] MWSC
such as Appellant Andrew, Judge Mwase, legal bodies such as, High        8 where MWSC stands for Malawi Supreme Court and this is the
Court, or Detective Sergeant, or names of laws, e.g., Drugs Act. It is   eighth case registered on MalawiLII under this court. An example
therefore important to have a way of distinguishing these legal          of unreported case is: Republic vs Mpinganjira Bagala HC/PR confir-
terms from the rest of the text to enable a more accurate tagging.       mation case no. 24 of 2011 (unreported 11 July 2013) where HC/PR
Using a list of relevant legal keywords and their use in context, may    stands for High Court Principal Registry.
help with improving the POS tagging for legal text. We hope to              The presence of names of people or organisation means that
look at evaluating legal-specific POS tagging methods in a future        grammar rules or regular expressions cannot work on their own,
research using the MWCC.                                                 and could be combined with lookup and some form of supervised
    The names of people, places and organisations which are particu-     learning. [25] used a supervised statistical models to extract stan-
lar to Malawi are not easily recognised by existing language models.     dardised case citations of the type ’[1994] MLR 288’ from a selection
of 250 Pakistani court judgments. Their algorithms relied on training data in which case
citations were manually tagged using the Inside-Outside-Beginning (IOB) notation. In a much
larger project at Thomson Legal and Regulatory [17], a ‘citator’ database was available
(containing a list of all available names of cases) and the task was to resolve the citations
found into the citator. A Support Vector Machine (SVM) was used to improve the accuracy of the
entity (name of cases) resolution. SVMs were also used for entity resolution in [9] to match
names of judges/attorneys and names of legal firms from text files with Westlaw records of
attorney and legal firm files.
   We think that the extraction of case citations could, in some cases, be done directly from
the scanned images, as most judges use italics or bold font when writing such citations. Then, a
supervised algorithm that works on image data could be practical. However, as shown in Figure 3,
the convention used in the documents of our corpus is that only the case name is formatted
differently, not the citation component. Some citations are partial, as ‘Kachere and Nseula’
shown in the image, and need to be resolved in context. In the next section we describe our
experiments in extracting law citations.

Figure 3: Example of Case Citation formatted in Bold - containing also a partial citation which
needs resolution.

5    EXPERIMENTS WITH SPACY
Our corpus served as an excellent data set to test extracting law and case citations and to
generate test data for a supervised approach. spaCy (https://spacy.io/) is a Python library
using state of the art neural networks for tagging, parsing and entity recognition. The Named
Entity Recogniser in spaCy already has an entity for "LAW". For the English language, spaCy
provides three models of varying sizes, small (sm), medium (md) and large (lg), trained using
Convolutional Neural Networks on the OntoNotes 5.0 data set. The accuracy of the spaCy NER was
reported to be over 80% for both precision and recall.
   Our approach was as follows: we first used the standard spaCy NER to extract LAW entities,
then we added an Entity Ruler to extract additional LAW entities. For example, the pattern in
Figure 4 of Appendix D matches references to sections which use two-level numbering, such as
Section 4 (a) or s. 4 (2) or section 42(2) (f). We used a Phrase Matcher based on a database of
names of laws and statutes in Malawi to extract LAWNAMES entities. We then merged these entities
(e.g., the reference part merged with the law name) into larger ones and eliminated duplicates.
   Most of the citations that are recognised by the standard spaCy NER are of the type: Section
[number]. However, spaCy recognition depends on a uniform use of punctuation like spaces and
full stops. So for example, if there are extra spaces, e.g., Section 214 (a) instead of Section
214(a), the entity will not always be recognised. Entities of the type Sections 339 and 340 will
also not be consistently recognised.
   References to laws of England or laws that are typically found in other countries, such as
Data Protection Act or Official Secrets Act, are recognised as these were present in the model.
However, names of laws more particular to Malawi were not always recognised. Table 6 of Appendix
D shows examples of law citations extracted using spaCy and a comparison between the use of the
lg vs sm spaCy models: some entities which were found using the small model, sm, were lost when
using lg, but overall, the use of the larger model did result in a more accurate identification
of the name of the law cited.
   Table 7 of Appendix D shows the citations we were able to identify using the standard spaCy
NER and then an enhanced method using, in addition, both an Entity Ruler and a Phrase Matcher.
The use of the Phrase Matcher allowed us to extract names of laws which are specific to Malawi.
With this combination, we managed to find almost all the citations within the text. The Phrase
Matcher was used to locate the complete names of laws referred to in the citations. For example,
for the judgments of year 2010, spaCy NER managed to extract 507 valid citations (some
incomplete). Using the enhanced process we extracted in total 1,162 entities, which are
citations (e.g., Section 224 A) and names of laws (e.g., Penal Code). When merged into full
citations (e.g., Section 224 A of the Penal Code), we obtained a total of 611 citations. For the
whole corpus, spaCy extracted 7,784 law citations out of a total of 18,929 obtained by the
enhanced method. Overall, we extracted 10,390 law citations from our corpus. Thus, this process
of extracting law citations worked reasonably well and can be used in constructing a training
set of annotations for better results.
   The case and law citations are stored in separate TEI files: there is an annotation file for
each judgment file, containing the paragraph, the exact position inside the paragraph, the text
of the annotation and its type. The position of the annotations within a paragraph can also be
used to resolve incomplete citations or anaphors. Some of the citations are incomplete and do
not include the name of the law. For example, the reference section 235 (a) appears several
times in paragraphs 2 and 3, and some occurrences do not contain the name of the law. The
context of the judgment and the classification of the laws can help in topic identification,
e.g., section 235(a) of the Penal Code covers issues of causing grievous harm.

6   CONCLUSION
We described the process of creating a corpus of criminal cases issued by Malawi courts. We
reflected on the challenges and opportunities in semantically enhancing this text and the need
for an intelligent pipeline that processes the text at all stages - some of the semantic
enhancement can be done on raw images, as we discussed for case citations. We would like to use
our annotations and corpus for further training and classification.
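
The spacing problem noted in Section 5 (Section 214 (a) versus Section 214(a), and plural forms such as Sections 339 and 340) can be illustrated with a small sketch. It uses a plain regular expression from the Python standard library as a stand-in for a rule-based matcher; it is not the Entity Ruler pattern of Figure 4, and the names SECTION_REF, find_section_refs and normalise are hypothetical.

```python
import re

# Hypothetical regex approximation of a section-reference pattern; it is NOT
# the Entity Ruler pattern from Figure 4 of Appendix D.
_LEVELS = r"(?:\s*\(\s*\w+\s*\))*"               # bracketed levels: (a), (2), (1 )
SECTION_REF = re.compile(
    r"\b(?:[Ss]ections?|s\.)\s*"                 # label: Section(s), section(s) or s.
    r"\d+[A-Z]?" + _LEVELS +                     # section number plus optional levels
    r"(?:\s+and\s+\d+[A-Z]?" + _LEVELS + r")?"   # optional second number: "and 340"
)


def find_section_refs(text):
    """Return (matched text, start, end) for each section reference found."""
    return [(m.group(0), m.start(), m.end()) for m in SECTION_REF.finditer(text)]


def normalise(ref):
    """Remove spacing differences so 'Section 214 (a)' equals 'Section 214(a)'."""
    ref = re.sub(r"\s*\(\s*(\w+)\s*\)", r"(\1)", ref)
    return re.sub(r"\s+", " ", ref).strip()


sample = ("contrary to Section 214 (a) of the Penal Code; see s. 4 (2), "
          "section 42(2) (f), Section 214(a) and Sections 339 and 340.")
for ref, start, end in find_section_refs(sample):
    print(start, end, normalise(ref))
```

A pattern of this kind is tolerant of extra spaces, but it over-matches in some contexts (for example, any "and" followed by a number), which is one reason to combine rules with lookup or a trained model as discussed above.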


REFERENCES
 [1] Mikhail Arkhipov, Maria Trofimova, Yuri Kuratov, and Alexey Sorokin. 2019. Tuning
     Multilingual Transformers for Language-Specific Named Entity Recognition. Association for
     Computational Linguistics (ACL), 89–93. https://doi.org/10.18653/v1/w19-3712
 [2] Sue Atkins, Jeremy Clear, and Nicholas Ostler. 1992. Corpus design criteria. Literary and
     Linguistic Computing 7, 1 (1992). https://doi.org/10.1093/llc/7.1.1
 [3] Sahib Singh Budhiraja and Vijay Mago. 2020. A supervised learning approach for heading
     detection. Expert Systems (2020). https://doi.org/10.1111/exsy.12520
 [4] María G. Buey, Angel Luis Garrido, Carlos Bobed, and Sergio Ilarri. 2016. The AIS project:
     Boosting information extraction from legal documents by using ontologies. In ICAART 2016 -
     Proceedings of the 8th International Conference on Agents and Artificial Intelligence,
     Vol. 2. https://doi.org/10.5220/0005757204380445
 [5] Nuria Casellas. 2011. Semantic Enhancement of Legal Information. . . Are We Up for the
     Challenge? VoxPopuLII (2011).
 [6] Winner Dominic Chawinga, Chaupe, Sellina Khumbo Kapondera, George Theodore Chipeta, Felix
     Majawa, and Chimango Nyasulu. 2020. Towards e–judicial services in Malawi: Implications for
     justice delivery. 86:e12121 (2020), 1–15.
     https://onlinelibrary.wiley.com/doi/epdf/10.1002/isd2.12121
 [7] Min Yuh Day and Chao Yu Chen. 2018. Artificial intelligence for automatic text
     summarization. In Proceedings - 2018 IEEE 19th International Conference on Information
     Reuse and Integration for Data Science, IRI 2018. Institute of Electrical and Electronics
     Engineers Inc., 478–484. https://doi.org/10.1109/IRI.2018.00076
 [8] Emile de Maat, Radboud Winkels, and Tom van Engers. 2006. Automated Detection of Reference
     Structures in Law. In Legal Knowledge and Information Systems. Jurix 2006: The Nineteenth
     Annual Conference (Frontiers in Artificial Intelligence and Applications), Tom M van Engers
     (Ed.), Vol. 152. IOS Press, 41–50.
     http://www.leibnizcenter.org/docs/demaat/DeMaat-Jurix2006.pdf
 [9] Christopher Dozier, Ravikumar Kondadadi, Marc Light, Arun Vachher, Sriharsha
     Veeramachaneni, and Ramdev Wudali. 2010. Named entity recognition and resolution in legal
     text. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial
     Intelligence and Lecture Notes in Bioinformatics), Vol. 6036 LNAI.
     https://doi.org/10.1007/978-3-642-12837-0_2
[10] R.R. Favretti, F. Tamburini, and E. Martelli. 2007. Words from Bononia Legal Corpus.
     International Journal of Corpus Linguistics 6, 1 (2007).
     https://doi.org/10.1075/ijcl.6.3.03ros
[11] John Garofalakis, Konstantinos Plessas, Athanasios Plessas, and Panoraia Spiliopoulou.
     2019. Modelling Legal Documents for Their Exploitation as Open Data. In Lecture Notes in
     Business Information Processing, Vol. 353. https://doi.org/10.1007/978-3-030-20485-3_3
[12] Patrizia Giampieri. 2019. The BoLC for Legal Translations: A Trial Lesson. Comparative
     Legilinguistics 39 (dec 2019), 21–46. https://doi.org/10.14746/cl.2019.39.2
[13] Cheng Gong, Jiuyang Tang, Shengwei Zhou, Zepeng Hao, and Jun Wang. 2019. Chinese Named
     Entity Recognition with Bert. DEStech Transactions on Computer Science and Engineering
     cisnrc (2019). https://doi.org/10.12783/dtcse/cisnrc2019/33299
[14] Claire Grover, Ben Hachey, and Ian Hughson. 2004. The HOLJ Corpus. Supporting Summarisation
     of Legal Texts. COLING 2004 5th International Workshop on Linguistically Interpreted
     Corpora (2004).
[15] Ben Hachey and Claire Grover. 2004. A rhetorical status classifier for legal text
     summarisation. In Proceedings of the ACL-2004 Text Summarization Branches Out Workshop.
[16] Chu Ren Huang and Yao Yao. 2015. Corpus Linguistics. In International Encyclopedia of the
     Social & Behavioral Sciences: Second Edition. Elsevier Inc., 949–953.
     https://doi.org/10.1016/B978-0-08-097086-8.52004-2
[17] Peter Jackson, Khalid Al-Kofahi, Alex Tyrrell, and Arun Vachher. 2003. Information
     extraction from case law and retrieval of prior cases. In Artificial Intelligence, Vol. 150.
     https://doi.org/10.1016/S0004-3702(03)00106-1
[18] Binart Kachule and Amelia Taylor. 2018. Understanding the Factors affecting the Utilisation
     of the Case Management System of the Malawi Judiciary. In EGPA 2018, EGPA Study Group XVIII
     on Justice and Court Administration. Lausanne, Switzerland.
[19] Marios Koniaris, George Papastefanatos, and Ioannis Anagnostopoulos. 2018. Solon: A
     holistic approach for modelling, managing and mining legal sources. Algorithms 11, 12 (dec
     2018). https://doi.org/10.3390/a11120196
[20] Marios Koniaris, George Papastefanatos, and Yannis Vassiliou. 2016. Towards automatic
     structuring and semantic indexing of legal documents. In ACM International Conference
     Proceeding Series. Association for Computing Machinery.
     https://doi.org/10.1145/3003733.3003801
[21] Paola Mariani and Costanza Badii. 2005. Methods and techniques for building a digital
     historic-law dictionary. In Proceedings of the International Conference on Artificial
     Intelligence and Law. 230–231. https://doi.org/10.1145/1165485.1165523
[22] James C Phillips and Jesse Egbert. 2017. Advancing Law and Corpus Linguistics: Importing
     Principles and Practices from Survey and Content-Analysis Methodologies to Improve Corpus
     Design and Analysis. Brigham Young University Law Review 2017, 6 (2017).
[23] Gianluca Pontrandolfo. 2012. Legal Corpora: an overview.
[24] Peter E. Raper. 2017. Indigenous common names and toponyms in Southern Africa. Names 65, 4
     (2017), 194–203. https://doi.org/10.1080/00277738.2017.1369742
[25] Shahmin Sharafat, Zara Nasar, and Syed Waqar Jaffry. 2019. Data mining for smart legal
     systems. Computers & Electrical Engineering 78 (sep 2019), 328–342.
     https://doi.org/10.1016/J.COMPELECENG.2019.07.017
[26] Friedemann Vogel, Hanjo Hamann, and Isabelle Gauer. 2018. Computer-Assisted Legal
     Linguistics: Corpus Analysis as a New Tool for Legal Studies. Law and Social Inquiry 43, 4
     (2018). https://doi.org/10.1111/lsi.12305
[27] Chunyu Xia, Tieke He, Wenlong Li, Zemin Qin, and Zhipeng Zou. 2019. Similarity Analysis of
     Law Documents Based on Word2vec. In Proceedings - Companion of the 19th IEEE International
     Conference on Software Quality, Reliability and Security, QRS-C 2019.
     https://doi.org/10.1109/QRS-C.2019.00072

A      SOME DEFINITIONS
Markup. The markup adds what is usually called external information, meaning information about
the text. Legal markup for court judgments includes: case name, case number, court of hearing,
date of case registration, date of judgment, judge, legal parties such as appellant and
respondents, lawyers, court clerks.
Simple Structural Annotation. The word structure is used to mean a particular general
arrangement that is present in most texts. The simplest arrangement can be one in which the text
is arranged in paragraphs, or a text may be arranged in chapters or sections, or even more
generally, as having the three main parts of an introduction, a body and a conclusion. These
structural components follow a tree-like hierarchy.
Complex Structural Annotation. In this sense, structure is dependent on the nature of the text.
For example, a case judgment typically has portions of text in which the facts of the case are
presented, followed by proceedings or the history of the case, e.g., previous rulings, a
discussion of the relevant points of law and a conclusion for the case. Structure may also mean
rhetorical styles which are used in some parts of the text.
Legal Annotations. The annotation in this case refers to locating specific pieces of text. These
can be specific words or phrases. Usually the pieces of interest appear next to each other in
the text, but sometimes they do not. In the case of legal text, one is interested in (a) legal
terminology; (b) citations to laws and statutes; (c) citations of other cases.
Legal Resolution. Annotations with case citations or law citations need to be standardised so
that documents can be hyperlinked.
Legal Classification. This usually refers to a semantic arrangement of the text into a
predefined list of categories according to pre-established criteria. For example, court
judgments can be classified according to a court taxonomy, e.g., civil cases versus criminal
cases versus commercial cases. Some classification criteria are not linked to a taxonomy, e.g.,
one can classify court judgments based on the type of crime they mostly deal with, say theft
versus homicide.
Topic Extraction. Topic extraction attempts to discover the most important or relevant keywords
in documents. So for example, one would use this to check if the text at hand contains health
advice or a football match commentary. It is common to use topic extraction in order to classify
documents.
(Un)Structured Legal Text. Legal text is by nature quite well organised internally; however, by
structured legal text we mean text that contains some or all of the above. Unstructured legal
texts are doc or pdf files, or scanned images of such documents, that apart from being stored
electronically do not contain any of the above.
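
As an illustration of the Legal Resolution entry above, the standardised tail of a case citation, such as [1994] MLR 288 (HC) at 307 or the neutral form [2019] MWSC 8 from Section 4.2, can be picked out with a regular expression and turned into a key suitable for hyperlinking. This is only a minimal sketch under assumptions: the pattern, the function names and the key format are hypothetical and are not the method used to build the corpus, and party names are deliberately ignored because they need lookup or a trained model.

```python
import re

# Hypothetical pattern for the standardised tail of a citation; party names
# are left out on purpose.
CITATION_TAIL = re.compile(
    r"\[(?P<year>\d{4})\]\s+"          # [1994]
    r"(?P<series>[A-Z]+)\s+"           # MLR (law report) or MWSC (court code)
    r"(?P<number>\d+)"                 # 288 or 8
    r"(?:\s+\((?P<court>[A-Z]+)\))?"   # optional court, e.g. (HC)
    r"(?:\s+at\s+(?P<pinpoint>\d+))?"  # optional pinpoint, e.g. at 307
)


def parse_citation(text):
    """Return the citation components as a dict, or None if no tail is found."""
    match = CITATION_TAIL.search(text)
    return match.groupdict() if match else None


def standard_key(parts):
    """Build a normalised key (assumed format) usable as a hyperlink target."""
    return "{year}-{series}-{number}".format(**parts)


reported = parse_citation("Republic v Chizumila and others [1994] MLR 288 (HC) at 307")
neutral = parse_citation("Dalikeni and Others v The Republic [2019] MWSC 8")
print(standard_key(reported), standard_key(neutral))
```

Unreported cases, such as the HC/PR confirmation case quoted in Section 4.2, do not follow this shape and would need separate handling.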


B    CORPUS FILES EXAMPLES

<?xml version="1.0"?>
<TEI.2 lang="en" n="2010_17" id="judg_2010_17">
....
<titleStmt><title type="full">
<title type="main">Elizabeth Bonomali Vs The State</title>
<title type="sub">Criminal Appeal Case No 7 of 2010</title>
</title></titleStmt>
....
<catRef target="#courtofhearing">
<keywords>
<list type="courts">
<item>IN THE HIGH COURT OF MALAWI</item>
<item>PRINCIPAL REGISTRY</item>
</list>
</keywords>
....
<front>
<list type="caseinfo">
<item>CRIMINAL APPEAL CASE NO 7 OF 2010</item>
</list>
<list type="parties">
<item>ELIZABETH BONOMALI</item>
<item>THE REPUBLIC</item>
</list>
<list type="coram">
<item>HON JUSTICE J M CHIRWA</item>
<item>Mr Lemucha of Counsel for the State</item>
<item>Chipembere of Counsel for the Accused</item>
<item>N Nyirenda Official Interpreter</item>
</list>
</front>
<body>
<p n="2">The Appellant, Elizabethe Bonomali, was convicted
      after a full trial of the offence of unlawful wounding
      contrary to Section 214 (a) of the Penal Code and sentenced
       to 12 months' imprisonment with hard labour by the First
      Grade</p>
<p n="3"> Magistrate's court at Dalton Road, Limbe, on the 25th
      day of February, 2010. She has appealed to this Court
      against both the conviction and sentence.</p>
<p n="4">When the Appeal came up for hearing on the 26th day
      of March 2010 the Appellant indicated that she had
      abandoned her appeal against the conviction and that her
      complaint remained against the sentence only. I thus leave
      the conviction endorsed by the Learned Magistrate
      unfettered with.</p>
.....
</body>

Table 4: Keywords for extracting legal parties generated from the heading of judgments

    Modifiers     Legal Functions       Case Parties
    Chief         Reporter              Appellant
    Senior        Advocate              Respondent
    Principal     Interpreter           Applicant
    Acting        Magistrate            Accused
    Legal Aid     Justice               Defendant
    Deputy        Prosecutor            State
    Resident      Clerk                 Convict
    Principal     Recording Officer     Republic
    Official      Judge                 Plaintiff
    Deputy        Lawyer                Coram
    Court                               Principal Witness
    Honourable                          Republic
    Acting                              Counsel

Table 5: Final Merged Entities for Judgment 17 of 2010 of MWCC

    paragNo  Merged Entity                                                      Start   End
    2        Section 214 (a) of the Penal Code                                  117     150
    5        Sections 339 and 340 of the Criminal Procedure and Evidence Code   897     961
    7        Sections 339 and 340 of the Criminal Procedure and Evidence Code   1983    2047
    8        Sections 339 and 340 of the Criminal Procedure and Evidence Code   88      152
    10       Sections 339 and 340 of the Criminal Procedure and Evidence Code   183     247
    12       Section 254 of the Penal Code                                      1034    1063
    13       Sections 339 and 340 of the Criminal Procedure and Evidence Code   30      94
    14       Section 339 (1):                                                   0       16
    15       section 283 of the Penal Code                                      477     506
    15       Section 340 (1 ):                                                  517     534
    15       Section 339                                                        792     815
    16       sections 15 and 16                                                 22      40
    16       section 283 of the Penal Code                                      287     316
    17       Section 339 of the                                                 226     244
    18       Section 340 of the Criminal Procedure and Evidence Code            68      123
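
The merged entities of Table 5 combine a section reference with the name of the law that follows it, and record character offsets within the paragraph. The sketch below shows one way such a merge could be done once the two kinds of span are available; it is an illustration under assumptions rather than the procedure actually used. The function names, the rule that only "of" or "of the" may separate the two spans, and the use of str.find to locate the spans are all hypothetical; the labels LAW and LAWNAMES follow Section 5.

```python
import re

# Assumed connector between a section reference and a law name.
CONNECTOR = re.compile(r"\s+of(\s+the)?\s+", re.IGNORECASE)


def merge_citations(text, spans):
    """Merge each LAW span with a directly following LAWNAMES span.

    spans is a list of (label, start, end) tuples; duplicates are dropped
    first. Returns (entity text, start, end) tuples in document order.
    """
    ordered = sorted(set(spans), key=lambda s: (s[1], s[2]))
    merged, i = [], 0
    while i < len(ordered):
        label, start, end = ordered[i]
        if label == "LAW" and i + 1 < len(ordered):
            next_label, next_start, next_end = ordered[i + 1]
            gap = text[end:next_start]
            if next_label == "LAWNAMES" and CONNECTOR.fullmatch(gap):
                merged.append((text[start:next_end], start, next_end))
                i += 2
                continue
        merged.append((text[start:end], start, end))
        i += 1
    return merged


def locate(text, phrase, label):
    """Helper for the example only: span of the first occurrence of phrase."""
    start = text.find(phrase)
    return (label, start, start + len(phrase))


paragraph = ("The Appellant, Elizabethe Bonomali, was convicted after a full trial "
             "of the offence of unlawful wounding contrary to Section 214 (a) of "
             "the Penal Code and sentenced to 12 months' imprisonment with hard labour")
spans = [locate(paragraph, "Section 214 (a)", "LAW"),
         locate(paragraph, "Penal Code", "LAWNAMES")]
print(merge_citations(paragraph, spans))
```

Applied to paragraph 2 of the judgment shown above, the merged span covers the same entity and offsets as the first row of Table 5.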


C    TYPES OF LAW CITATIONS
    • References containing only the name of the law/statute
      The following offences involving dishonesty in the Penal Code
      are based on circumstances.... or ...the Control of Goods Act
      derives its procedure in criminal matters from the Criminal
      Procedures and Evidence Code...
    • References containing labels and names of the law
      Section 11 (2) of the Supreme Court of Appeal Act. or Section
      283 of the Penal Code.
    • References containing labels and abbreviations, or additional
      names by which a law is known (usually appearing in brackets)
      section 6 of the Control of Goods (Import and Export)
      section 4 (d) of Part II of the Schedule to Bail (Guidelines) Act
      s. 149 of CP&EC
      section 17(d) and 42 of the Liquid Fuel and Gas (Production
      and Supply) Act
    • References containing labels, names or abbreviations, and
      the year or date applicable to the law
      review of section 15 of the Code: it is commonplace that the
      CP&EC was amended in 2010
      section 340(3) of the Proceeds of Crime Act 2002 (POCA)
    • References to laws pertaining to other countries
      (e.g., UK laws mentioned in Malawi court judgments)
      section 145 of the New Zealand Crimes Act of 1961
      offences against the Person Act, 1861 as held in R v Dica [2004]
      2 Cr. App. R. 28
    • References by means of anaphors spanning more than one
      line, sentence, or paragraph
      Section 12 of the Act...
      section of the same constitution ...
      ...in the Penal code...theft from a person (section 282(a)); theft
      from a dwelling house (section 282 (b))....
    • References containing more than one label or number, e.g.,
      Section 2, 3 and 5 of ...


D    RESULTS OF THE SPACY EXPERIMENTS

Table 6: Example of improvements in precision but not recall using the lg versus the sm spaCy model.

 Model   Parag    Pos. In Parag    Entity
 sm      2        181              Penal Code
 sm      46       86               section 187(1
 lg      51       112              section 331
 lg/sm   51       127              the Penal Code
 lg      73       75               Bill of
 lg/sm   82       33               section 328
 sm      86       313              Act
 sm      86       396              Act
 lg      86       157              an Act of Parliament
 lg      86       228              an Act of Parliament
 lg/sm   86       29               Constitution
 lg/sm   86       106              Constitution
 lg      86       88               section 37
 sm      86       376              section 4(1
 lg/sm   86       320              the Official Secrets Act
 lg      90       383              an Act of Parliament
 lg/sm   93       115              Freedom of Information Act 2000
 sm      93       151              the Data Protection Act
 lg      93       151              the Data Protection Act 1998
 lg/sm   95       42               Section 356

Table 7: Number of LAW entities retrieved using the standard spaCy model and by an enhanced method (+ EntityRuler and PhraseMatcher).

 Year     spaCy    Enhanced       Merged Entities      spaCy Recall
 2010     507      1,162          611                  44%
 2011     554      1,310          635                  42%
 2012     153      400            184                  38%
 2013     3,406    8,432          4,769                40%
 2014     621      1,640          863                  38%
 2015     1,044    2,414          1,378                43%
 2016     469      1,055          589                  44%
 2017     236      616            295                  38%
 2018     597      1,374          772                  43%
 2019     197      526            294                  37%
 TOTAL    7,784    18,929         10,390               41%
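
The recall column in Table 7 is not defined in this appendix. The reported percentages are, however, consistent with dividing the standard spaCy count by the enhanced count for the same year. The sketch below checks that reading against the figures in the table; the formula is an inference from the numbers, and the names TABLE_7 and recall_percent are hypothetical.

```python
# Counts copied from Table 7: year -> (spaCy, enhanced, reported recall in %).
TABLE_7 = {
    2010: (507, 1162, 44),
    2011: (554, 1310, 42),
    2012: (153, 400, 38),
    2013: (3406, 8432, 40),
    2014: (621, 1640, 38),
    2015: (1044, 2414, 43),
    2016: (469, 1055, 44),
    2017: (236, 616, 38),
    2018: (597, 1374, 43),
    2019: (197, 526, 37),
}


def recall_percent(baseline, enhanced):
    """Baseline count as a rounded percentage of the enhanced count.

    Assumption: this is how the recall column was derived; the source
    does not state the formula.
    """
    return round(100 * baseline / enhanced)


# Column totals, for comparison with the TOTAL row of the table.
total_spacy = sum(spacy for spacy, _, _ in TABLE_7.values())
total_enhanced = sum(enhanced for _, enhanced, _ in TABLE_7.values())

for year, (spacy, enhanced, reported) in TABLE_7.items():
    print(year, recall_percent(spacy, enhanced), reported)
print("TOTAL", recall_percent(total_spacy, total_enhanced))
```

Under this reading, the standard model retrieves roughly two in five of the LAW entities found by the enhanced method in every year.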

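Figure 4: Example of pattern for extracting section citations for use with spaCy EntityRuler
# Token pattern for the EntityRuler; matches are labelled SECLAW.
patterns = [{
    "label": "SECLAW",
    "pattern": [
        # Label token: "sec", "sec.", "section" or "sections" (initial s or S)
        {"TEXT": {"REGEX": r"^[Ss](ec\.?|ection|ections)$"}},
        # Optional section number
        {"IS_DIGIT": True, "OP": "?"},
        # Two sub-label slots: one required wildcard token each,
        # with optional surrounding brackets
        {"ORTH": "(", "OP": "?"}, {}, {"ORTH": ")", "OP": "?"},
        {"ORTH": "(", "OP": "?"}, {}, {"ORTH": ")", "OP": "?"},
        # Optional "of" introducing the name of the law
        {"LOWER": "of", "OP": "?"},
    ],
}]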
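
Several of the citation forms listed in Appendix C (a label, one or more section numbers, and an optional law name) can also be approximated with a plain regular expression. The sketch below uses only Python's standard re module and is a stand-in for illustration, not the token-based method of Figure 4. The names NUMBER, LAW, SECTION_RE and find_section_citations are hypothetical, and the expression makes no attempt to resolve anaphoric references such as "the Act" or "the same constitution".

```python
import re

# One section number with optional bracketed sub-labels, e.g. "340(3)" or "339 (1)".
NUMBER = r"\d+[A-Za-z]?(?:\s*\(\s*\w+\s*\))*"
# A law name: capitalised words, optionally joined by "of" or "and".
LAW = r"[A-Z][\w&]*(?:\s+(?:(?:of|and)\s+)?[A-Z][\w&]*)*"

# Assumed shape of a citation: label, number list, optional "of (the) <law>".
SECTION_RE = re.compile(
    r"\b(?P<label>[Ss]ections?|[Ss]ec\.|[Ss]\.)\s*"
    rf"(?P<numbers>{NUMBER}(?:\s*(?:,|and)\s*{NUMBER})*)"
    rf"(?:\s+of\s+(?:the\s+)?(?P<law>{LAW}))?"
)


def find_section_citations(text):
    """Return a (numbers, law name or None) pair for each citation found."""
    return [(m.group("numbers"), m.group("law")) for m in SECTION_RE.finditer(text)]


# Example sentences taken from the citation types in Appendix C.
for sentence in (
    "Section 283 of the Penal Code.",
    "s. 149 of CP&EC",
    "section 340(3) of the Proceeds of Crime Act 2002 (POCA)",
):
    print(find_section_citations(sentence))
```

A character-level expression like this is easy to test in isolation, but unlike the EntityRuler pattern it does not reuse spaCy's tokenisation, so its spans would still need to be aligned with token boundaries before being added as entities.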