MWCC: A Corpus of Malawi Criminal Cases

Amelia V. Taylor
ataylor@poly.ac.mw
University of Malawi, The Polytechnic and tNyasa Ltd, Data Labs
Blantyre, Malawi

ABSTRACT

We describe the creation of a corpus of criminal court judgments issued by the Malawian courts. We highlight opportunities and challenges in machine understanding of this text.

KEYWORDS

Legal corpus, Entity recognition, Text annotation and markup

ACM Reference Format:
Amelia V. Taylor. 2020. MWCC: A Corpus of Malawi Criminal Cases. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, 24 August 2020, San Diego, US. ACM, New York, NY, USA, 9 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
NLLP @ KDD 2020, August 24th, San Diego, US
© 2020 Copyright held by the owner/author(s).

1 INTRODUCTION

This article presents the creation of a corpus of criminal case judgments issued by appellate courts in Malawi and our experiments in preparing this text to be used with machine learning algorithms.

In Malawi, legal researchers face significant challenges in accessing and searching for relevant information. The Malawi Judiciary Development program, which ran from 2003 to 2008, found that "an inadequate provision of fundamental legal resources, such as books, case reports, statute books and gazettes, greatly constrains the performance of the judiciary in its administration of justice". In 2013, the Malawi Judiciary, with funding from the European Union, introduced a case management system for use in the High Court and by the Director for Public Prosecution [6, 18]. This new system has improved the case registration process but suffers from bottlenecks in processes and document logging; few case documents and final judgments are stored on the system and most of these contain no meta-data [18].

In the last few years, MalawiLII¹ has provided online access to some of the court judgments, laws and statutes in Malawi². MalawiLII does not support a system of citation that makes it possible to link statutory law, case law and secondary law or to search by "legal terms" and their specific interpretations.

¹ malawilii.org
² Albeit not complete or up to date, this is an improved successor to the listing of judgments that was initially done on the SDNP webpage, which listed judgments, court documents, some of the laws and the Constitution of Malawi: http://www.sdnp.org.mw/index-archived.php

In view of these challenges, we started the development of an automatic tool that (I) provides meta-data for criminal court judgments on MalawiLII by demarcating their text into components such as headers, introduction, body and conclusion and extracting meta-data such as names of judges, dates, court of hearing; and (II) provides a useful classification of the judgments for legal research.

In this paper we describe the creation of the corpus used in these two tasks and our results regarding the first. The paper is structured as follows. In Section 2 we review relevant literature. In Section 3 we describe the steps we took in creating the Malawi Criminal Cases Corpus (MWCC) and discuss adding markup to the files. In Section 4 we describe the types of annotations of law and case citations we added to the corpus. In Section 5, by means of examples from our corpus, we illustrate challenges in machine understanding of legal text. We conclude in Section 6.

2 LITERATURE REVIEW

A corpus is 'a collection of examples of language in use that are selected and compiled in a principled way' [16]. A list of corpora containing legal text is given in [23, 26]. These vary in size and genre coverage³. A small number of the corpora listed specialise in criminal judgments. However, these are not available or regularly maintained and seem to have been developed with only a specific research objective in mind: the HOLJ House of Lords Judgments Corpus is a small subset, containing 188 texts, of the collection of the House of Lords Judgments and was used for summarisation and rhetorical structural annotation [14, 15]. The Corpus de Sentencias Penales 2005-2010 was used to study 'legal phraseology' [26].

³ Some contain a wide range of types of text, e.g., academic journals, textbooks, contracts, opinions, legislation (e.g., the American Law Corpus); some contain only law reports but cover several types of cases from administrative to criminal cases (e.g., the Corpus of US Supreme Court Opinions); some contain legal text of historical importance (for example, the Old Bailey corpus is a historical corpus covering 197,745 texts over 1674-1913); some are multi-lingual like the ones covering European legislation (e.g., JRC-Acquis, Bononia Legal Corpus).

There are also clusters of research around some corpora, e.g., corpora of Italian legal text have been used in generating dictionaries of legal terms [21], in analysing their usage [10], and to assist in translations [12]. Similarly, a corpus of Dutch legislation was annotated using the Metalex XML scheme⁴ and then enhanced with meta-data regarding the document structure, external descriptive data and law citations meta-data [8]. Corpora of Greek Tax Legislation [19, 20] and Greek Supreme Court decisions [11] were enhanced with XML structural mark-up and annotations. These projects did not use machine learning but made use of linguistic features of the text, regular expressions or syntactic parsers and grammars for data extraction.

⁴ http://www.metalex.eu/

The process of machine understanding of legal text involves a great deal of semantic enhancement, in order to make explicit or machine understandable 'the flexibility, intuition and capabilities of the conceptual structures of the human languages' in readiness for Web 4.0 [5]. The reality is that most legal text that is being made available online at present is in an unstructured form, i.e., it has almost no meta-data and no annotations that make it possible to hyperlink it or make it 'machine understandable'. Hence, current work is still largely focussed on taking a collection of unstructured text and adding markup and annotations. As we are dealing with written text, there is already an inherent organisation within the text itself. So in this sense, 'adding structure' means to extract from or externalise this organisation in the form of markup or annotations on the text. We included a discussion on this terminology in Appendix A. The degree of 'organisation' in legal text varies a lot: text containing legislation is said to be more organised than that of court judgments, notarial contracts [4] are more organised than legislative texts, and among legislations, those referring to tax and administrative law [8] are more structured.

In the case of court judgments, there are differences in how courts within the same country and/or courts in different countries structure the text. However, all court judgments share common elements. They all contain, usually in their introduction, information such as the courts of hearing, dates and case numbering or docket numbers, names of the judges and other legal parties involved. They all follow a certain legal rhetoric in which facts are presented, points of law are discussed and finally the judgment is concluded. There are also common conventions that are used for citing laws and other cases. Some of these regularities were used to develop and test algorithms that employ machine learning techniques to 'understand' legal text, e.g., resolution of names of legal parties such as judges [9], resolution of citations to laws or other cases [17], extracting citations to laws [25], automatic summarization of court judgments [7, 14].

The trend seems to be that researchers collect their own data and use that to develop or test algorithms; a particular data set may never be used in another study. Noting this trend, [22] sets out a best-practice guide for the collection and analysis of legal corpora for linguistic analysis to ensure a certain degree of generality of the research results found when using a custom corpus. Generalisation issues may come from the impact that the genre of the text within a corpus has, for example, on the task of assigning meaning to terms, e.g., collocates of "breach" across different corpora may belong to different definitions/meanings of the word.

We think that there is a need to set out similar guidelines for the use of legal corpora for machine learning purposes. For machine learning algorithms, generalisation challenges can be even less obvious because of the interplay between the impact of the language models used and of the differences in size and type or 'genre' of the training data versus the test data. An experiment that measures text similarity of legal documents [27] showed that a word2vec model was better than a bag of words model and that the size of the training data compared to that of the corpus impacts the accuracy of the similarity results. However, to explain these results the authors spend very little time describing their data apart from describing it as 'selected larceny cases'.

Putting together a corpus takes significant time and involves a diversity of linguistic and computing skills. In building the MWCC corpus we tried to ensure a certain 'separation of the corpus design' from research design in order to ensure that other researchers will use our corpus. We present the construction of a corpus of Malawi criminal case judgments from a set of 'unstructured' text files and describe some of our experiments in adding markup and annotations to the judgments. By means of examples from our corpus, we reflect on the importance of cooperation between linguistic and machine learning expertise in putting together legal corpora to solve challenges in machine understanding of the legal text. For reasons of space limitations we placed some important terminology definitions in Appendix A.

3 THE MALAWI CRIMINAL CASES CORPUS (MWCC)

We followed the guidelines of [22] in creating the corpus.

3.1 The Target Domain for MWCC Corpus

The data for the MWCC corpus are the criminal case judgments stored as electronic doc files, pdf files and scanned images. These are obtained from the High Court Library. The librarian scans the physical judgments received from the High Court Registry, page by page, and stores them as pdf files. The physical papers are then catalogued in folders by year, and some of the scanned judgments are sent to law firms, judges and other parties which subscribe to this service. These are also uploaded to MalawiLII. The electronic scans have been named by the High Court Librarian according to a convention: [Case Name] [Case Type] [Case Number] [Case Year]. For example, Lawrence Chibwana Vs The State Criminal Appeal No. 42 of 2010.pdf. In some cases the name of the judge is also present in the title. In some cases, the naming of files does not correspond to their content, or names of parties have been misspelled.

The names of cases as retrieved from file names can be used to create a case citator database or, if one exists, to cross check them against it. To our knowledge the Malawi High Court Library does not systematically maintain a case citator database. For legal researchers, it is important to know which of these cases have been reported in official law reports as these receive a special naming convention. Identifying a citation is only useful if it can be 'resolved' and matched against an external knowledge source. A manual search for prior cases typically involves formulating a query (using party names, dates, docket numbers, and courts), retrieving documents from a database of millions of opinions, and iterating the process until the right cases are found. Challenges in case name resolution were discussed in [17], where the authors describe the development of a tool that provides automated assistance to the citators of Thomson Legal and Regulatory. In some cases, a citation cannot be resolved if there is insufficient data in the context or if the judge refers to case documents that are not available or numbered (e.g., references to affidavit documents attached to the case).

3.2 The Design and Collection of the MWCC Corpus

We collected 682 criminal court judgments issued over 2010-2019 by the High Court and the Supreme Court of Malawi. These were stored as scanned images of physical documents. The files were roughly organised on disk according to the year in which they were issued. The steps we took in the preparation of the text for the MWCC corpus are:

(I) File cataloguing: re-name the files with shorter names, remove special symbols, and maintain a mapping for the naming. There was also a need to correct misspellings of names of parties present in the title or remove duplicates of files. There were also cases in which several judgments were scanned together and saved in one file; these we had to split.
(II) Image adjustments: straighten, remove watermarks, remove imperfections due to the scanning process.
(III) Batch OCR: run page by page OCR obtaining text corresponding to each line (word by word) in the image, saving this in json files which also contain some text formatting information, such as distances between lines, and font sizes.
(IV) Text Reconstruction and Corpus Creation: reconstruct the text from the files obtained by OCR and create the corpus files in text and XML format.

We used Python openCV to deal with watermarks and markings on the text, we wrote a Python batch program to split and merge back the images, we used the ocr.space API⁵ for the OCR on the images, and we then used custom Python code to process the json files returned by the OCR API.

⁵ https://ocr.space

The image preparation stage could be improved by using techniques for automatically detecting image features which, if known in advance, can be useful for improving the quality of the OCR: most judgments contain official stamps, some outside the text, some on top of the text, and most contain signatures of the judges or official clerks. These can be isolated or removed before the OCR.

The trickiest part of the OCR process on these judgments was the presence of headers, footers and footnotes. The headers usually contained pagination and/or the name of the case contained in the document. The header could not always be removed automatically based on text features, such as font size or distance to the main body of the text, as in many cases the font was the same and the headers were so close to the main text of the judgment that they appeared to be a normal part of the text. The footnotes also cannot be removed automatically because they contain relevant legal information. The footnote example in Figure 1 contains several case citations, e.g., [1994] MLR 288 (HC) at 307. This is an incomplete citation where one part, the case name, is in the main judgment text and the case citation is in the footnote. The ocr.space API extracts all textual information including the footnotes but these are not distinguished from the rest of the text. Heuristics based on structural information such as indentation, differences in font sizes, and distances from the main text could be used to recognise footnotes with some success.

Figure 1: Example of footnotes in court judgments.

Another challenge was the frequent use of quotations, where a judge was discussing points relevant to the case at hand using extracts from law or from relevant cases. Some quotations used block quotes or other quotation marks. Others used indentation, italics or syntactical clues in the form of specific keywords that indicate their presence. It may be beneficial to use extra processing steps (e.g., using Tesseract⁶) to identify the presence of quotes in the text and to mark these as special parts of the text flow.

⁶ https://github.com/tesseract-ocr/tesseract

The electronic files of the corpus are structured into folders, one for each year. Each judgment has three files: a text file for the introduction, a text file for the body with each paragraph on one line, and a TEI XML file with markup and judgment paragraphs. We also have a separate file that maps the name of each file in the corpus to the name of the raw data file.

3.3 MWCC Corpus Statistics

We can describe our corpus according to the criteria in [2] as full text (each text in the corpus is unabridged), synchronic (it covers the period 2010-2019 and hence there is no 'noticeable' change over this period in the way language is used or in the vocabulary used), terminological (our text contains both general and specific legal terms), and monolingual (but containing names of people, organisations and geographical places that are typical of Malawi). The corpus contains 1,572,956 tokens, 1,374,635 words (a word may appear more than once), 63,574 sentences and 22,124 paragraphs extracted from 682 documents. There are 29,238 unique words, giving a lexical variation (unique words divided by total words) of 2.1%. Table 1 shows a breakdown of cases per top 10 judges and Table 2 shows the breakdown of cases per year and the sizes of the yearly sub-corpora.

Table 1: Malawi Criminal Cases in MWCC by top 10 judges (out of a total of 35 judges) in order of number of judgments issued.

Judge Name                      No. Cases
CHIRWA, J. M.                   106
KAMANGA-NYAKAUNDA, D.           65
KAMWAMBE, M.L.                  71
KALEMBERA, S.A.                 25
MADISE, D.T.K.                  45
MBVUNDULA, R.                   28
MWAUNGULU, D.F.                 81
NYERENDA, K.                    51
SIKWESE, R. S.                  37
Percentage of Total (627/682)   92%

Table 2: Composition of the MWCC by year

Year    No. Cases   Tokens      No. Parag.   Max. Avg. Parag. Len.
2010    85          162,960     2,096        232
2011    72          155,154     2,959        131
2012    20          54,149      720          189
2013    162         426,584     6,840        200
2014    85          141,115     2,066        96
2015    122         274,583     3,538        131
2016    46          106,069     1,273        128
2017    27          52,038      810          42
2018    42          153,572     1,454        157
2019    21          46,732      368          223
Total   682         1,572,956   22,124       232

We used Sketchengine⁷ to analyse the corpus in terms of part of speech tags, word lists and collocations. Table 3 shows the main part of speech frequencies for words that appear at least 5 times, excluding non-words. These represent 80% of our corpus. The percentage distributions are calculated on the whole corpus. Nouns, verbs and prepositions appear quite frequently. We also notice frequent use of adjectives; here are the top fifteen adjectives: criminal, other, low, such, first, guilty, same, unreported, reasonable, maximum, convict, public, present, appropriate, excessive. Several of these are specific to legal language. The top ten most frequent nouns are all specific to legal language: court, sentence, case, evidence, offence, appellant, section, person, court, theft.

⁷ https://sketchengine.eu

Table 3: Main Parts-of-Speech (items with frequencies higher than 5). These represent 80% of our corpus.

Part of Speech   No. Items (Lemmas)   Freq.     Distribution
Noun             3,959                393,777   29%
Verb             1,247                231,126   17%
Adjectives       953                  75,894    6%
Adverbs          448                  58,793    4%
Prepositions     81                   223,172   16%
Conjunctions     13                   43,975    3%
Pronouns         29                   53,546    4%
Numerals         27                   7,702     1%

Using Sketchengine we could also analyse the language used in our corpus compared to other corpora. In particular, we can look at corpora built for general English language use such as the English Web corpus 2013, an English corpus made up of texts collected from the Internet, containing 15 billion words. Compared to this corpus, in ours we see a much heavier use of prepositions and a lesser use of verbs compared to nouns. We can also find those n-grams or multi-words which appear frequently in our corpus and very rarely in the comparison corpus, such as criminal procedure, hard labour, maximum sentence, theft simpliciter, first offender, accused person, reasonable doubt. Such a comparison can be used to extract features useful for a machine learning classification or topic extraction analysis.

There is also language that is particular to certain judges, e.g., theft simpliciter is used mainly by judge D F MWAUNGULU.

While word-lists and lists of keywords give us some useful statistics about the composition of our corpus, they do not take into account the context in which terms occur. When looking at a specific sequence of tokens/words, the context surrounding a keyword is important. Such an analysis is called concordance or collocation analysis. Concordances are useful in finding out relevant connections between words (modifiers of specific words) and also in revealing multi-word units, e.g., detecting names of organisations, such as High Court of Malawi, or names of legal functions such as court clerk, attorney to state council, etc.

A collocation is a sequence or a combination of words that occur together more often than would be expected by chance. The strength of collocation is measured by the LogDice score (the higher the score, the stronger the collocation). Word collocations can help understand the usage pattern of key legal terms, e.g., the top modifiers of murder as a verb are brutally, mercilessly, allegedly. These can indicate the seriousness of the crime and/or the intention. The collocates of crime are consequence, offender, alibi, criminal, circumstance, and the word 'criminal' has the strongest collocations with dangerous, hardened, unknown, hardcore, habitual. Collocates for key legal terms can be used in topic extraction and the classification of judgments.

3.4 Adding Structural Markup to the MWCC Corpus

We have two formats for the files of the corpus: (a) an all text format and (b) an XML TEI format⁸. All judgments contain a front cover with information on the parties, the court of hearing, the dates and number of the case, and the coram who heard the case (which includes the judge, attorneys and other judicial clerks). It is possible to automatically separate this part from the main body of the judgment. In the text only format of the corpus, we keep separate files for the introduction (an example is given in Figure 2) and separate files for the paragraphs of the body of the judgment, with each paragraph stored in one line of text.

⁸ https://tei-c.org/

Figure 2: Introduction Part for Judgment 1 of 2013

JUDICIARY
IN THE HIGH COURT OF MALAWI
PRINCIPAL REGISTRY
CONFIRMATION CASE NO 689 OF 2013
Being Criminal Case No. 719 of 2013 from the Second Grade Magistrate Court Sitting at Chikhwawa
THE REPUBLIC
Versus
MATEYU THOM
THE HONOURABLE JUSTICE KENYATTA NYIRENDA
Margaret Munthali, Senior State Advocate, for the State
Accused person, Absent/unrepresented
Mrs. D. Mtegha, Official Court Interpreter
ORDER IN CONFIRMATION

We based our separation of the introduction from the rest of the body on an algorithm that (a) looks for the presence of specific terms such as ORDER IN CONFIRMATION, RULING and (b) uses formatting differences such as distances between lines of text used in the introduction versus the rest of the text.

Subcorpus of Introductions. We thus obtained a sub-corpus made up of only the introduction parts of the judgments. Out of that, we created a dictionary of legal keywords from all introductions (Table 4, Appendix B), which was then used to extract the legal parties involved in a case, such as the names of the parties, the judge, etc. This external meta-data was then added into the XML version of the corpus as meta-data for each judgment. An example of this meta-data is given in Appendix B.

While our approach did not involve machine learning, there is scope to use our sub-corpus to test supervised learning approaches to extract this information. In [3], the authors did something similar to us in the sense that they extracted formatting features which were later used in a supervised algorithm for extracting headings from pdf documents.

3.5 Chunking and Proper Names Recognition

Chunking poses many challenges. Some judgments are very long and may contain long paragraphs. Table 2 gives an indication of the maximum average length of judgments per year: ranging from 90 to over 200 tokens. We debated whether to store the text line by line, to split it into sentences using punctuation or to group the text in the same logical paragraphs as they were in the original images. We opted for the latter. We wanted to make sure we capture situations in which entities of interest break across lines. For example, in some case citations, one line may contain the names of the parties and another line the court and dates. We used a heuristic based on the distances between lines to re-arrange the text to match the original paragraphs. We did not use punctuation to split into sentences because the text contained many 'entities' or elements which make use of full-stops, e.g., numbers, references to sections of law.

We used POS tagging for extracting parts of our text which were likely to contain references to laws and cases. The English TreeTagger PoS tagset used by Sketchengine struggled with proper nouns because legal text makes use of capitalisation of many words for legal terms such as laws, e.g., Penal Code, legal parties, e.g., Appellant, or legal functions, e.g., Court Interpreter, references to laws, e.g., "Section", or names of crimes, e.g., "Manslaughter". These were usually tagged as nouns, but at times they were tagged as proper nouns as in I/PP thus/RB convicted/VVD the/DT accused/VVN of/IN the/DT offence/NN of/IN Manslaughter/NP contrary/NN to/IN Section/NP 208/CD of/IN the/DT Penal/NP Code/NP, or even verbs as in Whereas/IN MUSATOPE/NP CHAPOTERA/NP was/VBD charged/VVN with/IN the/DT offence/NN of/IN murder/NN of/IN Yohane/NP Makiyi/NP contrary/NN to/TO section/VV 209/CD of/IN the/DT Penal/NP Code/NP. In this example, "section" is not capitalised, but it is tagged as a verb, possibly because of the presence of 'to' which usually precedes the infinitive form of a verb. The shape NP-NP is the most common for 2-grams in our text, and may correspond for example to names of people or places, but also to legal terms such as Appellant Andrew, Judge Mwase, legal bodies such as High Court, or Detective Sergeant, or names of laws, e.g., Drugs Act. It is therefore important to have a way of distinguishing these legal terms from the rest of the text to enable a more accurate tagging. Using a list of relevant legal keywords and their use in context may help with improving the POS tagging for legal text. We hope to look at evaluating legal-specific POS tagging methods in future research using the MWCC.

The names of people, places and organisations which are particular to Malawi are not easily recognised by existing language models. The names used in Malawi are of Bantu origin [24] with European influences, hence sometimes parts of names are recognised while others are not. Names of people frequently appear in our text. We will annotate our text with Bantu names of people and places. We think that the MWCC can be used for building a training set of typical Bantu names to be used with recent advances in BERT and transformers. For example, [1] used the BERT model to recognise names of entities in Bulgarian, Czech and Polish, and in [13] BERT was used to recognise Chinese names.

4 ADDING ANNOTATIONS TO THE MWCC CORPUS

4.1 Law Citations

There are several types of reference to laws found in our text. For example, there are references containing only the name of the law/statute:

The following offences involving dishonesty in the Penal Code are based on circumstances....
...the Control of Goods Act derives its procedure in criminal matters from the Criminal Procedures and Evidence Code.

There are references containing labels and names of the law:

Section 11 (2) of the Supreme Court of Appeal Act.
Section 283 of the Penal Code.

There are more complex types such as references by means of anaphors spanning more than one line, sentence or paragraph:

Section 12 of the Act...
section of the same constitution ...
...in the Penal code...theft from a person (section 282(a)); theft from a dwelling house (section 282 (b))..

Appendix C gives a more comprehensive list. We annotated each judgment with law citations: an example is given in Table 5 of Appendix B.

4.2 Case Citations

Case citations may refer to cases published in official law reports or to unpublished cases, each of these using different styles of citation. A citation from the Malawi Law Report is:

Republic v Chizumila and others [1994] MLR 288 (HC) at 307

where Republic v Chizumila and others are the parties involved (also forming the case name), 1994 is the year of publication of the Malawi Law Reports, 288 is the case number and 307 is the location.

Neutral citations were introduced in the UK in 2001 and are used by MalawiLII. For example, on MalawiLII the case:

Dalikeni and Others v The Republic (MSCA Criminal Appeal Case No. 6 of 2016)

is numbered as: Dalikeni and Others v The Republic [2019] MWSC 8, where MWSC stands for Malawi Supreme Court and this is the eighth case registered on MalawiLII under this court. An example of an unreported case is: Republic vs Mpinganjira Bagala HC/PR confirmation case no. 24 of 2011 (unreported 11 July 2013), where HC/PR stands for High Court Principal Registry.

The presence of names of people or organisations means that grammar rules or regular expressions cannot work on their own, and could be combined with lookup and some form of supervised learning. [25] used supervised statistical models to extract standardised case citations of the type '[1994] MLR 288' from a selection of 250 Pakistani court judgments. Their algorithms relied on training data in which case citations were manually tagged using the Inside-Outside-Beginning notation. In a much larger project at Thomson Legal and Regulatory [17], a 'citator' database was available (containing a list of all available names of cases) and the task was to resolve the citations found into the citator. A Support Vector Machine (SVM) was used to improve the accuracy of the entity (name of cases) resolution. SVMs were also used for entity resolution in [9] to match names of judges/attorneys and names of legal firms from the text files with Westlaw records of attorney and legal firm files.

We think that the extraction of case citations could, in some cases, be done directly from the scanned images, as most judges use italics or bold font when writing such citations. Then, a supervised algorithm that works on image data could be practical. However, as shown in Figure 3, the convention used in the documents of our corpus is that only the case name is formatted differently, not including the citation component. Some citations are partial, as 'Kachere and Nseula' shown in the image, and need to be resolved in context. In the next section we describe our experiments in extracting law citations.

Figure 3: Example of Case Citation formatted in Bold - containing also a partial citation which needs resolution.

5 EXPERIMENTS WITH SPACY

Our corpus served as an excellent data set to test extracting law and case citations and to generate test data for a supervised approach. spaCy (https://spacy.io/) is a Python library using state of the art neural networks for tagging, parsing and entity recognition. The Named Entity Recogniser in spaCy already has an entity for "LAW". For the English language, spaCy uses three models of varying sizes, small (sm), medium (md) and large (lg), trained using Convolutional Neural Networks on the OntoNotes 5.0 data set. The accuracy of the spaCy NER was reported to be over 80% for both precision and recall.

Our approach was as follows: we first used the standard spaCy NER to extract LAW entities, then we added an Entity Ruler to extract additional LAW entities. For example, the pattern in Figure 4 of Appendix D matches references to sections which use two-level numbering, such as Section 4 (a) or s. 4 (2) or section 42(2) (f). We used a Phrase Matcher based on a database of names of laws and statutes in Malawi to extract LAWNAMES entities. We then merged these entities (e.g., the reference part merged with the law name) into larger ones and eliminated duplicates.

Most of the citations that are recognised by the standard spaCy NER are of the type: Section [number]. However, spaCy recognition depends on a uniform use of punctuation like spaces and full stops. So for example, if there are extra spaces, e.g., Section 214 (a) instead of Section 214(a), the entity will not always be recognised. Entities of the type Sections 339 and 340 will also not be consistently recognised. References to laws of England or laws that are typically found in other countries, such as Data Protection Act, Official Secrets Act, are recognised as these were present in the model. However, names of laws more particular to Malawi were not always recognised. Table 6 of Appendix D shows examples of law citations extracted using spaCy and a comparison between the use of the lg vs sm spaCy models: some entities which were found using the small model, sm, were lost when using lg, but overall, the use of the larger model did result in a more accurate identification of the name of the law cited.

Table 7 of Appendix D shows the citations we were able to identify using first the standard spaCy NER and then an enhanced method using both an Entity Ruler and a Phrase Matcher. The use of the Phrase Matcher allowed us to extract names of laws which are specific to Malawi. With this combination, we managed to find almost all the citations within the text. The phrase matcher was used to locate the complete names of laws referred to in the citations. For example, for the judgments of year 2010, spaCy NER managed to extract 507 valid citations (some incomplete). Using the enhanced process we extracted in total 1,162 entities, which are citations (e.g., Section 224 A) and names of laws (e.g., Penal Code). When merged into full citations (e.g., Section 224 A of the Penal Code), we obtained a total of 611 citations. For the whole corpus, spaCy extracted 7,784 law citations out of a total of 18,929 obtained by the enhanced method. Overall, we extracted 10,390 law citations from our corpus. Thus, this process of extracting law citations worked reasonably well and can be used in constructing a training set of annotations for better results.

The case and law citations are stored in separate TEI files: there is an annotation file for each judgment file, containing the paragraph, the exact position inside the paragraph, the text of the annotation and its type. The position of the annotations within a paragraph can also be used to resolve incomplete citations or anaphors. Some of the citations are incomplete and do not include the name of the law. For example, the reference section 235 (a) appears several times in paragraphs 2 and 3, and some occurrences do not contain the name of the law. The context of the judgment and the classification of the laws can help in the topic identification, e.g., section 235(a) of the Penal code covers issues of causing grievous harm.

6 CONCLUSION

We described the process of creating a corpus of criminal cases issued by Malawi courts. We reflected on the challenges and opportunities in semantically enhancing this text and the need for an intelligent pipeline that processes the text at all stages; some of the semantic enhancement can be done on raw images, as we discussed for case citations. We would like to use our annotations and corpus for further training and classification.

REFERENCES

[1] Mikhail Arkhipov, Maria Trofimova, Yuri Kuratov, and Alexey Sorokin. 2019.
[24] Peter E. Raper. 2017. Indigenous common names and toponyms in Southern Africa. Names 65, 4 (2017), 194–203.
https://doi.org/10.1080/00277738.2017.1369742 Tuning Multilingual Transformers for Language-Specific Named Entity Recogni- [25] Shahmin Sharafat, Zara Nasar, and Syed Waqar Jaffry. 2019. Data mining for tion. Association for Computational Linguistics (ACL), 89–93. https://doi.org/10. smart legal systems. Computers & Electrical Engineering 78 (sep 2019), 328–342. 18653/v1/w19-3712 https://doi.org/10.1016/J.COMPELECENG.2019.07.017 [2] Atkins, Sue and Clear, Jeremy and Ostler, Nicholas. 1992. Corpus design criteria. [26] Friedemann Vogel, Hanjo Hamann, and Isabelle Gauer. 2018. Computer-Assisted Literary and Linguistic Computing 7, 1 (1992). https://doi.org/10.1093/llc/7.1.1 Legal Linguistics: Corpus Analysis as a New Tool for Legal Studies. Law and [3] Sahib Singh Budhiraja and Vijay Mago. 2020. A supervised learning approach Social Inquiry 43, 4 (2018). https://doi.org/10.1111/lsi.12305 for heading detection. Expert Systems (2020). https://doi.org/10.1111/exsy.12520 [27] Chunyu Xia, Tieke He, Wenlong Li, Zemin Qin, and Zhipeng Zou. 2019. Similarity [4] María G. Buey, Angel Luis Garrido, Carlos Bobed, and Sergio Ilarri. 2016. The Analysis of Law Documents Based on Word2vec. In Proceedings - Companion of AIS project: Boosting information extraction from legal documents by using the 19th IEEE International Conference on Software Quality, Reliability and Security, ontologies. In ICAART 2016 - Proceedings of the 8th International Conference on QRS-C 2019. https://doi.org/10.1109/QRS-C.2019.00072 Agents and Artificial Intelligence, Vol. 2. https://doi.org/10.5220/0005757204380445 [5] Nuria Casellas. 2011. Semantic Enhancement of Legal Information. . . Are We Up for the Challenge? VoxPopuLII (2011). [6] Winner Dominic Chawinga, Chaupe, Sellina Khumbo Kapondera, A SOME DEFINITIONS George Theodore Chipeta, Felix Majawa, and Chimango Nyasulu. 2020;. Towards Markup. 
The markup adds what is usually called, external informa- e–judicial services in Malawi: Implications for justice delivery. 86:e12121 (2020;), 1–15. https://onlinelibrary.wiley.com/doi/epdf/10.1002/isd2.12121 tion, meaning information about the text. Legal markup for court [7] Min Yuh Day and Chao Yu Chen. 2018. Artificial intelligence for automatic judgments: case name, case number, court of hearing, date of case text summarization. In Proceedings - 2018 IEEE 19th International Conference on registration, date of judgment, judge, legal parties such as appellant Information Reuse and Integration for Data Science, IRI 2018. Institute of Electrical and Electronics Engineers Inc., 478–484. https://doi.org/10.1109/IRI.2018.00076 and respondents, lawyers, court clerks. [8] Emile de Maat, Radboud Winkels, and Tom van Engers. 2006. Automated De- Simple Structural Annotation. The word structure is used to tection of Reference Structures in Law. In Legal Knowledge and Information mean a particular general arrangement that is present in most texts. Systems. Jurix 2006: The Nineteenth Annual Conference (Frontiers in Artificial Intelligence and Applications), Tom M van Engers (Ed.), Vol. 152. IOS Press, 41–50. The simplest arrangement can be one in which the text is arranged http://www.leibnizcenter.org/docs/demaat/DeMaat-Jurix2006.pdf in paragraphs, or a text may be arranged in chapters or sections, or [9] Christopher Dozier, Ravikumar Kondadadi, Marc Light, Arun Vachher, Sriharsha Veeramachaneni, and Ramdev Wudali. 2010. Named entity recognition and even more generally, as having the three main parts of introduction, resolution in legal text. In Lecture Notes in Computer Science (including subseries a body and a conclusion. These structural components follow a Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 6036 tree-like hierarchy. LNAI. https://doi.org/10.1007/978-3-642-12837-0_2 [10] R.R. Favretti, F. Tamburini, and E. Martelli. 2007. 
Words from Bononia Legal Complex Structural Annotation. In this sense, structure is de- Corpus. International Journal of Corpus Linguistics 6, 1 (2007). https://doi.org/ pendent on the nature of the text. For example, a case judgment 10.1075/ijcl.6.3.03ros typically has portions of text in which the facts of the case are [11] John Garofalakis, Konstantinos Plessas, Athanasios Plessas, and Panoraia Spiliopoulou. 2019. Modelling Legal Documents for Their Exploitation as presented, followed by proceedings or the history of the case, e.g., Open Data. In Lecture Notes in Business Information Processing, Vol. 353. https: previous rulings, a discussion of the relevant points of law and the //doi.org/10.1007/978-3-030-20485-3_3 [12] Patrizia GIAMPIERI. 2019. the Bolc for Legal Translations: a Trial Lesson. Com- a conclusion for the case. Structure may also mean rhetorical styles parative Legilinguistics 39 (dec 2019), 21–46. https://doi.org/10.14746/cl.2019.39.2 which are used in some part text. [13] CHENG GONG, JIUYANG TANG, SHENGWEI ZHOU, ZEPENG HAO, and JUN Legal Annotations. The annotation in this case refers to locating WANG. 2019. Chinese Named Entity Recognition with Bert. DEStech Transactions on Computer Science and Engineering cisnrc (2019). https://doi.org/10.12783/ specific pieces of text. This can be specific words, or phrases. Usu- dtcse/cisnrc2019/33299 ally the pieces of interest appear next to each other in the text, but [14] Claire Grover, Ben Hachey, and Ian Hughson. 2004. The HOLJ Corpus. Supporting sometimes they do not. In the case of legal text, one is interested in Summarisation of Legal Texts. COLING 2004 5th International Workshop on Linguistically Interpreted Corpora (2004). (a) legal terminology; (b) citations to laws and statues; (c) citations of [15] Ben Hachey and Claire Grover. 2004. A rhetorical status classifier for legal text other cases. summarisation. In In Proceedings of the ACL-2004 Text Summarization Branches Out Workshop. 
Legal Resolution. Annotations with case citations or law citations [16] Chu Ren Huang and Yao Yao. 2015. Corpus Linguistics. In International Encyclo- need to be standardised so that documents can be hyperlinked. pedia of the Social & Behavioral Sciences: Second Edition. Elsevier Inc., 949–953. Legal Classification. This usually refers to a semantic arrange- https://doi.org/10.1016/B978-0-08-097086-8.52004-2 [17] Peter Jackson, Khalid Al-Kofahi, Alex Tyrrell, and Arun Vachher. 2003. Informa- ment of the text into a predefined list of categories according to tion extraction from case law and retrieval of prior cases. In Artificial Intelligence, a pre-established criteria. For example, court judgments can be Vol. 150. https://doi.org/10.1016/S0004-3702(03)00106-1 classified according to a court taxonomy, e.g., e.g., civil cases versus [18] Binart Kachule and Amelia Taylor. 2018. Understanding the Factors affecting the Utilisation of the Case Management System of the Malawi Judiciary Conference: criminal cases vs. commercial cases. Some classification criteria are EGPA 2018, EGPA study group XVIII on justice and court administrationAt: not linked to a taxonomy, e.g., one can classify court judgments Lausanne, Switzerland. [19] Marios Koniaris, George Papastefanatos, and Ioannis Anagnostopoulos. 2018. based on the type of crime it mostly deals with say theft versus Solon: A holistic approach for modelling, managing and mining legal sources. homicide. Algorithms 11, 12 (dec 2018). https://doi.org/10.3390/a11120196 Topic Extraction. Topic extraction attempts to discover the most [20] Marios Koniaris, George Papastefanatos, and Yannis Vassiliou. 2016. Towards automatic structuring and semantic indexing of legal documents. In ACM In- important or relevant keywords in documents. so for example, one ternational Conference Proceeding Series. Association for Computing Machinery. 
would use this to check if the text at hand contains health advice or https://doi.org/10.1145/3003733.3003801 a football match commentary. It is common to use topic extraction [21] Paola Mariani and Costanza Badii. 2005. Methods and techniques for building a digital historic-law dictionary. In Proceedings of the International Conference on in order to classify documents. Artificial Intelligence and Law. 230–231. https://doi.org/10.1145/1165485.1165523 (Un)Structured Legal Text Legal text is by nature quite well or- [22] James C Phillips and Jesse Egbert. 2017. Advancing Law and Corpus Linguistics: Importing Principles and Practices from Survey and Content-Analysis Method- ganise internally, however, by structured legal text we mean text ologies to Improve Corpus Design and Analysis. Brigham Young University Law that contains some or all of the above. Unstructured legal text are Review 2017, 6 (2017). doc, pdf, scanned images of such documents that apart from being [23] Gianluca Pontrandolfo. 2012. Legal Corpora: an overview. stored electronically, do not contain any of the above. NLLP @ KDD 2020, August 24th, San Diego, US Amelia V. Taylor B CORPUS FILES EXAMPLES Table 4: Keywords for extracting legal parties generated from the heading of judgments Modifiers Legal Functions Case Parties Chief Reporter Appellant .... Senior Advocate Respondent Principal Interpreter Applicant <title type="main">Elizabeth Bonomali Vs The State Acting Magistrate Accused Criminal Appeal Case No 7 of 2010 Legal Aid Justice Defendant Deputy Prosecutor State .... Resident Clerk Convict Principal Recording Officer Republic Official Judge Plaintiff Deputy Lawyer Coram IN THE HIGH COURT OF MALAWI Court Principal Witness PRINCIPAL REGISTRY Honourable Republic Acting Counsel .... 
CRIMINAL APPEAL CASE NO 7 OF 2010
ELIZABETH BONOMALI
THE REPUBLIC
HON JUSTICE J M CHIRWA
Mr Lemucha of Counsel for the State
Chipembere of Counsel for the Accused
N Nyirenda Official Interpreter

The Appellant, Elizabethe Bonomali, was convicted after a full trial of the offence of unlawful wounding contrary to Section 214 (a) of the Penal Code and sentenced to 12 months' imprisonment with hard labour by the First Grade Magistrate's court at Dalton Road, Limbe, on the 25th day of February, 2010. She has appealed to this Court against both the conviction and sentence.

When the Appeal came up for hearing on the 26th day of March 2010 the Appellant indicated that she had abandoned her appeal against the conviction and that her complaint remained against the sentence only. I thus leave the conviction endorsed by the Learned Magistrate unfettered with.
.....

Table 5: Final Merged Entities for Judgment 17 of 2010 of MWCC

paragNo   Merged Entity                                                       Start   End
2         Section 214 (a) of the Penal Code                                   117     150
5         Sections 339 and 340 of the Criminal Procedure and Evidence Code    897     961
7         Sections 339 and 340 of the Criminal Procedure and Evidence Code    1983    2047
8         Sections 339 and 340 of the Criminal Procedure and Evidence Code    88      152
10        Sections 339 and 340 of the Criminal Procedure and Evidence Code    183     247
12        Section 254 of the Penal Code                                       1034    1063
13        Sections 339 and 340 of the Criminal Procedure and Evidence Code    30      94
14        Section 339 (1):                                                    0       16
15        section 283 of the Penal Code                                       477     506
15        Section 340 (1 ):                                                   517     534
15        Section 339                                                         792     815
16        sections 15 and 16                                                  22      40
16        section 283 of the Penal Code                                       287     316
17        Section 339 of the                                                  226     244
18        Section 340 of the Criminal Procedure and Evidence Code             68      123

C TYPES OF LAW CITATIONS

• References containing only the name of the law/statute
    The following offences involving dishonesty in the Penal Code are based on circumstances.... or ...the Control of Goods Act derives its procedure in criminal matters from the Criminal Procedures and Evidence Code...
• References containing labels and names of the law
    Section 11 (2) of the Supreme Court of Appeal Act. or Section 283 of the Penal Code.
• References containing labels and abbreviations, or additional names by which a law is known (usually appearing in brackets)
    section 6 of the Control of Goods (Import and Export)
    section 4 (d) of Part II of the Schedule to Bail (Guidelines) Act
    s. 149 of CP&EC
    section 17(d) and 42 of the Liquid Fuel and Gas (Production and Supply) Act
• References containing labels, names or abbreviations, and the year or date applicable to the law
    review of section 15 of the Code: it is commonplace that the CP&EC was amended in 2010
    section 340(3) of the Proceeds of Crime Act 2002 (POCA)
• References to laws pertaining to other countries (e.g., UK laws mentioned in Malawi court judgments)
    section 145 of the New Zealand Crimes Act of 1961
    offences against the Person Act, 1861 as held in R v Dica [2004] 2 Cr. App. R. 28
• References by means of anaphors spanning more than one line, sentence or paragraph
    Section 12 of the Act...
    section of the same constitution ...
    ...in the Penal code...theft from a person (section 282(a)); theft from a dwelling house (section 282 (b))....
• References containing more than one label or number, e.g.,
    Section 2, 3 and 5 of ...

D RESULTS OF THE SPACY EXPERIMENTS

Table 6: Example of improvements in precision but not recall using the lg versus the sm spaCy model.

Model   Parag   Pos. in Parag   Entity
sm      2       181             Penal Code
sm      46      86              section 187(1
lg      51      112             section 331
lg/sm   51      127             the Penal Code
lg      73      75              Bill of
lg/sm   82      33              section 328
sm      86      313             Act
sm      86      396             Act
lg      86      157             an Act of Parliament
lg      86      228             an Act of Parliament
lg/sm   86      29              Constitution
lg/sm   86      106             Constitution
lg      86      88              section 37
sm      86      376             section 4(1
lg/sm   86      320             the Official Secrets Act
lg      90      383             an Act of Parliament
lg/sm   93      115             Freedom of Information Act 2000
sm      93      151             the Data Protection Act
lg      93      151             the Data Protection Act 1998
lg/sm   95      42              Section 356

Table 7: Number of LAW Entities retrieved using the standard spaCy model and by an enhanced method (+ EntityRuler and PhraseMatcher).

Year    spaCy   Enhanced   Merged Entities   spaCy Recall
2010    507     1,162      611               44%
2011    554     1,310      635               42%
2012    153     400        184               38%
2013    3,406   8,432      4,769             40%
2014    621     1,640      863               38%
2015    1,044   2,414      1,378             43%
2016    469     1,055      589               44%
2017    236     616        295               38%
2018    597     1,374      772               43%
2019    197     526        294               37%
TOTAL   7,784   18,929     10,390            41%

Figure 4: Example of pattern for extracting section citations for use with spaCy Entity Ruler

    patterns = [{
        "label": "SECLAW",
        "pattern": [
            {"TEXT": {"REGEX": "^[Ss](ec\.?|ection|ections)$"}},
            {"IS_DIGIT": True, "OP": "?"},
            {"ORTH": "(", "OP": "?"}, {}, {"ORTH": ")", "OP": "?"},
            {"ORTH": "(", "OP": "?"}, {}, {"ORTH": ")", "OP": "?"},
            {"LOWER": "of", "OP": "?"}]
    }]
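The merging step described in Section 5, where a section reference and the law name that follows it are joined into one citation, can be illustrated with a small standard-library sketch. This is not the pipeline used for the corpus: it deliberately swaps spaCy's Entity Ruler and Phrase Matcher for Python's re module so that it runs without any model, and the names LAW_NAMES, SECTION_RE, LINK_RE, find_spans, merge_citations and recall_percent are all hypothetical. The three-entry law list stands in for the database of Malawi law names, and the regular expression is an assumption rather than a translation of the Figure 4 pattern. The recall helper assumes that the "spaCy Recall" column of Table 7 is the spaCy count divided by the enhanced count, which is consistent with the figures in that table.

```python
import re

# Hypothetical, abbreviated lookup list; the paper's database of Malawi law
# names is not reproduced here.
LAW_NAMES = [
    "Criminal Procedure and Evidence Code",
    "Penal Code",
    "Supreme Court of Appeal Act",
]

# Section references such as "Section 214 (a)", "s. 4 (2)", "section 42(2) (f)"
# or "Sections 339 and 340". An assumed regex, not the Figure 4 pattern.
SECTION_RE = re.compile(
    r"\b(?:[Ss]ections?|[Ss]ec\.?|[Ss]\.)\s*\d+"      # label and first number
    r"(?:\s*\(\s*\w+\s*\))*"                          # sub-levels: (a), (2) (f)
    r"(?:\s*(?:,|and)\s*\d+(?:\s*\(\s*\w+\s*\))*)*"   # further numbers: ", 3 and 5"
)

# Text allowed between a section reference and the law name it belongs to.
LINK_RE = re.compile(r"\s*of\s+(?:the\s+)?")


def find_spans(text):
    """Return sorted (start, end, label) spans for section references and law names."""
    spans = [(m.start(), m.end(), "SECLAW") for m in SECTION_RE.finditer(text)]
    for name in LAW_NAMES:
        for m in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), "LAWNAME"))
    return sorted(spans)


def merge_citations(text):
    """Join a section reference, 'of (the)', and a law name into one citation."""
    spans = find_spans(text)
    merged = []
    i = 0
    while i < len(spans):
        start, end, label = spans[i]
        if label == "SECLAW" and i + 1 < len(spans):
            nxt_start, nxt_end, nxt_label = spans[i + 1]
            # Merge only when nothing but "of (the)" separates the two spans.
            if nxt_label == "LAWNAME" and LINK_RE.fullmatch(text[end:nxt_start]):
                merged.append((start, nxt_end, text[start:nxt_end]))
                i += 2
                continue
        merged.append((start, end, text[start:end]))
        i += 1
    return merged


def recall_percent(baseline_count, enhanced_count):
    """Share of the enhanced method's entities that the baseline also found."""
    return round(100 * baseline_count / enhanced_count)


if __name__ == "__main__":
    # Opening of the sample judgment in Appendix B, copied verbatim.
    paragraph = (
        "The Appellant, Elizabethe Bonomali, was convicted after a full trial of "
        "the offence of unlawful wounding contrary to Section 214 (a) of the "
        "Penal Code and sentenced to 12 months' imprisonment with hard labour"
    )
    print(merge_citations(paragraph))
    print(recall_percent(507, 1162))
```

Run on the opening of the Appendix B sample, the sketch yields a single merged span at offsets 117 to 150 with the text "Section 214 (a) of the Penal Code", which agrees with the first row of Table 5, and a recall of 44 for the 2010 counts in Table 7. A token-based matcher would be more robust than a character-level regex to the spacing variations discussed in Section 5; the regex is used here only to keep the example dependency-free.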