=Paper=
{{Paper
|id=Vol-2143/paper5
|storemode=property
|title=Exploring the Use of Text Classification in the Legal Domain
|pdfUrl=https://ceur-ws.org/Vol-2143/paper5.pdf
|volume=Vol-2143
|authors=Octavia-Maria Şulea,Marcos Zampieri,Shervin Malmasi,Mihaela Vela,Liviu P. Dinu,Josef van Genabith
|dblpUrl=https://dblp.org/rec/conf/icail/SuleaZMVDG17
}}
==Exploring the Use of Text Classification in the Legal Domain==
Octavia-Maria Şulea (University of Bucharest, Romania), Marcos Zampieri (University of Wolverhampton, United Kingdom), Shervin Malmasi (Harvard Medical School, United States), Mihaela Vela (Saarland University, Germany), Liviu P. Dinu (University of Bucharest, Romania), Josef van Genabith (Saarland University and DFKI, Germany)

In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK. Copyright © 2017 held by the authors. Copying permitted for private and academic purposes. Published at http://ceur-ws.org

Abstract

In this paper, we investigate the application of text classification methods to support law professionals. We present several experiments applying machine learning techniques to predict with high accuracy the ruling of the French Supreme Court and the law area to which a case belongs. We also investigate the influence of the time period in which a ruling was made on the form of the case description, and the extent to which we need to mask information in a full case ruling to automatically obtain training and test data that resemble case descriptions. We developed a mean probability ensemble system combining the output of multiple SVM classifiers. We report results of 98% average F1 score in predicting a case ruling, 96% F1 score for predicting the law area of a case, and 87.07% F1 score on estimating the date of a ruling.

1 Introduction

Text classification methods have been successfully applied to a number of NLP tasks and applications, ranging from plagiarism [2] and pastiche detection [6] to estimating the period in which a text was published [18]. In this paper we discuss the application of text classification methods in the legal domain, which, to the best of our knowledge, is relatively under-explored: to date, their application has been mostly restricted to forensics [5].

In this paper we argue that law professionals would greatly benefit from the type of automation provided by machine learning. This is particularly the case for legal research, more specifically the preparation a legal practitioner has to undertake before initiating or defending a case. The objective of the research reported in this paper is the following: given a case, law professionals have to make complex decisions, including which area of law applies to a given case, what the ruling might be, which laws apply to the case, etc. Given the data available on previous court rulings, is it possible to train text classification systems that are able to predict some of these decisions, given a textual “draft” case description provided by the professional? Such a system could act as a decision support system, or at least as a sanity check, for law professionals.

At present, law professionals have access to court ruling data through search portals (for example, the German portal https://www.juris.de/jportal/index.jsp) and keyword-based search. In our work we want to go beyond this: instead of keyword-based search, we use the full “draft” case description and text classification methods. For this purpose we acquire a large corpus of French court rulings with over 126,000 documents, spanning from the 1800s until the present day. We explore the use of lexical features and Support Vector Machine (SVM) ensembles to predict the law area and the ruling, and to estimate the date of the ruling. We compare the results of our method to those reported by a previous study [24] which used the same data. Finally, we also investigate how much of the final case description attached to the judge's ruling needs to be masked to obtain a synthetic draft description, close to what a lawyer would have at their disposal, and how predictable the ruling is based on this description. All results reported in this paper are in fact predictions based on these synthetic draft case descriptions, where what is to be predicted is masked in the training and test data and in their feature representations.

2 Related Work

While text classification methods have been investigated and applied with commercial or forensic goals in mind in other areas (e.g. serving better content or products to users through user profiling [23] and sentiment analysis, or identifying potential criminals [25], crimes [21], or anti-social behavior [4]), an area where these methods have been under-explored, although both commercial and forensic interests exist, is the legal domain.

Assuming that argumentation plays an important role in law practice, [19] investigate to what extent one can automatically identify argumentative propositions in legal text, as well as their argumentative function and structure. They use a corpus containing legal texts extracted from the European Court of Human Rights (ECHR) and classify argumentative vs. non-argumentative sentences with an accuracy of 80%.
Based on the association between a legal text and its domain label in a database of legal texts, [3] present a classification approach to identify the relevant domain to which a specific legal text belongs. Using TF-IDF weighting and Information Gain for feature selection, and SVM for classification, [3] attain an F1 measure of 76% for the identification of the domains related to a legal text and 97.5% for the correct classification of a text into a specific domain.

Following the observation of a thematic structure in Canadian court rulings, where the intro, context, reasoning, and conclusion were found to be independent of the ruling itself, [8] present an automatic summarization approach for court rulings. [9] introduce a hybrid summarization system for legal text which combines hand-crafted knowledge base rules with already existing automatic summarization techniques.

[11] proposed a system for classifying sentences for the task of summarizing court rulings and, using SVM and Naive Bayes applied to bag-of-words, TF-IDF, and dense features (e.g. the position of a sentence in the document), obtained 65% F1 on 7 classes. Similarly, another study [10] used BOW, POS tags, and TF-IDF to classify legal text into 3,000 categories, based on a taxonomy of legal concepts, and reported 64% and 79% F1.

For court ruling prediction, the task closest to our present work, a few papers have been published: [12], using extremely randomized trees, reported 70% accuracy in predicting the US Supreme Court's behavior, and, more recently, [26] tackled the task of predicting patent litigation and time to litigation (TTL), obtaining a lower-than-baseline 19% F1 for predicting the litigation outcome, but a remarkable 87% F1 for TTL prediction when the interval considered was less than 4 years, and only 43% F1 when the interval considered was narrowed down to less than a year. Among the most recent studies, [1] proposed a computational method to predict decisions of the European Court of Human Rights (ECHR) and [24] applied linear SVM classifiers to predict the decisions of the French Supreme Court using the same dataset presented in this paper.

As evidenced in this section, predicting court rulings is a new area for text classification methods, and our paper contributes in this direction, achieving performance substantially higher than previous work [24].
3 Corpus and Data Preparation

In this paper, we use the diachronic collection of court rulings from the French Supreme Court, the Court of Cassation (Cour de cassation). The complete collection, acquired from https://www.legifrance.gouv.fr, contains 131,830 documents, each consisting of a unique court ruling including metadata formatted in XML. Common metadata available in most documents include: law area, time stamp, case ruling (e.g. cassation, rejet, non-lieu, etc.), case description, and cited laws. We use the metadata provided as “natural” labels to be predicted by the machine learning system. In order to simulate realistic test scenarios, we automatically remove all mentions from the training and test data that explicitly refer to our target prediction classes.

During pre-processing, we removed all duplicate and incomplete entries in the dataset. This resulted in a corpus comprising 126,865 unique court rulings. Each instance contains a case description and four different types of labels: a law area, the date of the ruling, the case ruling itself, and a list of articles and laws cited within the description.

Taking the results by [24], henceforth Şulea et al. (2017), as a baseline, in this paper we tackle 3 tasks:

1. Predicting the law area of cases and rulings (Section 5.1).
2. Predicting the court ruling based on the case description (Section 5.2).
3. Estimating the time span when a case description and a ruling were issued (Section 5.3).

Deciding which labels to use in each experiment was not trivial, as this information was very often not explicit in the instances of the dataset, and the distribution of instances across the classes was very unbalanced and sometimes inconsistent. For this reason, we follow the decisions taken by Şulea et al. (2017) and summarize them next.

For task 1, predicting the law area of cases and rulings, out of 17 initial unique labels, the 8 labels that appeared in the corpus more than 200 times were kept. Table 1 shows the distribution of cases among each label.

Table 1. Distribution of cases according to the law area.

| Law Area | # of cases |
|---|---|
| CHAMBRE_SOCIALE | 33,139 |
| CHAMBRE_CIVILE_1 | 20,838 |
| CHAMBRE_CIVILE_2 | 19,772 |
| CHAMBRE_CRIMINELLE | 18,476 |
| CHAMBRE_COMMERCIALE | 18,339 |
| CHAMBRE_CIVILE_3 | 15,095 |
| ASSEMBLEE_PLENIERE | 544 |
| CHAMBRE_MIXTE | 222 |

For task 2, ruling prediction, we carry out two sets of experiments. A first set of experiments (6-class setup) considers only the first word within each label and only those labels which appeared more than 200 times in the corpus. This led to an initial set of 6 unique labels: cassation, annulation, irrecevabilite, rejet, non-lieu, and qpc (question prioritaire de constitutionnalité). In the second set of ruling prediction experiments (8-class setup), we consider all labels which had over 200 dataset entries, and this time we did not reduce them to their first word, as shown in Table 2.

Table 2. Distribution of cases according to ruling type.

| First-word ruling (6-class setup) | # of cases |
|---|---|
| rejet | 68,516 |
| cassation | 53,813 |
| irrecevabilite | 2,737 |
| qpc | 409 |
| annulation | 377 |
| non-lieu | 246 |

| Full ruling (8-class setup) | # of cases |
|---|---|
| cassation | 37,659 |
| cassation sans renvoi | 2,078 |
| cassation partielle | 9,543 |
| cassation partielle sans renvoi | 1,015 |
| cassation partielle cassation | 1,162 |
| cassation partielle rejet cassation | 906 |
| rejet | 67,981 |
| irrecevabilite | 2,376 |

Finally, in task 3, we investigate whether the text of the case description contains indicators of the period when it was written, a popular NLP task called temporal text classification, recently addressed by a SemEval task [22]. Due to the small number of cases per decade before 1960, we grouped all cases dated 1959 and before into a single class, and we run the temporal text classification experiments with 7 classes. Table 3 shows the distribution of cases per decade.

Table 3. Distribution of cases in seven time intervals.

| Time Span | # of cases |
|---|---|
| Until 1959 | 201 |
| 1960 - 1969 | 4,797 |
| 1970 - 1979 | 23,964 |
| 1980 - 1989 | 18,233 |
| 1990 - 1999 | 16,693 |
| 2000 - 2009 | 12,577 |
| 2010 - 2016 | 4,541 |
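To make the label selection concrete, here is a minimal sketch (our illustration, not the authors' code) of the reduction just described; the `cases` list, the function name, and the threshold handling are assumptions based on the description in this section.

```python
# Hypothetical sketch of the ruling-label selection described above.
# `cases` is assumed to be a list of (case_description, ruling_label) pairs.
from collections import Counter

def select_ruling_labels(cases, first_word_only=True, min_count=200):
    """6-class setup: reduce each label to its first word (e.g. "cassation
    partielle sans renvoi" -> "cassation"); 8-class setup: keep full labels.
    In both setups, keep only labels with more than min_count instances."""
    reduce_label = (lambda lab: lab.split()[0]) if first_word_only else (lambda lab: lab)
    reduced = [(text, reduce_label(label)) for text, label in cases]
    counts = Counter(label for _, label in reduced)
    return [(text, label) for text, label in reduced if counts[label] > min_count]
```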
For all three tasks we eliminated the occurrence of each word of the label from the text of the corresponding case description, following the methodology described in Şulea et al. (2017). For task 1, law area prediction, we eliminated all words contained in the label. For predicting the ruling, we eliminated the ruling words themselves from all case descriptions. Aiming at a complete masking of the ruling, we additionally looked at the top 20 most important features of each class to investigate whether some of them could be directly linked to the target label. In this step, we realized that the label was present both in its nominal form (e.g. cassation, irrecevabilite) and in its verbal form (e.g. casse, casser), and we eliminated both. For the task of predicting the century and decade in which a particular ruling took place, we eliminated all digits from the case description text, even though some of the digits referred to cited laws.
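The paper does not publish its masking code; the sketch below shows one plausible reading of the step just described, where `mask_description`, the word lists, and the regular expressions are our own assumptions rather than the authors' implementation.

```python
# Hedged sketch of the masking step: remove give-away label words (nominal and
# verbal forms) and, for the temporal task, all digits. Names are hypothetical.
import re

def mask_description(text, giveaway_words, strip_digits=False):
    """Return a synthetic draft description with target-label cues removed."""
    for word in giveaway_words:
        # Whole-word, case-insensitive removal of each give-away form.
        text = re.sub(r"(?i)\b" + re.escape(word) + r"\b", " ", text)
    if strip_digits:
        # The temporal task removes every digit, even those in cited laws.
        text = re.sub(r"\d", " ", text)
    # Collapse the whitespace left behind by the deletions.
    return re.sub(r"\s+", " ", text).strip()

# Example: masking a ruling expressed in both nominal and verbal form.
draft = mask_description("La Cour casse et annule l'arret rendu le 3 mai 1985 ...",
                         ["cassation", "casse", "casser", "annulation", "annule"],
                         strip_digits=True)
```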
4 Methodology

We approach the three tasks using a system based on classifier ensembles. Classifier ensembles have proven to achieve high performance in many classification tasks such as grammatical error detection [27], complex word identification [15], identifying self-harm risk in mental health forums [16], and dialect identification [17].

There are many types of classifier ensembles; in this work we apply a mean probability classifier. The method works by adding the probability estimates for each class together and assigning the class label with the highest average probability as the prediction. By using probability outputs in this way, a classifier's support for the true class label is taken into account even when it is not the predicted label (e.g. it could have the second highest probability). This method is simple and has been shown to work well on a wide range of problems. It is intuitive, stable [14], and resilient to estimation errors [13], making it one of the most robust combiners described in the literature.

As features, our system uses word unigrams and word bigrams. To evaluate the success of our method, we compare the results obtained by the mean probability ensemble system with the results reported in Şulea et al. (2017), who approach the three tasks described in this paper using the scikit-learn implementation [20] of the LIBLINEAR SVM classifier [7] trained on bags of words and bags of bigrams.

Finally, as to the evaluation, we employ a stratified 10-fold cross-validation setup for all experiments. We chose this approach to be able to compare our results with those reported by Şulea et al. (2017) and also to take the inherent imbalance of the classes present in the dataset into account. We report results in terms of average precision, recall, F1 score, and accuracy over all classes.
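As a rough, self-contained illustration of this setup, the sketch below combines calibrated LinearSVC members over unigram and bigram counts with the mean probability rule and evaluates them with stratified 10-fold cross-validation. The exact member configurations are our assumption, since the paper does not enumerate the classifiers in its ensemble.

```python
# Illustrative mean probability ensemble with scikit-learn; the three member
# configurations below are assumptions, not the authors' exact setup.
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def make_members():
    # LinearSVC has no predict_proba, so each member is wrapped in a
    # calibration layer that turns decision values into probabilities.
    return [
        make_pipeline(CountVectorizer(ngram_range=(1, 1)),
                      CalibratedClassifierCV(LinearSVC())),   # word unigrams
        make_pipeline(CountVectorizer(ngram_range=(2, 2)),
                      CalibratedClassifierCV(LinearSVC())),   # word bigrams
        make_pipeline(CountVectorizer(ngram_range=(1, 2)),
                      CalibratedClassifierCV(LinearSVC())),   # both combined
    ]

def mean_probability_predict(members, X_train, y_train, X_test):
    """Fit each member, average their per-class probabilities on the test
    texts, and predict the class with the highest mean probability."""
    probas, classes = [], None
    for member in members:
        member.fit(X_train, y_train)
        probas.append(member.predict_proba(X_test))
        classes = member.classes_  # identical ordering across fitted members
    return classes[np.mean(probas, axis=0).argmax(axis=1)]

def cross_validate(texts, labels, n_splits=10):
    """Stratified k-fold evaluation, reporting the average weighted F1."""
    texts, labels = np.asarray(texts), np.asarray(labels)
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(texts, labels):
        preds = mean_probability_predict(make_members(),
                                         texts[train_idx], labels[train_idx],
                                         texts[test_idx])
        scores.append(f1_score(labels[test_idx], preds, average="weighted"))
    return float(np.mean(scores))
```

Averaging probabilities rather than hard votes lets a member's second-choice support count toward the final decision, which is exactly the property of the mean probability rule highlighted above.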
5 Results

5.1 Law Area

In our first experiment, we trained our system to predict the law area of a case, given its case description preprocessed as described in Section 3 (i.e. removing all “give-away” references in the original data to simulate a realistic draft case description scenario, where the prediction, here the law area, is not already preempted). Table 4 reports the average precision, recall, F1 score, and accuracy obtained by our method when discriminating between the aforementioned 8 classes, each of them containing at least 200 instances. The scores reported by Şulea et al. (2017) using the same dataset are presented for comparison.

Table 4. Classification results for law area prediction.

| Model | P | R | F1 | Acc. |
|---|---|---|---|---|
| Ensemble | 96.8% | 96.8% | 96.5% | 96.8% |
| Şulea et al. (2017) | 90.9% | 90.2% | 90.3% | 90.2% |

We observe that the ensemble method outperforms the linear SVM classifier by a large margin: 96.8% accuracy compared to the 90.2% reported by Şulea et al. (2017). We investigate the performance of the ensemble system for each individual class by looking at the confusion matrix presented in Figure 1.

[Figure 1. Confusion matrix for law area prediction over the eight chamber classes.]

The confusion matrix in Figure 1 shows that cases from the chambre mixte are the most difficult to predict. This is firstly because this class and assemblee pleniere, the second most difficult class to predict, contain the two lowest numbers of instances in the dataset (222 and 544 respectively), and secondly because, by its nature, the chambre mixte received mixed cases from other courts, such as civil and commercial.

5.2 Case Ruling

The results for the second task, court ruling prediction, are presented in Table 5. We report the results obtained in both experimental setups, the 6-class setup and the 8-class setup. The mean probability ensemble once again outperforms the method by Şulea et al. (2017) in both settings. We observe a 2.9 percentage point decrease in absolute average F1 score when the ensemble classifier is trained on the dataset with more classes, which is explained by the increase in the number of classes from 6 to 8 leading to a more challenging classification scenario.

Table 5. Classification results for ruling prediction.

| Classes | Model | P | R | F1 | Acc. |
|---|---|---|---|---|---|
| 6 | Ensemble | 98.6% | 98.6% | 98.6% | 98.6% |
| 6 | Şulea et al. (2017) | 97.1% | 96.9% | 97.0% | 96.9% |
| 8 | Ensemble | 95.9% | 96.2% | 95.8% | 96.2% |
| 8 | Şulea et al. (2017) | 93.2% | 92.8% | 92.7% | 92.8% |

To better understand the difficulties faced by our method in discriminating between the ruling classes, we first looked at the list of the most informative unigrams for each class. We found a few clear cases of top-ranked words that are related to the target class, but even so the analysis did not go very far, indicating that a more interesting analysis is not possible without the aid of an expert in French law.

Subsequently, we looked at the confusion matrix of the predictions. In Figure 2 we present the confusion matrix of the performance obtained for each individual class in the 6-class setup experiment. We observe that the two most difficult classes for the system were non-lieu and annulation. These two classes are also the two classes which contained the smallest number of examples, which probably led to the poor performance of the classifier in identifying instances from these classes.

[Figure 2. Confusion matrix for ruling prediction in the 6-class setup (classes: annulation, cassation, irrecevabilite, non-lieu, qpc, rejet).]

5.3 Temporal Text Classification

Finally, in Table 6 we present the results obtained in the third set of experiments described in this paper, predicting the time span of cases and rulings in a 7-class setting. Again, all data was preprocessed as indicated in Section 3. The mean probability ensemble system achieved an 87.0% F1 score against the 73.2% reported by Şulea et al. (2017).

Table 6. Classification results for temporal prediction.

| Model | P | R | F1 | Acc. |
|---|---|---|---|---|
| Ensemble | 87.3% | 87.0% | 87.0% | 87.0% |
| Şulea et al. (2017) | 75.9% | 74.3% | 73.2% | 74.3% |

The results obtained by the ensemble system in this experiment outperform the method by Şulea et al. (2017) by a large margin. This outcome once again confirms the robustness of classifier ensembles for many text classification tasks, including those presented in this paper.

The results obtained by our system in the temporal text classification task suggest that classifier ensembles are a good fit for predicting the publication date not only of legal texts but of other types of texts as well. This is a particularly relevant application for researchers in the digital humanities, who often work with manuscripts of unknown or uncertain publication date. The use of ensembles for this task is, to the best of our knowledge, under-explored and should be investigated further.

It should be noted, however, that the predictions in this experiment are only estimates, as the definition of time spans in units such as months, years, or decades (in the case of this paper) is arbitrary. Previous work in temporal text classification stressed that supervised methods, such as the one presented in this paper, fail to capture the linearity of time [18, 28]. Other methods, such as ranking or regression, could be applied to obtain more accurate predictions.
6 Conclusions and Future Work

In this paper we investigated the application of text classification methods to the legal domain using the cases and rulings of the French Supreme Court. We showed that a system based on SVM ensembles can obtain high scores in predicting the law area and the ruling of a case, given the case description, as well as the time span of cases and rulings. The ensemble method presented in this paper outperformed the previously proposed method of Şulea et al. (2017) using the same dataset.

We applied computational methods to mask the case descriptions attached to judges' rulings so that they convey as little information as possible about the ruling. This simulates the knowledge a lawyer would have prior to entering court.

The work presented in this paper confirms that text classification techniques can indeed serve as the basis of valuable assistive technology to support law professionals in obtaining guidance and orientation from large corpora of previous court rulings. In future work, we would like to investigate the extent to which a more accurate draft form can be induced from the court's case description.

Acknowledgements

Parts of this work were carried out while the first and second authors, Octavia-Maria Şulea and Marcos Zampieri, were at the German Research Center for Artificial Intelligence (DFKI). We would like to thank the anonymous reviewers for providing us with constructive feedback and suggestions.

References

[1] Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro, and Vasileios Lampos. 2016. Predicting Judicial Decisions of the European Court of Human Rights: A Natural Language Processing Perspective. PeerJ Computer Science 2 (2016), e93.
[2] Alberto Barrón-Cedeño, Marta Vila, M. Antònia Martí, and Paolo Rosso. 2013. Plagiarism Meets Paraphrasing: Insights for the Next Generation in Automatic Plagiarism Detection. Computational Linguistics 39, 4 (2013), 917–947.
[3] Guido Boella, Luigi Di Caro, and Llio Humphreys. 2011. Using Classification to Support Legal Knowledge Engineers in the Eunomos Legal Document Management System. In Proceedings of JURISIN.
[4] Justin Cheng, Cristian Danescu-Niculescu-Mizil, and Jure Leskovec. 2015. Antisocial Behavior in Online Discussion Communities. In Proceedings of ICWSM.
[5] Olivier De Vel, Alison Anderson, Malcolm Corney, and George Mohay. 2001. Mining E-mail Content for Author Identification Forensics. ACM SIGMOD Record 30, 4 (2001), 55–64.
[6] Liviu P. Dinu, Vlad Niculae, and Octavia-Maria Şulea. 2012. Pastiche Detection Based on Stopword Rankings: Exposing Impersonators of a Romanian Writer. In Proceedings of the Workshop on Computational Approaches to Deception Detection.
[7] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
[8] Atefeh Farzindar and Guy Lapalme. 2004. Legal Text Summarization by Exploration of the Thematic Structures and Argumentative Roles. In Proceedings of the Text Summarization Branches Out Workshop.
[9] Filippo Galgani, Paul Compton, and Achim Hoffmann. 2012. Combining Different Summarization Techniques for Legal Text. In Proceedings of the Hybrid Workshop.
[10] Teresa Gonçalves and Paulo Quaresma. 2005. Evaluating Preprocessing Techniques in a Text Classification Problem. In Proceedings of the Conference of the Brazilian Computer Society.
[11] Ben Hachey and Claire Grover. 2006. Extractive Summarisation of Legal Texts. Artificial Intelligence and Law 14, 4 (2006), 305–345.
[12] Daniel Martin Katz, Michael J. Bommarito II, and Josh Blackman. 2014. Predicting the Behavior of the Supreme Court of the United States: A General Approach. CoRR abs/1407.6333 (2014). http://arxiv.org/abs/1407.6333
[13] Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas. 1998. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 3 (1998), 226–239.
[14] Ludmila I. Kuncheva. 2014. Combining Pattern Classifiers: Methods and Algorithms (second ed.). Wiley.
[15] Shervin Malmasi, Mark Dras, and Marcos Zampieri. 2016. LTG at SemEval-2016 Task 11: Complex Word Identification with Classifier Ensembles. In Proceedings of SemEval.
[16] Shervin Malmasi, Marcos Zampieri, and Mark Dras. 2016. Predicting Post Severity in Mental Health Forums. In Proceedings of the CLPsych Workshop.
[17] Shervin Malmasi, Marcos Zampieri, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, and Jörg Tiedemann. 2016. Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task. In Proceedings of the VarDial Workshop.
[18] Vlad Niculae, Marcos Zampieri, Liviu P. Dinu, and Alina Maria Ciobanu. 2014. Temporal Text Ranking and Automatic Dating of Texts. In Proceedings of EACL.
[19] Raquel Mochales Palau and Marie-Francine Moens. 2009. Argumentation Mining: The Detection, Classification and Structure of Arguments in Text. In Proceedings of ICAIL.
[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (Oct 2011), 2825–2830.
[21] Verónica Pérez-Rosas and Rada Mihalcea. 2015. Experiments in Open Domain Deception Detection. In Proceedings of EMNLP. http://aclweb.org/anthology/D/D15/D15-1133.pdf
[22] Octavian Popescu and Carlo Strapparava. 2015. SemEval 2015, Task 7: Diachronic Text Evaluation. In Proceedings of SemEval.
[23] O. Roozmand, N. Ghasem-Aghaee, M. A. Nematbakhsh, A. Baraani, and G. J. Hofstede. 2011. Computational Modeling of Uncertainty Avoidance in Consumer Behavior. International Journal of Research and Reviews in Computer Science (April 2011), 18–26.
[24] Octavia-Maria Şulea, Marcos Zampieri, Mihaela Vela, and Josef van Genabith. 2017. Predicting the Law Area and Decisions of French Supreme Court Cases. In Proceedings of Recent Advances in Natural Language Processing (RANLP).
[25] Chris Sumner, Alison Byers, Rachel Boochever, and Gregory J. Park. 2012. Predicting Dark Triad Personality Traits from Twitter Usage and a Linguistic Analysis of Tweets. In Proceedings of ICMLA. DOI: http://dx.doi.org/10.1109/ICMLA.2012.218
[26] Papis Wongchaisuwat, Diego Klabjan, and John O. McGinnis. 2016. Predicting Litigation Likelihood and Time to Litigation for Patents. arXiv preprint arXiv:1603.07394 (2016).
[27] Yang Xiang, Xiaolong Wang, Wenying Han, and Qinghua Hong. 2015. Chinese Grammatical Error Diagnosis Using Ensemble Learning. In Proceedings of the NLP-TEA Workshop.
[28] Marcos Zampieri, Shervin Malmasi, and Mark Dras. 2016. Modeling Language Change in Historical Corpora: The Case of Portuguese. In Proceedings of LREC.