A Machine Learning Approach to Identifying Sections in Legal Briefs

Scott Vanderbeck and Joseph Bockhorst
Dept. of Elec. Eng. and Computer Science
University of Wisconsin - Milwaukee
P.O. Box 784, 2200 E. Kenwood Blvd.
Milwaukee, WI 53201-0784

Chad Oldfather
Marquette University Law School
P.O. Box 1881
Milwaukee, WI 53201-1881

Abstract

With an abundance of legal documents now available in electronic format, legal scholars and practitioners are in need of systems able to search and quantify semantic details of these documents. A key challenge facing designers of such systems, however, is that the majority of these documents are natural language streams lacking formal structure or other explicit semantic information. In this research, we describe a two-stage supervised learning approach for automatically identifying section boundaries and types in appellee briefs. Our approach uses learned classifiers in a two-stage process to categorize white-space separated blocks of text. First, we use a binary classifier to predict whether or not a text block is a section header. Next, we classify those blocks predicted to be section headers in the first stage into one of 19 section types. A cross-validation experiment shows our approach has over 90% accuracy on both tasks, and is significantly more accurate than baseline methods.

Introduction

Now that most of the briefs, opinions and other legal documents produced by court systems are routinely encoded electronically and widely available in online databases, there is interest throughout the legal community in computational tools that enable more effective use of these resources. Document retrieval from keyword or Boolean searches is a key task that has long been a focus of natural language processing (NLP) algorithms for the legal domain.
However, the simple whole-document word-count representations and document similarity measures that are typically employed for retrieval limit their relevance to a relatively narrow set of tasks. Practicing attorneys and legal academics are finding that the existing suite of tools falls short of meeting their growing and complex information needs.

Consider, for example, Empirical Legal Studies (ELS), a quickly growing area of legal scholarship that aims to apply quantitative, social-science research methods to questions of law. ELS research studies are increasingly likely to have a component that involves computational processing of large collections of legal documents. One example is studies of the role of ideological factors that assign an ideology value to legal briefs (e.g., conservative or liberal (Evans et al. 2006)). One problem that may arise in settings like this that employ a general similarity measure not tailored to the task at hand is that documents are more likely to group by topics, for instance the type of law, than by, say, ideology.

One general technique that has the potential to improve performance on a wide range of ELS and retrieval tasks is to vary the influence of different sections of a document. For example, studies on ideology may reduce the influence of content in the "Statement of Facts" section while increasing the influence of the "Argument" section. However, although most briefs have similar types of sections, there are no formal standards for easily extracting them. Computational techniques are needed. Toward that end, we describe here a machine learning approach to automatically identifying sections in legal briefs.

Problem Domain

Our focus here is on briefs written for appellate court cases heard by the United States Courts of Appeals. The appeals process begins when one party to a lawsuit, called the appellant, asserts that a trial court's action was defective in one or more ways by filing an appellant brief. The other party (the appellee) responds with an appellee brief, arguing why the trial court's action should stand. In turn, the appeals court provides its ruling in a written opinion. While there is good reason to investigate methods for identifying structure in all three kinds of documents, for simplicity we restrict our focus here to appellee briefs. We conduct our experiment using a set of 30 cases heard by the First Circuit in 2004.

In the federal courts, the Federal Rules of Appellate Procedure require that appellant briefs include certain sections, and that appellees include some corresponding sections while being free to omit others. There is, however, no standard as to section order or how breaks between sections are to be indicated. Moreover, parties often fail to adhere to the requirements of the rules, with the result being that authors exercise considerable discretion in how they structure and format the documents.

Related Work

Many genres of text are associated with particular conventional structures. Automatically determining all of these types of structures for a large discourse is a difficult and unsolved problem (Jurafsky & Martin 2000). Much of the previous NLP work in the legal domain concerns Information Retrieval (IR) and the computation of simple features such as word frequency (Grover et al. 2003).

Additional work has been done in the legal domain with a focus on summarizing documents. Grover et al. developed a method for automatically summarizing legal documents from the British legal system. Their method was based on a statistical classifier that categorized sentences in the order in which they might be selected as candidate text excerpts for a summary (Grover et al. 2003).

Farzindar and Lapalme (2004) also described a method for summarizing legal documents. As part of their analysis, they performed thematic segmentation on the documents. Finding that the more classic methods for segmentation (Hearst 1994; Choi 2000) did not provide satisfactory results, they developed a segmentation process based on specific knowledge of their legal documents. For their study, adjacent paragraphs were grouped into blocks of text based on the presence of section titles, relative position within the document and linguistic markers.

The classic algorithm for topic segmentation is TextTiling, in which like sentences and topics are grouped together (Hearst 1997). More general methods for topic segmentation of a document are generally based on the cohesiveness of adjacent sentences. It is possible to build lexical chains that represent the lexical cohesiveness of adjacent sentences in a document based on important content terms, semantically related references, and resolved anaphors (Moens & De Busser 2001). Lexical chains and cohesiveness can then be used to infer the thematic structure of a document.

In contrast to approaches such as these that are based on inferring the relatedness of sentences in section bodies, our approach focuses on identifying and categorizing section headers. These general approaches are complementary, as it would be relatively straightforward to construct a combined method that considers both headers and bodies.
Overview

Our analysis begins with a pre-processing step that converts documents to sequences of text blocks, roughly at the paragraph level (see below for details). We next construct feature vector representations for all blocks. Labeled training sets and supervised learning methods are used to induce two kinds of classifiers: one for distinguishing section header blocks from non-header blocks, and one for classifying the section type of headers. Figure 1 shows a flowchart of the processing for classifying a block of text in the test set. Note that although the type of non-header blocks is not predicted directly, after classification of all blocks in a document the predicted section for a non-header block is given by the type of the nearest preceding section header.

[Figure 1: Flowchart of our two-stage process for classifying text blocks. The first stage (Task I) predicts whether or not a block of text is a section header; no further processing is done on blocks classified as non-headers. Blocks classified as headers are passed to the second stage (Task II), which predicts the section type. Numbers next to the arrows denote the total number of blocks in our annotated dataset that assort to each point: 5,442 input blocks, of which 5,190 are non-headers and 252 are headers, the latter spanning 19 total section types such as introduction, argument, conclusion, and standard of review.]

Models and Methods

Dataset

Appellee briefs in our dataset are available as HTML files. The HTML is not well formed or standardized and provides little insight into the structure of the briefs. The HTML elements do not contain attributes, block-level elements, ids, classes, etc. that might indicate section breaks or section types. Further, document formatting is inconsistent and non-standardized. For example, one author may use italics for section headings, another bold, while yet another uses inline text. Formatting sometimes even varies from section to section within the same document. Thus, we ignore formatting such as italics or bold, and focus our analysis on the word and character sequence.

Preprocessing was performed to divide the documents into blocks of text. A block of text is essentially a continuous sequence of text from the original document with a line break immediately before and after. We extract blocks by converting each HTML document to an XML document that preserves all of the line breaks and white space from the original HTML. Examples of document elements that correspond to blocks extracted from the XML include paragraphs, section headings, section sub-headings, footnotes, and table-of-contents entries.
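To make the block-extraction step concrete, here is a minimal Python sketch. It assumes the HTML has already been reduced to plain text with the original line breaks preserved (the paper works through an intermediate XML representation instead); the function name and the simple blank-line splitting rule are our illustration, not the authors' code.

    # Minimal sketch: split rendered text into blocks separated by blank lines.
    def extract_blocks(plain_text):
        """Return text blocks: runs of non-empty lines bounded by line breaks."""
        blocks, current = [], []
        for line in plain_text.splitlines():
            if line.strip():                 # still inside a block
                current.append(line.strip())
            elif current:                    # a blank line ends the current block
                blocks.append(" ".join(current))
                current = []
        if current:                          # flush the final block
            blocks.append(" ".join(current))
        return blocks

    sample = "II. TABLE OF CONTENTS\n\nThis appeal arises from an order\nof the district court.\n"
    print(extract_blocks(sample))
    # ['II. TABLE OF CONTENTS', 'This appeal arises from an order of the district court.']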
The XML files were manually reviewed and annotated by one of the authors (SV). Each block is assigned two class labels:

1. is header: A binary value indicating whether or not the block is a section heading.

2. section type: A discrete value that, for section headers only, indicates the section type. As we only predict the type of header blocks, the value "None" is assigned to non-headers. Table 1 shows the section types we identified in our dataset.

    Argument                          Procedural History
    Bond                              Relief Sought
    Conclusion                        Standard of Review
    Corporate Disclosure Statement    Statement of Facts
    Introduction                      Statement of Parent Companies
    Issue Presented For Review          And Public Companies
    Jurisdictional Statement          Statement of The Case
    Notice To Adverse Party           Summary of The Argument
    Prayer                            Table of Authorities
    Preliminary Statement             Table of Contents
                                      None

Table 1: The 20 section types in our dataset. Each predicted header block is classified as one of the 19 types other than "None."

Feature Vector Representation

Along with the two class labels, we represent each block of text with a 25-element vector of feature values. Table 2(a) lists the features we use; Table 2(b) shows the feature and class values for the block of text "II. TABLE OF CONTENTS".

    Feature Name     Domain                Description
    leadingAsterisk  binary                True if the block begins with an asterisk (*).
    leadingNumeral   binary                True if the block begins with an Arabic or Roman numeral (optionally preceded by an asterisk).
    endsInPeriod     binary                True if the block ends with a period (.).
    endsInNumeral    binary                True if the block ends with an Arabic or Roman numeral.
    stringLength     integer               Number of characters in the block.
    percentCaps      continuous, in [0,1]  The percentage of alpha characters that are capitalized.
    ellipses         binary                True if the block contains an ellipsis (i.e., "...").
    contains("argument"), contains("authori"), contains("case"), contains("conclusion"), contains("contents"), contains("corporate"), contains("disclosure"), contains("fact"), contains("issue"), contains("jurisdiction"), contains("of"), contains("prayer"), contains("present"), contains("review"), contains("standard"), contains("statement"), contains("summary"), contains("table")
                     binary                Each of these features is an indicator for a specific string. The feature contains(s) is true if the block contains a word that begins with the string s and false otherwise.

    leadingAsterisk: FALSE   contains("of"): TRUE
    leadingNumeral: TRUE     contains("table"): TRUE
    endsInPeriod: FALSE      contains("contents"): TRUE
    endsInNumeral: FALSE     (all other string match features): FALSE
    stringLength: 21
    percentCaps: 1           is header: TRUE
    ellipses: FALSE          section type: Table of Contents

Table 2: (a) Features we use to represent blocks of text. (b) An example showing feature and class values for the block of text "II. TABLE OF CONTENTS".

The features chosen were engineered through visual inspection of section headings, intuition, and trial and error. Other attributes were considered, such as the length and percentage of capital letters of the previous and next blocks of text; however, these did not improve model performance. The group of features named contains(s) are string-matching features, which are true if the block of text contains a word that begins with the string s. We construct a string match feature for every word that occurs five or more times in the 252 header blocks.
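To illustrate the representation in Table 2, the sketch below computes the 25 feature values for a block of text. The feature names follow Table 2(a), but details the paper leaves unspecified, such as the Roman-numeral character class and case handling, are assumptions of ours.

    import re

    # The 18 strings used by the contains(s) indicator features (Table 2(a)).
    STRINGS = ["argument", "authori", "case", "conclusion", "contents",
               "corporate", "disclosure", "fact", "issue", "jurisdiction",
               "of", "prayer", "present", "review", "standard", "statement",
               "summary", "table"]

    NUMERAL = r"(?:[0-9]+|[IVXLCDM]+)"   # Arabic or (assumed) Roman numeral

    def features(block):
        words = block.split()
        alpha = [c for c in block if c.isalpha()]
        f = {
            "leadingAsterisk": block.startswith("*"),
            "leadingNumeral": bool(re.match(r"\*?" + NUMERAL + r"\b", block)),
            "endsInPeriod": block.rstrip().endswith("."),
            "endsInNumeral": bool(re.search(NUMERAL + r"$", block.rstrip())),
            "stringLength": len(block),
            "percentCaps": sum(c.isupper() for c in alpha) / len(alpha) if alpha else 0.0,
            "ellipses": "..." in block,
        }
        for s in STRINGS:                 # one indicator feature per string
            f['contains("%s")' % s] = any(w.lower().startswith(s) for w in words)
        return f

    f = features("II. TABLE OF CONTENTS")
    print(f["stringLength"], f["percentCaps"], f['contains("table")'])  # 21 1.0 True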
Learning

The task of identifying section headers and the type of each section is divided into two steps (Figure 1). The first step classifies a block of text as either a section heading or not a section heading. For this task, supervised machine learning algorithms are used to learn a binary classifier. The second step takes each block of text classified as a heading in the first step and uses a second classifier to predict the specific type of section. Again, supervised machine learning is used to learn a classifier, this time with 19 classes. For both tasks, multiple types of classifiers, including naive Bayes, logistic regression, decision trees, support vector machines and neural networks, were considered.
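The following is a minimal sketch of this two-stage setup, assuming scikit-learn and feature dictionaries like the one above. Logistic regression is used here because it is the classifier whose results the paper reports; the function names and data layout are ours.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    def train_two_stage(blocks, is_header, section_type):
        """blocks: list of feature dicts; is_header: list of bools;
        section_type: list of labels ("None" for non-headers)."""
        vec = DictVectorizer()
        X = vec.fit_transform(blocks)

        # Stage one: binary header vs. non-header, trained on all blocks.
        stage1 = LogisticRegression(max_iter=1000).fit(X, is_header)

        # Stage two: 19-way section type, trained only on true header blocks.
        rows = [i for i, h in enumerate(is_header) if h]
        stage2 = LogisticRegression(max_iter=1000).fit(
            X[rows], [section_type[i] for i in rows])
        return vec, stage1, stage2

    def predict(vec, stage1, stage2, block_features):
        x = vec.transform([block_features])
        if not stage1.predict(x)[0]:
            return "None"               # non-header: no type is predicted
        return stage2.predict(x)[0]     # header: one of the 19 section types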
Evaluation

With the abundance of legal documents available, it is important that they be structured in ways usable by computers (Wyner 2010). We hypothesize that the task of structuring legal documents into relevant sections can be accomplished with a supervised machine learning classifier that first identifies section headers and then assigns a section type to each header.

To test this hypothesis we conducted an experiment on 30 appellee briefs from cases heard by the US 1st Circuit in 2004. No effort was made to restrict the cases to a particular area of the law, and indeed a variety of different types of cases is represented in this set. The legal briefs were obtained as HTML files through WestLaw (www.westlaw.com). In the 30 documents, a total of 252 section headers were identified. Note that subsection headers are not included as part of this task, as there is very little commonality in authors' use of subsections. Additionally, subsections are generally specific to the legal case being addressed, and not the overall document. Of the 252 total section headers, 116 unique strings were identified (not accounting for any differences in formatting or upper/lower case). Manual inspection of the 116 variations revealed that the headers cluster into the 19 different section types listed in Table 1. A 20th section type, "None", was added to be used as the class label for blocks of text that do not represent section headers.

We conducted a leave-one-case-out cross-validation experiment. That is, in each experiment all blocks from one of our documents were held out of the training set and used as test data to estimate our models' ability to generalize to unseen documents. For the first task, all blocks of text in the training set are used. For the second task, only training set blocks of text labeled as section headings are used for training. This decision was made because we only wish to use the second classifier to label the section type of true section headers. It also sidesteps the inconsistency that arises when a block of text is identified as a heading in the first stage but assigned section type "None" in the second stage. We may revisit this decision in future work, as a "None" prediction in stage two could potentially be used to catch false positives from the first stage. With the current dataset, however, the number of correctly identified headings that would be mislabeled "None" outweighed the false positives that would be corrected, so the tradeoff was not worthwhile. Therefore, we take the approach described above.

We evaluate models on the first task by the percentage of headings or non-headings correctly classified, as well as by precision and recall, where

    precision = #true positives / (#true positives + #false positives)

and

    recall = #true positives / (#true positives + #false negatives).

Note that blocks of text that are section headers represent our positive class. Precision and recall are both of particular importance for our first task. Examining our dataset, 95.4% of blocks of text are non-headings. The extreme case of classifying all blocks of text as non-headings would thus achieve very high overall accuracy and a 100% recall rate for non-headings, while identifying no headings at all.

We compare our machine learning approach to a regular expression baseline. The regular expression used for this baseline may be summarized as the concatenation of the following parts:

1. The beginning of the string
2. An optional asterisk
3. An optional Roman numeral or natural number followed by an optional period and space
4. A list of zero or more all-capitalized words
5. The end of the string

Blocks that contain a match to the regular expression are predicted to be headers. This regular expression should correctly identify many section headings, as many are entirely capitalized, while excluding false positives such as table-of-contents entries, which are generally followed by a page or section number of some form.
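The paper does not give the literal pattern, but the five parts enumerated above translate naturally into a regular expression along the following lines; the Roman-numeral character class and spacing details are our guesses.

    import re

    HEADER_RE = re.compile(
        r"^"                                 # 1. the beginning of the string
        r"\*?"                               # 2. an optional asterisk
        r"(?:(?:[IVXLCDM]+|[0-9]+)\.? ?)?"   # 3. optional Roman/Arabic numeral, optional period and space
        r"(?:[A-Z]+ ?)*"                     # 4. zero or more all-capitalized words
        r"$"                                 # 5. the end of the string
    )

    for block in ["II. TABLE OF CONTENTS", "ARGUMENT",
                  "Statement of Facts.......... 12"]:
        print(block, "->", bool(HEADER_RE.match(block)))
    # The first two blocks match; the table-of-contents entry is rejected
    # because it contains lowercase words and ends in a page number.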
Our second task is then evaluated in two ways. The first is the overall percentage of predicted headings that are assigned the correct section heading type. The second is an adjusted metric that does not penalize the second task for errors made in the first task: if the input to the second classifier was a non-heading to begin with, the classifier inherently fails, as it is attempting to determine a section heading type when no such type actually exists. Therefore, we account for this disparity in our results and also present the number of section heading types predicted correctly divided by the number of actual headings correctly classified by the first task.

A baseline approach is only considered for the first task of identifying whether or not a block of text is a section heading. A baseline for the secondary task of assigning one of our 20 class labels could be developed through a complicated regular expression or a form of sequential logic, but was not considered in this project. Our most frequent section heading type, "Argument", accounts for 12% of cases; that level of accuracy could therefore be achieved by simply always predicting "Argument".

Last, a combined metric is presented in which we merge the results from both steps of classification to determine the overall percentage of section headings that are correctly identified and assigned the correct type.

Results

Task 1 - Identifying Section Headings

A total of 5,442 blocks of text were identified in our dataset. Table 3 shows a comparison of the baseline method with our supervised machine learning approach for the task of identifying whether or not a block of text is a section heading. With the exception of naive Bayes (which performed worse), all classifiers performed similarly.

                             Baseline    Learning Based
    Total Blocks of Text:    5442        5442
    Correctly Classified:    5288        5409
    Percentage Correct:      97.2%       99.4%

Table 3: Results classifying section headings vs. non-section headings.

As expected, the baseline approach performed very well, with 97.2% accuracy. This represents a small gain over calling all blocks non-headings (95.4%). As we hypothesized, the learning-based classifier performed much better, with 99.4% accuracy. As seen in the confusion matrix in Table 4, the logistic regression classifier had a similar number of false positives and false negatives. Precision and recall statistics are presented in Table 5. As seen in the table, there is a significant difference in the recall rates for headings (92.1% vs. 61.5%), which is of great importance to the ultimate goal.

                        Learning Based            Baseline
    Actual/Predicted    Heading   Non-Heading     Heading   Non-Heading
    Heading             232       20              155       97
    Non-Heading         13        5177            57        5133

Table 4: Confusion matrix for Task 1.

                      Precision   Recall   F-Measure
    Learning Based    0.947       0.921    0.934
    Baseline          0.731       0.615    0.668

Table 5: Precision and recall of headings for the learning-based classifier vs. the baseline approach.
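Under our reading of Table 4, the learning-based figures in Tables 3 and 5 follow directly from the confusion matrix, as the arithmetic below confirms.

    tp, fn = 232, 20     # actual headings: predicted heading / non-heading
    fp, tn = 13, 5177    # actual non-headings: predicted heading / non-heading

    precision = tp / (tp + fp)                                 # 232/245 ~ 0.947
    recall = tp / (tp + fn)                                    # 232/252 ~ 0.921
    f_measure = 2 * precision * recall / (precision + recall)  # ~ 0.934
    accuracy = (tp + tn) / (tp + fn + fp + tn)                 # 5409/5442 ~ 0.994

    print(round(precision, 3), round(recall, 3),
          round(f_measure, 3), round(accuracy, 3))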
Examining incorrectly classified blocks, the most frequent error involved "Standard of Review" and accounted for 24% of all errors. Examination reveals that the "Standard of Review" is often included as a subsection of the "Argument" section of the brief by many authors, while others choose to make it a standalone section. For example, the block of text "1. STANDARD OF REVIEW" was incorrectly classified as a heading in one instance. In this case the author did not use a numbering scheme for the primary section ("Argument" in this case), but numbered the sub-sections of the document, confusing our model. Similar errors occurred for the section type "Statement of Facts" and accounted for 12% of all errors. With additional post-processing of the classifications, it may be possible to account for these types of errors, further increasing model performance.

Task 2 - Predicting Section Type

Table 6 summarizes the results of the secondary classifier that assigns section types to any block of text classified as a heading by the first task. The first task identified 245 blocks of text as headings. Of these, only 18 were assigned an incorrect section heading type, for an overall accuracy of 92.7%. However, 13 of the 18 were not actually headings to begin with, so the secondary classifier could not have assigned a correct class label. Adjusting for this, 232 blocks of text were correctly identified as headings, and of these only 5 were given an incorrect label, for an adjusted accuracy of 97.8%.

                                  Count   Correctly Labeled   Percent Correct
    Total Headings Identified     245     227                 92.7%
    Actual Headings Identified    232     227                 97.8%

Table 6: Results of the secondary classifier assigning class labels.

Combined Accuracy

Combining accuracy from each of the two tasks results in an overall recall rate of 90.1%, as seen in Table 7. Of 252 total headings, 232 were correctly identified as headings. Of those identified, 227 were assigned their correct class.

    Actual     Correctly    Recall   Correct   Overall
    Headings   Identified   Rate     Class     Recall
    252        232          92.1%    227       90.1%

Table 7: Combined accuracy for identifying and classifying section headings.

Conclusion

We presented a supervised machine learning approach for structuring legal documents into relevant sections. Our approach is based on two steps. The first step identifies blocks of text that are section headings. In the second step, blocks of text classified as section headings are input to a second classifier that predicts the section type.

We evaluated our approach with a cross-validation experiment. The first task of identifying section headers using a binary logistic regression classifier was shown to perform with 99.4% accuracy. The secondary task then determines the type of each section with 92.7% accuracy. The learning-based approach provides a 2.2% improvement in accuracy over the baseline regular expression approach and, more importantly, provides a significantly higher recall rate in identifying section headings vs. non-section headings.

While it may be possible to create a non-learning-based approach (more complex than the baseline presented) to perform the given subtask, we have shown that machine learning and NLP approaches are very well suited to this problem. This paper only examined appellee briefs, but there is ample reason to believe that this approach would provide similar results for appellant briefs, judges' written opinions, and other similar documents.

The significantly higher recall rate of our learned models relative to the baseline becomes even more important when one considers that approaches are available to correct or account for false positives (i.e., non-headings classified as headings), whereas it would be far more difficult, if even possible, to correct for false negatives (i.e., actual headings classified as non-headings).

While not formally discussed in this paper, it is possible to implement secondary logic to correct some of the classification errors we encountered. For instance, our most frequent error in the first task involved "Standard of Review". Logic could be implemented as a post-processing step that says: if a block of text is identified as a section heading and classified with the section heading type "Standard of Review", but is preceded by the section type "Argument", remove it as a section heading. In our dataset this correction would fix 5 of the 7 mistakes made labeling "Standard of Review" and improve accuracy to 99.5% for the first task and 94.6% for the second task.
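A minimal sketch of this correction rule follows, assuming per-document predictions arrive as an ordered list of (is_header, section_type) pairs; the rule mirrors the description above, but the implementation details are ours.

    def demote_standard_of_review(predictions):
        """Demote a predicted "Standard of Review" heading that directly
        follows an "Argument" section, treating it as a subsection."""
        current_section = None
        corrected = []
        for is_header, section_type in predictions:
            if (is_header and section_type == "Standard of Review"
                    and current_section == "Argument"):
                is_header, section_type = False, "None"  # remove the heading
            if is_header:
                current_section = section_type
            corrected.append((is_header, section_type))
        return corrected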
In addition, allowing the secondary classifier to assign the class label "None" could correct some false positives from the first task. In our dataset, 4 such corrections could have been made, further improving accuracy. However, when implementing this change one must weigh the implications of giving an actual section heading the type "None" against the improvement from the corrections.

We considered 20 different potential class labels for each section. For specific tasks it may be found that this number can be reduced to as few as two (e.g., relevant or non-relevant sections). This could be done as part of the classification or as part of a post-process that maps the classifier's output onto a smaller group of classes for the ultimate task. This may further improve overall performance.

In our approach, the secondary task was treated as a set of individual classifications. It may be possible to treat the secondary classification problem with a Hidden Markov Model or Conditional Random Field. Doing so may improve performance because when authors do include a given section in their briefs, the sections generally appear in a consistent order.

Last, the majority of misclassifications in both tasks appears to be the result of sparse data and infrequently used section headings. While learning curves were not created, it is suspected that additional data could provide the classifier with information about many of these sections and improve overall model performance.

With the current model, and the potential for further improvements, section-related information can reliably be identified in poorly structured legal documents with supervised machine learning methods.

References

Choi, F. Y. Y. 2000. Advances in Domain-Independent Linear Text Segmentation. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 26-33.

Evans, M. C.; McIntosh, W. V.; Lin, J.; and Cates, C. L. 2006. Recounting the Courts? Applying Automated Content Analysis to Enhance Empirical Legal Research. SSRN eLibrary.

Farzindar, A., and Lapalme, G. 2004. Legal text summarization by exploration of the thematic structures and argumentative roles. In Text Summarization Branches Out: Workshop held in conjunction with ACL 2004, 27-38.

Grover, C.; Hachey, B.; Hughson, I.; and Korycinski, C. 2003. Automatic summarisation of legal documents. In Proceedings of the 9th International Conference on Artificial Intelligence and Law, ICAIL '03, 243-251. New York, NY, USA: ACM.

Hearst, M. A. 1994. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, ACL '94, 9-16. Stroudsburg, PA, USA: Association for Computational Linguistics.

Hearst, M. A. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23(1):33-64.

Jurafsky, D., and Martin, J. H. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence). Prentice Hall, 1st edition.

Moens, M.-F., and De Busser, R. 2001. Generic topic segmentation of document texts. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, 418-419. New York, NY, USA: ACM.

Wyner, A. 2010. Weaving the legal semantic web with natural language processing. http://blog.law.cornell.edu/voxpop/2010/05/17/weaving-the-legal-semantic-web-with-natural-language-processing.