=Paper=
{{Paper
|id=None
|storemode=property
|title=A Machine Learning Approach to Identifying Sections in Legal Briefs
|pdfUrl=https://ceur-ws.org/Vol-710/paper23.pdf
|volume=Vol-710
|dblpUrl=https://dblp.org/rec/conf/maics/VanderbeckBO11
}}
==A Machine Learning Approach to Identifying Sections in Legal Briefs==
Scott Vanderbeck and Joseph Bockhorst
Dept. of Elec. Eng. and Computer Science
University of Wisconsin - Milwaukee
P.O. Box 784, 2200 E. Kenwood Blvd.
Milwaukee, WI 53201-0784

Chad Oldfather
Marquette University Law School
P.O. Box 1881
Milwaukee, WI 53201-1881
Abstract

With an abundance of legal documents now available in electronic format, legal scholars and practitioners are in need of systems able to search and quantify semantic details of these documents. A key challenge facing designers of such systems, however, is that the majority of these documents are natural language streams lacking formal structure or other explicit semantic information. In this research, we describe a two-stage supervised learning approach for automatically identifying section boundaries and types in appellee briefs. Our approach uses learned classifiers in a two-stage process to categorize white-space separated blocks of text. First, we use a binary classifier to predict whether or not a text block is a section header. Next, we classify those blocks predicted to be section headers in the first stage into one of 19 section types. A cross-validation experiment shows our approach has over 90% accuracy on both tasks, and is significantly more accurate than baseline methods.

Introduction

Now that most of the briefs, opinions and other legal documents produced by court systems are routinely encoded electronically and widely available in online databases, there is interest throughout the legal community in computational tools that enable more effective use of these resources. Document retrieval from keyword or Boolean searches is a key task that has long been a focus of natural language processing (NLP) algorithms for the legal domain. However, the simple whole-document word-count representations and document similarity measures that are typically employed for retrieval limit their relevance to a relatively narrow set of tasks. Practicing attorneys and legal academics are finding that the existing suite of tools falls short of meeting their growing and complex information needs.

Consider, for example, Empirical Legal Studies (ELS), a quickly growing area of legal scholarship that aims to apply quantitative, social-science research methods to questions of law. ELS research studies are increasingly likely to have a component that involves computational processing of large collections of legal documents. One example is studies of the role of ideological factors, which assign an ideology value (e.g., conservative or liberal) to legal briefs (Evans et al. 2006). One problem that may arise in settings like this, which employ a general similarity measure not tailored to the task at hand, is that documents are more likely to group by topic, for instance the type of law, than by, say, ideology.

One general technique that has the potential to improve performance on a wide range of ELS and retrieval tasks is to vary the influence of different sections of a document. For example, studies on ideology may reduce the influence of content in the "Statement of Facts" section while increasing the influence of the "Argument" section. However, although most briefs have similar types of sections, there are no formal standards for easily extracting them. Computational techniques are needed. Toward that end, we describe here a machine learning approach to automatically identifying sections in legal briefs.

Problem Domain

Our focus here is on briefs written for appellate court cases heard by the United States Courts of Appeals. The appeals process begins when one party to a lawsuit, called the appellant, asserts that a trial court's action was defective in one or more ways by filing an appellant brief. The other party (the appellee) responds with an appellee brief, arguing why the trial court's action should stand. In turn, the appeals court provides its ruling in a written opinion. While there is good reason to investigate methods for identifying structure in all three kinds of documents, for simplicity we restrict our focus here to appellee briefs. We conduct our experiment using a set of 30 cases heard by the First Circuit in 2004.

In the federal courts, the Federal Rules of Appellate Procedure require that appellant briefs include certain sections, and that appellees include some corresponding sections while being free to omit others. There is, however, no standard as to section order or how breaks between sections are to be indicated. Moreover, parties often fail to adhere to the requirements of the rules, with the result being that authors exercise considerable discretion in how they structure and format the documents.

Related Work

Many genres of text are associated with particular conventional structures. Automatically determining all of these types of structures for a large discourse is a difficult and unsolved problem (Jurafsky & Martin 2000).
Much of the previous NLP work in the legal domain concerns Information Retrieval (IR) and the computation of simple features such as word frequency (Grover et al. 2003).

Additional work has been done in the legal domain with a focus on summarizing documents. Grover et al. developed a method for automatically summarizing legal documents from the British legal system. Their method was based on a statistical classifier that categorized sentences according to how suitable they are as candidate text excerpts for a summary (Grover et al. 2003).

Farzindar and Lapalme (2004) also described a method for summarizing legal documents. As part of their analysis, they performed thematic segmentation on the documents. Finding that more classic methods for segmentation (Hearst 1994; Choi 2000) did not provide satisfactory results, they developed a segmentation process based on specific knowledge of their legal documents. For their study, groups of adjacent paragraphs were grouped into blocks of text based on the presence of section titles, relative position within the document, and linguistic markers.

The classic algorithm for topic segmentation is TextTiling, in which like sentences and topics are grouped together (Hearst 1997). More general methods for topic segmentation of a document are generally based on the cohesiveness of adjacent sentences. It is possible to build lexical chains that represent the lexical cohesiveness of adjacent sentences in a document based on important content terms, semantically related references, and resolved anaphors (Moens & De Busser 2001). Lexical chains and cohesiveness can then be used to infer the thematic structure of a document.

In contrast to approaches such as these, which are based on inferring the relatedness of sentences in section bodies, our approach focuses on identifying and categorizing section headers. The two kinds of approaches are complementary, as it would be relatively straightforward to construct a combined method that considers both headers and bodies.

[Figure 1 in the original is a flowchart: 5,442 blocks of text enter Task I ("Is the block a section heading?"); 5,190 blocks are predicted "No" and the remaining 252 blocks pass to Task II ("What type of section heading?"), which assigns one of 19 section types such as Introduction, Argument, Conclusion, and Standard of Review.]

Figure 1: Flowchart of our two-stage process for classifying text blocks. The first stage predicts whether or not a block of text is a section header. No further processing is done on blocks classified as non-headers. Blocks classified as headers are passed to the next stage, which predicts the section type. Numbers next to the arrows denote the total number of blocks in our annotated dataset that assort to that point.
Overview

Our analysis begins with a pre-processing step that converts documents to sequences of text blocks, roughly at the paragraph level (see below for details). We next construct feature vector representations for all blocks. Labeled training sets and supervised learning methods are used to induce two kinds of classifiers: one for distinguishing section header blocks from non-header blocks, and one for classifying the section type of headers. Figure 1 shows a flowchart of the processing for classifying a block of text in the test set. Note that although the type of non-header blocks is not predicted directly, after classification of all blocks in a document the predicted section for a non-header block is given by the type of the nearest preceding section header.
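This propagation step is simple to state in code. The sketch below is a minimal Python illustration with hypothetical field names; the paper does not publish an implementation:

```python
def propagate_section_types(blocks):
    """Assign each non-header block the section type of the nearest
    preceding block that was predicted to be a section header.

    `blocks` is a list of dicts with keys 'is_header' (bool) and, for
    predicted headers, 'section_type' (str). Blocks appearing before
    the first predicted header receive the type "None".
    """
    current_type = "None"
    for block in blocks:
        if block["is_header"]:
            current_type = block["section_type"]
        else:
            block["section_type"] = current_type
    return blocks
```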
Models and Methods

Dataset

Appellee briefs in our dataset are available as HTML files. The HTML is not well formed or standardized and provides little insight into the structure of the briefs. The HTML elements do not contain attributes, block-level elements, ids, classes, etc. that might indicate section breaks or section types. Further, document formatting is inconsistent and non-standardized. For example, one author may use italics for section headings, another bold, while yet another uses inline text. Formatting sometimes even varies from section to section within the same document. Thus, we ignore formatting such as italics or bold, and focus our analysis on the word and character sequence.

Preprocessing was performed on the documents to divide them into blocks of text. A block of text is essentially a continuous sequence of text from the original document with a line break immediately before and after. We extract blocks by converting each HTML document to an XML document that preserves all of the line breaks and white space from the original HTML. Examples of document elements that correspond to blocks extracted from the XML include paragraphs, section headings, section sub-headings, footnotes, and table-of-contents entries.
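As a rough sketch of this preprocessing step (the paper works through an intermediate XML representation of the HTML; the minimal Python below instead splits already-extracted plain text on blank lines, and all names are ours):

```python
def extract_blocks(text):
    """Split a document into white-space separated blocks of text.

    A block is a contiguous run of non-empty lines; blank lines (a line
    break immediately before and after) delimit blocks. This only
    approximates the paper's HTML-to-XML extraction step.
    """
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(" ".join(current))
            current = []
    if current:
        blocks.append(" ".join(current))
    return blocks
```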
The XML files were manually reviewed and annotated by the first author (SV). Each block is assigned two class labels:

1. is header: A binary value indicating whether or not a block is a section heading.

2. section type: A discrete value that, for section headers only, indicates the section type. As we only predict the type of header blocks, the value "None" is assigned to non-headers. Table 1 shows the section types we identified in our dataset.
Argument
Bond
Conclusion
Corporate Disclosure Statement
Introduction
Issue Presented For Review
Jurisdictional Statement
Notice To Adverse Party
Prayer
Preliminary Statement
Procedural History
Relief Sought
Standard of Review
Statement of Facts
Statement of Parent Companies And Public Companies
Statement of The Case
Summary of The Argument
Table of Authorities
Table of Contents
None

Table 1: The 20 section types in our dataset. Each predicted header block is classified as one of the 19 types other than "None."
Feature Vector Representation

Along with the two class labels, we represent each block of text with a 25-element vector of feature values. Table 2(a) lists the features we use; Table 2(b) shows the feature and class values for the block of text "II. TABLE OF CONTENTS".

(a)

Feature Name      Domain                Description
leadingAsterisk   binary                True if the block begins with an asterisk (*).
leadingNumeral    binary                True if the block begins with an Arabic or Roman numeral (optionally preceded by an asterisk).
endsInPeriod      binary                True if the block ends with a period (.).
endsInNumeral     binary                True if the block ends with an Arabic or Roman numeral.
stringLength      integer               Number of characters in the block.
percentCaps       continuous, in [0,1]  The percentage of alphabetic characters that are capitalized.
ellipses          binary                True if the block contains an ellipsis (i.e., "...").
contains("argument"), contains("authori"), contains("case"), contains("conclusion"), contains("contents"), contains("corporate"), contains("disclosure"), contains("fact"), contains("issue"), contains("jurisdiction"), contains("of"), contains("prayer"), contains("present"), contains("review"), contains("standard"), contains("statement"), contains("summary"), contains("table")
                  binary                Each of these features is an indicator for a specific string. The feature contains(s) is true if the block contains a word that begins with the string s and false otherwise.

(b)

leadingAsterisk: FALSE
leadingNumeral: TRUE
endsInPeriod: FALSE
endsInNumeral: FALSE
stringLength: 21
percentCaps: 1
ellipses: FALSE
contains("of"): TRUE
contains("table"): TRUE
contains("contents"): TRUE
(all other string-match features): FALSE
is header: TRUE
section type: Table of Contents

Table 2: (a) Features we use to represent blocks of text. (b) An example showing feature and class values for the block of text "II. TABLE OF CONTENTS".
The features chosen were engineered through visual inspection of section headings, intuition, and trial and error. Other attributes were considered, such as the length and percentage of capital letters of the previous and next blocks of text; however, these did not improve model performance. The group of features named contains(s) are string-matching features, which are true if the block of text contains a word that begins with the string s. We constructed a string-match feature for every word that occurs five or more times in the 252 header blocks.
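To make the representation concrete, here is a minimal Python sketch of the feature computation. The names are ours and the numeral patterns are a plausible reading of Table 2(a), not the authors' exact code:

```python
import re

HEADER_STRINGS = ["argument", "authori", "case", "conclusion", "contents",
                  "corporate", "disclosure", "fact", "issue", "jurisdiction",
                  "of", "prayer", "present", "review", "standard",
                  "statement", "summary", "table"]

NUMERAL = r"(?:[0-9]+|[IVXLCDM]+)"  # Arabic or Roman numeral (our reading)

def block_features(block):
    """Compute the 25 features of Table 2(a) for one block of text."""
    words = block.split()
    alpha = [c for c in block if c.isalpha()]
    feats = {
        "leadingAsterisk": block.startswith("*"),
        "leadingNumeral": re.match(r"\*?" + NUMERAL + r"\b", block) is not None,
        "endsInPeriod": block.endswith("."),
        "endsInNumeral": re.search(NUMERAL + r"$", block) is not None,
        "stringLength": len(block),
        "percentCaps": sum(c.isupper() for c in alpha) / len(alpha) if alpha else 0.0,
        "ellipses": "..." in block,
    }
    for s in HEADER_STRINGS:
        # contains(s): some word in the block begins with the string s
        feats['contains("%s")' % s] = any(w.lower().startswith(s) for w in words)
    return feats
```

Applied to the block "II. TABLE OF CONTENTS", this sketch reproduces the feature values shown in Table 2(b).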
Learning

The task of identifying section headers and their section types is divided into two steps (Figure 1). The first step classifies a block of text as either a section heading or not a section heading. For this task, supervised machine learning algorithms are used to learn a binary classifier. The second task takes each block of text classified as a heading in the first step and uses a second classifier to predict the specific type of section. Again, supervised machine learning is used to learn a classifier, this time with 19 classes. For both tasks, multiple types of classifiers were considered, including naive Bayes, logistic regression, decision trees, support vector machines, and neural networks.
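The paper does not name the toolkit used. As an illustration only, the sketch below wires the two stages together with scikit-learn estimators, reusing the hypothetical block_features function sketched earlier; logistic regression is chosen here because it is the classifier the paper reports detailed results for:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_two_stage(blocks, is_header, section_type):
    """Train the two classifiers of Figure 1.

    Stage 1 (binary) is trained on all blocks; stage 2 (19-way) is
    trained only on true header blocks, mirroring the training setup
    described in the Learning and Evaluation sections.
    """
    feats = [block_features(b) for b in blocks]
    stage1 = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    stage1.fit(feats, is_header)

    header_feats = [f for f, h in zip(feats, is_header) if h]
    header_types = [t for t, h in zip(section_type, is_header) if h]
    stage2 = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    stage2.fit(header_feats, header_types)
    return stage1, stage2

def predict_two_stage(stage1, stage2, blocks):
    """Predict header status, then section type for predicted headers."""
    feats = [block_features(b) for b in blocks]
    headers = stage1.predict(feats)
    types = ["None"] * len(blocks)
    for i, h in enumerate(headers):
        if h:
            types[i] = stage2.predict([feats[i]])[0]
    return headers, types
```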
Evaluation

With the abundance of legal documents available, it is important that they be structured in ways usable by computers (Wynera 2010). We hypothesize that the task of structuring legal documents into relevant sections can be accomplished with supervised machine learning classifiers that first identify section headers and then assign a section type to each header.

To test this hypothesis we conducted an experiment on 30 appellee briefs from cases heard by the US 1st Circuit in 2004. No effort was made to restrict the cases to a particular area of the law, and indeed a variety of different types of cases is represented in this set. The legal briefs were obtained as HTML files through WestLaw (www.westlaw.com). In the 30 documents, a total of 252 section headers were identified. Note that subsection headers are not included as part of this task, as there is very little commonality in authors' use of subsections. Additionally, subsections are generally specific to the legal case being addressed, and not the overall document. Of the 252 total section headers, 116 unique strings were identified (not accounting for any differences in formatting or upper/lower case). Manual inspection of the 116 variations revealed that the headers cluster into the 19 different section types listed in Table 1. A 20th section type, "None", was added to be used as the class label for blocks of text that do not represent section headers.

We conducted a leave-one-case-out cross-validation experiment. That is, in each experiment all blocks from one of our documents were held out of the training set and used as test data to estimate our models' ability to generalize to unseen documents.
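A minimal sketch of this protocol, assuming scikit-learn's LeaveOneGroupOut and generic train/predict callbacks (the names and structure are ours, not the authors'):

```python
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_case_out(blocks, labels, doc_ids, train_fn, predict_fn):
    """Hold out every block of one brief per fold (doc_ids gives the
    brief each block belongs to), train on the rest, and score the
    held-out document -- an estimate on fully unseen briefs."""
    n_correct, n_total = 0, 0
    for train_idx, test_idx in LeaveOneGroupOut().split(blocks, labels, groups=doc_ids):
        model = train_fn([blocks[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        preds = predict_fn(model, [blocks[i] for i in test_idx])
        n_correct += sum(p == labels[i] for p, i in zip(preds, test_idx))
        n_total += len(test_idx)
    return n_correct / n_total
```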
For the first task, all blocks of text in the training set are used. For the second task, only training-set blocks of text labeled as section headings are used for training. This decision was made because we only wish to use the second classifier to label the section type of true section headers. Also, this approach sidesteps the inconsistency that arises when a block of text is identified as a heading in the first stage but as section type "None" in the second stage. We may revisit this decision in future work, as a "None" prediction in stage two could potentially be used to catch false positives from the first stage. With the current dataset, however, we found that the number of correctly identified headings that would be relabeled "None" outweighed the false positives corrected, so we take the approach described above.

We evaluate models on the first task by the percentage of headings and non-headings correctly classified, as well as by precision and recall, where

    precision = #true positives / (#true positives + #false positives)

and

    recall = #true positives / (#true positives + #false negatives)

Note that blocks of text that are section headers represent our positive class. Precision and recall are both of particular importance for our first task: 95.4% of the blocks of text in our dataset are non-headings, so the extreme case of classifying all blocks as non-headings would result in very high overall accuracy and a 100% recall rate for non-headings, at the expense of poor precision.

We compare our machine learning approach to a regular expression baseline. The regular expression used for this baseline may be summarized as the concatenation of the following parts:

1. The beginning of the string
2. An optional asterisk
3. An optional Roman numeral or natural number, followed by an optional period and space
4. A list of zero or more all-capitalized words
5. The end of the string

Blocks that contain a match to the regular expression are predicted to be headers. This regular expression should correctly identify many section headings, as many are entirely capitalized, while excluding false positives such as table-of-contents entries, which are generally followed by a page or section number of some form.
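The paper does not print the expression itself; the following Python pattern is one plausible reconstruction of the five parts above:

```python
import re

# Reconstruction of the described baseline pattern: start of string,
# optional asterisk, optional Arabic/Roman numeral with optional period
# and space, zero or more all-capitalized words, end of string.
HEADER_RE = re.compile(
    r"^\*?"                             # parts 1-2
    r"(?:(?:[0-9]+|[IVXLCDM]+)\.? ?)?"  # part 3
    r"(?:[A-Z]+ ?)*"                    # part 4
    r"$"                                # part 5
)

def baseline_is_header(block):
    """Baseline prediction: the block matches the header pattern."""
    return HEADER_RE.match(block) is not None
```

Under this reconstruction, "II. TABLE OF CONTENTS" matches, while a table-of-contents entry such as "ARGUMENT ..... 5" does not, because the trailing dots and page number break the all-capitalized-words pattern.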
Our second task is evaluated in two ways. The first is the overall percentage of predicted headings that are assigned the correct section heading type. The second is an adjusted metric that does not penalize the second task for errors made in the first task: if the input to the second classification task was a non-heading to begin with, this classifier would inherently fail, as it is attempting to determine the section heading type when no such type actually exists.
Therefore, we account for this disparity in our results and also report the number of section heading types predicted correctly divided by the number of actual headings correctly classified by the first task.

A baseline approach is only considered for the first task of identifying whether or not a block of text is a section heading. A baseline for the secondary task of assigning one of our 20 class labels could be developed through a complicated regular expression or a form of sequential logic, but was not considered in this project. Our most frequent section heading type, "Argument", accounts for 12% of cases, so that level of accuracy could be achieved by simply always predicting "Argument".

Last, a combined metric is presented in which we merge the results from both classification steps to determine the overall percentage of section headings that are correctly identified and assigned the correct type.
Results

Task 1 - Identifying Section Headings

A total of 5,442 blocks of text were identified in our dataset. Table 3 compares the baseline method with our supervised machine learning approach on the task of identifying whether a block of text is a section heading. With the exception of naive Bayes (which performed worse), all other classifiers performed similarly.

                        Baseline    Learning Based
Total Blocks of Text:   5442        5442
Correctly Classified:   5288        5409
Percentage Correct:     97.2%       99.4%

Table 3: Results classifying section headings vs. non-section headings

As expected, the baseline approach performed very well, with 97.2% accuracy. This represents a small gain over calling all blocks non-headings (95.4%). As we hypothesized, the learning-based classifier performed much better, with 99.4% accuracy. As seen in the confusion matrix in Table 4, the logistic regression classifier had a similar number of false positives and false negatives. Precision and recall statistics are presented in Table 5. As seen in the table, there is a significant difference in the recall rates for headings (92.1% vs. 61.5%), which is of great importance to the ultimate goal.

                        Learning Based           Baseline
Actual \ Predicted   Heading   Non-Heading   Heading   Non-Heading
Heading                232         20          155         97
Non-Heading             13       5177           57       5133

Table 4: Confusion matrix for Task 1

                  Precision   Recall   F-Measure
Learning Based      0.947     0.921      0.934
Baseline            0.731     0.615      0.668

Table 5: Precision and recall of headings for the learning-based classifier vs. the baseline approach
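As a quick sanity check, the learning-based row of Table 5 follows directly from the confusion matrix in Table 4; the short Python snippet below reproduces it:

```python
# Counts from Table 4 (learning-based classifier, headings as positives).
tp, fn = 232, 20   # actual headings predicted as heading / non-heading
fp, tn = 13, 5177  # actual non-headings predicted as heading / non-heading

precision = tp / (tp + fp)  # 232/245 ~= 0.947
recall = tp / (tp + fn)     # 232/252 ~= 0.921
f_measure = 2 * precision * recall / (precision + recall)  # ~= 0.934

print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```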
Examining incorrectly classified blocks, the most frequent error involved "Standard of Review", which accounted for 24% of all errors. Examination reveals that many authors include the "Standard of Review" as a subsection of the "Argument" section of the brief, while others make it a standalone section. For example, the block of text "1. STANDARD OF REVIEW" was incorrectly classified as a heading in one instance. In this case the author did not use a numbering scheme for the primary sections ("Argument" here), but did number the sub-sections of the document, confusing our model. Similar errors occurred for the section type "Statement of Facts" and accounted for 12% of all errors. With additional post-processing of the classifications, it may be possible to account for these types of errors, further increasing model performance.
Task 2 - Predicting Section Type

Table 6 summarizes the results of the secondary classifier, which assigns section types to any block of text classified as a heading by the first task. The first task identified 245 blocks of text as headings. Of these, only 18 were assigned an incorrect section heading type, for an overall accuracy of 92.7%. However, 13 of these 18 were not actually headings to begin with, so the secondary classifier could not have assigned a correct class label. Adjusting for this, 232 blocks of text were correctly identified as headings, and of these only 5 were given an incorrect label, for an adjusted accuracy of 97.8%.

                              Count   Correctly Labeled   Percent Correct
Total Headings Identified      245          227               92.7%
Actual Headings Identified     232          227               97.8%

Table 6: Results of the secondary classifier assigning class labels

Combined Accuracy

Combining accuracy from the two tasks results in an overall recall rate of 90.1%, as seen in Table 7. Of 252 total headings, 232 were correctly identified as headings. Of those identified, 227 were assigned their correct class.

Actual Headings   Correctly Identified   Recall Rate   Correct Class   Overall Recall
      252                 232               92.1%           227            90.1%

Table 7: Combined accuracy for identifying and classifying section headings

Conclusion

We presented a supervised machine learning approach for structuring legal documents into relevant sections. Our approach is based on two steps. The first step identifies blocks of text that are section headings. In the second step, blocks of text classified as section headings are input into a second classifier to predict the section type.
We evaluated our approach with a cross-validation experiment. The first task of identifying section headers using a binary logistic regression classifier was shown to perform with 99.4% accuracy. The secondary task then determines the type of section with 92.7% accuracy. The NLP approach provides a 2.2% improvement in accuracy over the baseline regular expression approach and, more importantly, provides a significantly higher recall rate in identifying section headings vs. non-section headings.

While it may be possible to create a non-learning-based approach (more complex than the baseline presented) to perform the given subtask, it has been shown that a machine learning and NLP approach is very well suited to this problem. This paper only researched appellee briefs, but there is ample reason to believe that this approach would provide similar results for appellant briefs, judges' written opinions, and other similar documents.

The significance of our learned models having considerably higher recall rates than baseline models becomes even greater when one considers that approaches would be available to correct or account for false positives (i.e., non-headings classified as headings); however, it would be far more difficult, if even possible, to correct for false negatives (i.e., actual headings classified as non-headings).

While not formally discussed in this paper, it is possible to implement secondary logic to correct for some of the classification errors we encountered. For instance, our most frequent error in the first task involved "Standard of Review". Logic could be implemented as a post-processing step that says: if a block of text is called a section heading and classified with the section heading type "Standard of Review", but is preceded by the section type "Argument", remove it as a section heading. In our dataset this correction would fix 5 of the 7 mistakes made labeling "Standard of Review" and improve accuracy to 99.5% for the first task and 94.6% for the second task.
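As a sketch of how such a post-processing rule might look (our own formulation in Python with hypothetical field names, not the authors' implementation):

```python
def demote_spurious_standard_of_review(blocks):
    """Post-processing rule: a block predicted to be a "Standard of
    Review" heading while the current section is "Argument" is demoted
    to body text, since it is usually a subsection of the Argument.
    """
    current_type = "None"
    for block in blocks:
        if block["is_header"]:
            if (block["section_type"] == "Standard of Review"
                    and current_type == "Argument"):
                block["is_header"] = False          # demote to body text
                block["section_type"] = current_type
            else:
                current_type = block["section_type"]
    return blocks
```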
In addition, allowing the secondary classifier to assign the class label "None" could correct some false positives incorrectly classified as section headings by the first task. In our dataset, 4 such corrections could have been made, further improving accuracy. However, when implementing this change one must weigh the risk of giving an actual section heading the type "None" against the improvement from corrections.

We considered 20 different potential class labels for each section. For specific tasks it may be found that this number can be reduced to as few as two (i.e., relevant or non-relevant) sections. This could be done as part of the classification or as a post-process mapping the classifier's output to a smaller group of classes for the ultimate task, which may further improve overall performance.

In our approach, the secondary task was treated as a set of individual classifications. It may be possible to treat the secondary classification problem as a Hidden Markov Model or Conditional Random Field. Doing so may improve performance since, when an author does include a section in his or her legal brief, the sections generally appear in a consistent order.

Last, the majority of misclassifications in both tasks appears to be the result of sparse data and infrequently used section headings. While learning curves were not created, it is suspected that additional data could provide the classifier with information about many of these sections and improve overall model performance.

With the current model, and the potential for further improvements, section-related information can reliably be identified in poorly structured legal documents with supervised machine learning methods.

References

Choi, F. Y. Y. 2000. Advances in Domain-Independent Linear Text Segmentation. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 26-33.

Evans, M. C.; McIntosh, W. V.; Lin, J.; and Cates, C. L. 2006. Recounting the Courts? Applying Automated Content Analysis to Enhance Empirical Legal Research. SSRN eLibrary.

Farzindar, A., and Lapalme, G. 2004. Legal text summarization by exploration of the thematic structures and argumentative roles. In Text Summarization Branches Out, workshop held in conjunction with ACL 2004, 27-38.

Grover, C.; Hachey, B.; Hughson, I.; and Korycinski, C. 2003. Automatic summarisation of legal documents. In Proceedings of the 9th International Conference on Artificial Intelligence and Law, ICAIL '03, 243-251. New York, NY, USA: ACM.

Hearst, M. A. 1994. Multi-paragraph segmentation of expository text. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, ACL '94, 9-16. Stroudsburg, PA, USA: Association for Computational Linguistics.

Hearst, M. A. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23(1):33-64.

Jurafsky, D., and Martin, J. H. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition (Prentice Hall Series in Artificial Intelligence). Prentice Hall, 1st edition.

Moens, M.-F., and De Busser, R. 2001. Generic topic segmentation of document texts. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, 418-419. New York, NY, USA: ACM.

Wynera, A. 2010. Weaving the legal semantic web with natural language processing. http://blog.law.cornell.edu/voxpop/2010/05/17/weaving-the-legal-semantic-web-with-natural-language-processing.