=Paper=
{{Paper
|id=Vol-2143/paper2
|storemode=property
|title=Automating Judicial Document Analysis
|pdfUrl=https://ceur-ws.org/Vol-2143/paper2.pdf
|volume=Vol-2143
|authors=L. Karl Branting
|dblpUrl=https://dblp.org/rec/conf/icail/Branting17
}}
==Automating Judicial Document Analysis==
L. Karl Branting
The MITRE Corporation
7515 Colshire Drive
McLean, VA 22102, USA
lbranting@mitre.org
ABSTRACT
Collections of documents filed in courts are potentially a rich source of information for citizens, attorneys, and courts, but courts typically lack the ability to interpret them automatically. This paper presents technical approaches to three applications of judicial document interpretation: detection of document filing errors; matching orders with the motions that they resolve; and predicting the outcome of routine cases. In empirical evaluations on filings from two representative large US District Courts, the highest accuracy in identifying filing errors was achieved by combining procedural context features with high information-gain lexical features; TF/IDF similarity was found to be an effective criterion for finding motions that correspond to orders; and induction over the texts of prior simple and routine decisions was found to produce a model capable of accurately predicting outcomes from case facts without any manually engineered features or factors.

In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK. Copyright © 2017 held by the authors. Copying permitted for private and academic purposes. Published at http://ceur-ws.org
1 INTRODUCTION
The transition from paper to electronic filing in national, local, and administrative courts, which began in the late 1990s, has transformed how courts operate and how judges, court staff, attorneys, and the public create, submit, and access court filings. However, despite many advances in judicial access and administration brought about by electronic filing, courts are typically unable to interpret the contents of court filings automatically. Instead, court filings are interpreted only when they are read by an attorney, judge, or court staff member.
Machine interpretation of court filings promises a rich source of information for improving court administration and case management, access to justice, and analysis of the judiciary. The development of large-scale text analytics makes such interpretation increasingly feasible, as collections of court documents are, in effect, annotated by the metadata generated when they are submitted, by corrections when they are audited, or, for those documents that are motions or claims, by the decisions of judges or other decision makers.

However, there are numerous challenges to automating the interpretation of case filings. Courts often accept documents in the form of PDFs created from scans. Scanned PDFs require optical character recognition (OCR) for text extraction, but this process introduces many errors and does not preserve the document layout, which contains important information about the relationships among text segments in the document. Moreover, the language of court filings is complex and specialized, and the function of a court filing depends not just on its text and format, but also on its procedural context. As a result, successful automation of court filings requires overcoming a combination of technical challenges.

This paper describes the nature of court dockets and databases, sets forth three classes of representative judicial document analysis tasks–docket error detection, order/motion matching, and decision prediction–proposes technical approaches to each of the tasks, and presents preliminary empirical evaluations of the effectiveness of each approach.
2 COURT DOCKETS AND DATABASES
A court docket is a register of document-triggered litigation events, where a litigation event consists of either (1) a pleading, motion, or letter from a litigant, (2) an order, judgment, or other action by a judge, or (3) a record of an administrative action (such as notifying an attorney of a filing error) by a member of the court staff. Each docket event in a typical electronic case management system includes (1) metadata generated at the time of filing, including both case-specific data (e.g., case number, parties, judge) and event-specific data (e.g., the attorney submitting the document, the intended document type) and (2) a text document in PDF format. Each of the two court databases in which the experiments described below were performed contained filings for over 400,000 cases involving over 1,000,000 litigants, attorneys, and judges, over 10,000,000 docket entries, and more than 4,000,000 documents.
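For illustration, a docket event of this kind might be modeled as the following record; the field names are assumptions for exposition, not the schema of an actual case management system:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DocketEvent:
    """One docket entry: filing-time metadata plus the filed document.

    Field names here are illustrative assumptions, not the schema of
    any actual case management system.
    """
    case_number: str                 # case-specific metadata
    parties: List[str]
    judge: str
    filer: str                       # who submitted the document
    asserted_event_type: str         # document type selected by the filer
    pdf_path: str                    # the filed PDF (often a scan)
    ocr_text: Optional[str] = None   # OCR output, when extracted
```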
3 DOCKET ERROR DETECTION
There are many kinds of docket errors, including defects in a submitted document (e.g., missing signature, sensitive information in an unsealed document, missing case caption) and mismatches between the content of a document and the context of the case (e.g., wrong parties, case number, or judge; mismatch between the document title and the document type asserted by the user). Some errors consist of violations of a particular court's local rules and are therefore unique to that court. Other errors, such as filing in a wrong case, constitute errors in any court. In either case, detection of defects at submission time could spare attorneys the embarrassment of submitting a defective document and the inconvenience and delays of refiling. For court staff, automated filing error detection could reduce the quality control (QC) auditing staff required for filing errors, a significant drain on resources in many courts.

3.1 Error Detection through Text Classification
In the court in which the first set of experiments was conducted, the QC staff review filings to detect a variety of docket errors, including the following four error types:
• Event-type errors, i.e., specifying the wrong event type for a document, e.g., submitting a Motion for Summary Judgment as a Counterclaim. In experiments involving this court, there were 20 event types, such as complaint, transfer, notice, order, service, etc.
• Main-vs-attachment errors, i.e., filing a document that should be filed as an attachment to another document (such as an exhibit) as a main document, or filing a document that should be filed as a main document (such as a Memorandum in Support of a Motion for Summary Judgment) as an attachment.
• Show-cause order errors. In some courts, only judges are permitted to file show-cause orders; it is an error if an attorney does so.
• Letter-motion errors. In some courts, certain routine motions can be filed as letters, but all other filings must have a formal caption. Recognizing these errors requires distinguishing letters from non-letters.

Figure 1: The information gain of the 10 highest-information terms for 3 legal-document classification tasks.
Event-type errors appear to be the most common docket errors in U.S. District courts.

Each of these filing errors can be detected by classifying a document with respect to the corresponding set of categories (event type, main vs. attachment, show-cause order vs. non-show-cause order, or letter vs. non-letter) and evaluating whether the category is consistent with the metadata generated in the docket system by the filer's selections. Event-type document classification is particularly challenging because document types are both numerous and skewed, having a roughly power-law frequency distribution in the test set.

The first set of experiments attempted to identify each of the four docket errors above by classifying document text and determining whether there is a conflict between the apparent text category and the document's metadata. Classification was performed with the LingPipe¹ LMClassifier, which performs joint probability-based classification of token sequences into non-overlapping categories based on language models for each category and a multivariate distribution over categories.

¹ http://alias-i.com/lingpipe/
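As an illustration of this style of classifier, the following minimal sketch (an approximation for exposition, not LingPipe's actual implementation) trains one smoothed bigram language model per category plus a category prior, and classifies a token sequence by its highest joint log-probability:

```python
import math
from collections import Counter, defaultdict

class NGramLMClassifier:
    """Joint-probability classifier: an add-one-smoothed bigram language
    model per category plus a category prior, loosely in the spirit of a
    language-model classifier (not LingPipe's actual implementation)."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # category -> bigram counts
        self.unigrams = defaultdict(Counter)  # category -> context counts
        self.doc_counts = Counter()           # category -> training docs
        self.vocab = set()

    def train(self, tokens, category):
        self.doc_counts[category] += 1
        padded = ["<s>"] + list(tokens)
        for prev, cur in zip(padded, padded[1:]):
            self.bigrams[category][(prev, cur)] += 1
            self.unigrams[category][prev] += 1
            self.vocab.add(cur)

    def classify(self, tokens):
        total_docs = sum(self.doc_counts.values())
        v = len(self.vocab) + 1              # smoothing denominator term
        best, best_score = None, float("-inf")
        for cat in self.doc_counts:
            score = math.log(self.doc_counts[cat] / total_docs)  # prior
            padded = ["<s>"] + list(tokens)
            for prev, cur in zip(padded, padded[1:]):
                num = self.bigrams[cat][(prev, cur)] + 1  # add-one smoothing
                den = self.unigrams[cat][prev] + v
                score += math.log(num / den)
            if score > best_score:
                best, best_score = cat, score
        return best
```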
3.1.1 Term Selection and Document Truncation. Court filings can be thought of as comprising four distinct sets of terms:
• Procedural words, which describe the intended legal function of the document (e.g., "complaint," "amended," "counsel")
• "Stop-words," which are non-content common words, such as "of" and "the"
• Words unique to the case, such as names, and words expressing the narrative events giving rise to the case; and
• Substantive (as opposed to procedural) legal terms (e.g., "reasonable care," "intent," "battery").

Terms in the first of these sets–procedural words–carry the most information about the type of the document. These words tend to be concentrated around the beginning of legal documents, often in the case caption, and at the end, where distinctive phrases like "so ordered" may occur.

We hypothesized that only procedural terms are relevant to the type of a document, so we explored approaches to filtering non-procedural terms. Elimination of irrelevant terms can not only speed execution, but in some cases has been shown to increase accuracy [13].

Three approaches to term selection were investigated: two ad hoc and domain-specific, and one general and domain-independent. The first approach was to eliminate all terms except non-stopwords that occur in the Federal Rules of Civil Procedure [7]. A related alternative approach was to remove all terms except non-stopwords occurring in the "event" (i.e., document) descriptions typed by filers when they submit documents into the docket system. The third approach was to select terms based on their mutual information with the text categories of each particular classification task [3]. The first lexical set, termed FRCP, contains 2,658 terms; the second, termed event, consists of 513 terms. Separate mutual-information sets were created for each classification task, reflecting the fact that the information gain of a term depends on the category distribution of the documents. For example, Figure 1 shows the 10 highest-information terms for three different classification tasks: event-type classification, distinguishing letters from non-letters, and show-cause order detection, illustrating that the most informative terms differ widely depending on the classification task. Figure 2 illustrates the reduction of full document text to just high information-gain terms, which typifies the vocabulary-reduction process.

Figure 2: Reduction of a full document to just high information-gain terms.
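The third approach can be illustrated with a minimal sketch of information-gain term selection; the helper functions are hypothetical, and the exact computation used in the experiments is not specified beyond mutual information with the category labels [3]:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """IG of a term's presence/absence with respect to category labels.
    docs is a list of token sets, labels the parallel category labels."""
    h_c = entropy(list(Counter(labels).values()))
    with_t = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_t = [lab for doc, lab in zip(docs, labels) if term not in doc]
    h_cond = 0.0
    for subset in (with_t, without_t):
        if subset:
            weight = len(subset) / len(docs)
            h_cond += weight * entropy(list(Counter(subset).values()))
    return h_c - h_cond

def select_terms(docs, labels, threshold):
    """Keep terms whose information gain exceeds a cutoff, mirroring the
    ig_small/ig_large thresholds reported in Table 1."""
    vocab = set().union(*docs)
    return {t for t in vocab if information_gain(docs, labels, t) > threshold}
```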
Table 1: Thresholds and sizes of the large and small high information-gain term sets (threshold, with set size in parentheses).

          showcause      main_attch     types        letter
ig_small  0.01 (135)     0.025 (262)    0.1 (221)    0.0005 (246)
ig_large  0.0025 (406)   0.0125 (914)   0.05 (689)   0.00001 (390)
Several approaches to document truncation were explored as well. The first was to limit the text to the first l tokens of the document (i.e., excise the remainder of the document). If l is sufficiently large, this is equivalent to including the entire document. A second option is to include the last l tokens of the document (the suffix) as well as the prefix.
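The truncation options just described can be made concrete with a small sketch (the function and parameter names are illustrative, not from the paper):

```python
def truncate(tokens, prefix_len, suffix_len=0):
    """Keep the first prefix_len tokens and, optionally, the last
    suffix_len tokens. If the document is no longer than the requested
    window, this is equivalent to keeping the entire document."""
    if len(tokens) <= prefix_len + suffix_len:
        return list(tokens)
    head = tokens[:prefix_len]
    tail = tokens[len(tokens) - suffix_len:] if suffix_len else []
    return head + tail
```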
3.1.2 Evaluation of Alternative Term Reduction Approaches. Two different information-gain thresholds were tested for each classification type, intended to create one small set of very-high-information terms (ig_small) and a larger set created using a lower threshold (ig_large). The thresholds and sizes of the large and small high information-gain term sets are set forth in Table 1. The text of each document was obtained by OCR using the open-source program Tesseract [20]. Each text was normalized by removing non-ASCII characters and standardizing case prior to term selection, if any.

Figure 3 shows a comparison of four vocabulary alternatives on the four text classification tasks described above. These tests measured mean f-measure in 8-fold cross validation using a 4-gram language model, 50-token prefix length, and no suffix. In the baseline vocabulary set, non-ASCII characters, numbers, and punctuation were removed and tokens were lower-cased. The results show that classification accuracy using an unreduced vocabulary was significantly lower than the best reduced-vocabulary performance for show-cause order detection and type classification. Term selection had little effect on accuracy for the letter and main-vs-attachment detection tasks. No reduced-vocabulary set consistently outperformed the others. This indicates that restricted term sets derived through information gain perform roughly as well as those produced using domain-specific information, suggesting that the reduced-vocabulary approach is appropriate for situations in which domain-specific term information is unavailable.

Figure 3: Classification accuracy as a function of reduced vocabulary (8-fold cross validation using a 4-gram language model, 50-token prefix length, and no suffix).

Summarizing over the tests, the highest mean f-measure based on text classification alone, and the particular combination of parameters that led to this accuracy for each classification task, were as follows:
(1) Event type: 0.743 (prefix=50, 4-gram, ig_large vocabulary, 20 categories)
(2) Main-vs-attachment: 0.871 (prefix=256, 6-gram, event vocabulary)
(3) Show-cause order: 0.957 (prefix=50, 5-gram, ig_small vocabulary)
(4) Letter-vs-non-letter: 0.889 (prefix=50, no 4-gram, ig_large vocabulary)
3.2 Incorporating Procedural Context Features
The accuracy of event-type detection (f-measure of roughly 0.743 under the best combinations of parameters) is sufficiently low that its utility for many auditing functions may be limited. An analysis of the classification errors produced by the event-type text classification model indicated that a document's event type depends not just on the text of the document but also on its procedural context. For example, motions and orders are sometimes extremely similar because judges grant a motion by adding and signing an order stamp to the motion. Since stamps and signatures are seldom accurately OCR'd, the motion and order may be indistinguishable by the text alone under these circumstances. However, orders can be issued only by a judge, and judges never file motions, so the two cases can be distinguished by knowing the filer. In addition, attachments have the same event type as the main document in CM/ECF. So, for example, a memorandum of law is ordinarily a main document, but in some courts a memorandum can be filed as an attachment, in which case its event type is the same as that of the main document to which it is attached.

Contextual information potentially relevant to a document's type includes: whether it was filed as a main document or as an attachment; the filer (e.g., attorney, clerk, judge); the type of the case (e.g., criminal, civil, multi-district); and the document length (e.g., memoranda are typically long; minute orders typically short). Combining these non-lexical features with text features requires a different classifier than the language-model classifier used in the first set of experiments.
Figure 4: Event type classification accuracy as a function of reduced vocabulary (10-fold cross validation using a 2-gram language model, normalization of dates, numbers, and parties, 100-token prefix length, and minimum token frequency of 32, with 43 event types).
We compared the performance of Support Vector Machine (SVM) learning (WEKA's implementation of Platt's algorithm for sequential minimal optimization [10, 16]) and Random Forests [5], both in the WEKA [9] implementation, on the task of filing event classification. For each filing event, the document text was normalized by filtering stop words, normalizing dates and numbers to standard tokens, and replacing each instance of a party name with the role of that party (e.g., DFT, PTF). The result was combined with the contextual features and converted into a sparse n-gram frequency vector from which the {1,2,4,8} thousand highest information-gain features were selected (unsurprisingly, the contextual features always had higher information gain than any lexical feature). The training set consisted of 28,763 main documents having 43 distinct types, representing two months' filings in a large US District court.

As shown in Figure 4, the SVM was consistently more accurate, with little variation in accuracy as a function of the number of features, although accuracy was slightly higher at 4,000 features than at other feature-set sizes. By contrast, the accuracy of the random forest diminished with increasing numbers of features. The highest-accuracy SVM configuration, f-measure of 0.926, was much higher than the maximum observed with text-only classification (albeit in a different court). This suggests that including procedural context features is essential for accurate document filing type identification in judicial databases, and that algorithms that can handle both textual and categorical features should be used for this task.
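A sketch of this combination of lexical and procedural-context features is shown below. It uses scikit-learn as a stand-in for the WEKA SMO and Random Forest implementations used in the experiments; the column names, feature count, and estimator choice are illustrative assumptions:

```python
# Combine sparse n-gram text features with categorical procedural-context
# features, select the highest-mutual-information features, and train a
# linear SVM (scikit-learn stand-in for WEKA's SMO).
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("features", ColumnTransformer([
        # sparse n-gram frequency features from the normalized text
        ("text", Pipeline([
            ("ngrams", CountVectorizer(ngram_range=(1, 2))),
            ("select", SelectKBest(mutual_info_classif, k=4000)),
        ]), "text"),
        # categorical procedural-context features (hypothetical columns)
        ("context", OneHotEncoder(handle_unknown="ignore"),
         ["filer", "main_or_attachment", "case_type"]),
    ])),
    ("svm", LinearSVC()),
])
# Usage (train_df is a hypothetical pandas DataFrame):
# pipeline.fit(train_df[["text", "filer", "main_or_attachment", "case_type"]],
#              train_df["event_type"])
```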
4 ORDER/MOTION MATCHING
In many federal courts, docket clerks are responsible for filing orders executed by judges into the docket system, a process that requires the clerk to identify all pending motions to which the order responds and to link the order to those motions. This entails reading all pending motions, a tedious task. If the motions corresponding to an order could be identified automatically, docket clerks would be relieved of this laborious task. Even ranking the motions by their likelihood of being resolved by a given order would decrease the burden on docket clerks. Moreover, order/motion matching is a subtask of a more general issue-chaining problem, which consists of identifying the sequence of preceding and subsequent documents relevant to a given document.

A straightforward approach to this task is to treat order/motion matching as an information-retrieval task, under the hypothesis that an order is likely to have a higher degree of similarity to its corresponding motions than to motions that it does not rule on. An obvious approach is to present pending motions to the clerk in rank order of their TF/IDF²-weighted cosine similarity to the order.

² Term Frequency/Inverse Document Frequency

The evaluation above showing that term selection improves document classification raises the question whether term selection might be beneficial for order/motion matching as well. A second question is whether the IDF model should be trained on an entire corpus of motions and orders or whether acceptable accuracy can be obtained by training just on the order and pending motions.

To evaluate the effectiveness of this approach to order/motion matching, a subset of the document set described above was collected consisting of 3,356 groups, each comprising (1) an order, (2) a motion that the order rules on (a triggering motion), and (3) a non-empty set of all motions that were pending at the time of the order but not ruled on by the order (non-triggering motions). The mean number of motions per group was 5.87 (i.e., there were on average 4.87 non-triggering motions). For each group, all motions were ranked by similarity to the order under the given metric. The proportion of triggering motions that were ranked first and the mean rank of the triggering motion were calculated from each group's ranking.

These groups were evaluated using three term selection approaches: the raw document text (which often contains many OCR errors); normalization, as described above; and event terms. The two alternative TF/IDF training models were applied to each of the three term selection approaches, for a total of 6 combinations. For each combination, the mean rank of the triggering motion among all the motions was determined.

Figure 5: The proportion of groups for which the order is more similar to the triggering motion than to any other motion.

Figure 5 shows that the highest accuracy, as measured by the proportion of triggering motions that were ranked first among all pending motions, was achieved by normalizing the text without term selection. Intuitively, reduction to procedurally relevant terms improves the ability to determine what docket event a document performs, but can reduce the ability to discern the similarity between corresponding pairs of documents. TF/IDF training on just the order and pending motions (local) is at least as accurate as training over all orders and motions (all). Figure 6 shows the mean rank (zero-indexed) of the most similar motion under each of the six conditions. The best (lowest) mean rank was achieved with normalization and local TF/IDF training.
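A minimal sketch of this ranking criterion, using scikit-learn's TfidfVectorizer as a stand-in for whatever TF/IDF implementation the experiments used, with IDF statistics computed "locally" from just the order and its pending motions:

```python
# Rank a group's pending motions by TF/IDF cosine similarity to the order.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_pending_motions(order_text, motion_texts):
    """Return indices of motion_texts sorted by TF/IDF-weighted cosine
    similarity to the order, most similar first. The IDF model is fit
    'locally' on just this group of documents."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([order_text] + motion_texts)
    sims = cosine_similarity(matrix[0], matrix[1:]).ravel()
    return sorted(range(len(motion_texts)), key=lambda i: -sims[i])
```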
Figure 6: The mean rank of the triggering motion among all pending motions, zero-indexed (lower is better, zero is perfect).

It is not unusual for a single order to rule on multiple pending motions. A more realistic assessment of the utility of pending-motion ranking is therefore to determine how many non-triggering motions a clerk would have to consider if the clerk read each motion in rank order until every motion ruled on by the order is found. One way to express this quantity is as mean precision at 100% recall. In the test set described above, using text normalization and local TF/IDF training, mean precision at 100% recall was 0.83, indicating that the number of motions that a clerk would have to read was significantly reduced.
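For concreteness, a small sketch of this metric for a single ranked group (a hypothetical helper, not code from the paper): if the lowest-ranked triggering motion appears at 1-indexed position k and there are t triggering motions, a clerk reading in rank order reads k motions, so precision at 100% recall is t/k.

```python
def precision_at_full_recall(ranked_motion_ids, triggering_ids):
    """Precision at 100% recall for one order: of the motions a clerk
    must read in rank order to see every triggering motion, the fraction
    that are actually triggering motions. Assumes every triggering id
    appears somewhere in the ranking."""
    triggering = set(triggering_ids)
    last_hit = max(i for i, m in enumerate(ranked_motion_ids)
                   if m in triggering)
    return len(triggering) / (last_hit + 1)
```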
5 DECISION PREDICTION
Predictive models of decision making could be useful to pro se litigants (for help in understanding the strength of a case), to attorneys (for help in making strategic litigation decisions), and for training and decision support for judges and other decision makers. Even if the accuracy of predictive models were only approximate, they could nevertheless be valuable for decision support by helping to identify the most relevant words, phrases, or other features of a case record and the most relevant previous decisions.

Highly accurate predictive models would require very detailed linguistic analysis of the text of case records and decisions, including argument structure, narrative analysis, etc. [8]. However, predictive models induced from simpler lexical features may be sufficiently accurate to be useful for the tasks listed above. Inducing such models can be cast as supervised concept learning over corpora of case records and decisions, where each decision is treated as a category label for the corresponding case record. This approach is feasible only for simple and routine cases for which it is possible to enumerate a small set of category labels, such as granting or denying a specific benefit or form of relief. However, such simple and routine cases are characteristic of many forms of administrative adjudication, such as immigration status and benefits entitlement.

Unfortunately, in many simple and routine administrative domains, only the decisions themselves, but not the underlying case records, are available. However, in such cases, the statement of facts in the decision can be used as a proxy for the contents of the corresponding case record. This approach was applied to decisions of the European Court of Human Rights in [1], which found that case outcomes could be predicted to some degree from statements of fact. The predictability of case outcomes from the statement of facts in the decision document doesn't conclusively demonstrate that the outcome would be equally predictable from the raw case record; the decision maker's description of the facts may have been tailored to fit the outcome. However, a demonstration that case outcomes can be predicted to some extent by models trained from fact statements alone may encourage courts and agencies to experiment with this approach to creating decision-support tools for pro se litigants and decision makers.

Accordingly, an experiment was performed to evaluate the feasibility of predicting decisions from the fact statements of cases in a representative domain: World Intellectual Property Organization (WIPO) domain name decisions.³ Domain name decisions resolve disputes between a domain name registrant and a third party under the Uniform Domain Name Dispute Resolution Policy (UDRP).⁴ The UDRP Administrative Procedure applies to disputes concerning an alleged abusive registration of a domain name under the following criteria:

• The domain name registered by the domain name registrant is identical or confusingly similar to a trademark or service mark in which the complainant (the person or entity bringing the complaint) has rights; and
• The domain name registrant has no rights or legitimate interests in respect of the domain name in question; and
• The domain name has been registered and is being used in bad faith.

³ http://www.wipo.int/amc/en/domains/decisions.html
⁴ https://www.icann.org/resources/pages/policy-2012-02-25-en

WIPO decisions have a very consistent structure, including sections for History, Background, Contentions, Findings, and Decision. Just two distinct decisions are possible: transferring the domain name or denying the complaint. As a result, the decisions are well suited to the supervised concept-learning approach described above.
5.1 Experimental Design
Six thousand six hundred WIPO decisions were downloaded and parsed into the five sections described above. Each decision was labeled TRUE or FALSE based on whether the decision transferred the domain name (TRUE) or denied the claim (FALSE). The resulting set of cases had significant class skew, with 6,000 instances of TRUE but only 500 instances of FALSE. For a preliminary study, the TRUE instances were randomly subsampled to 500 to create a balanced test set with 500 instances of each category.

This balanced set of cases was converted into a series of test sets differing in which sections were included as the text of each instance. The sections tested were as follows:
• History
• Background
• Contentions
• The concatenation of History, Background, and Contentions
• Findings

Figure 7: Mean f-measure in ten-fold cross-validation with Support Vector Machine prediction of WIPO case outcomes.
The text of each instance was normalized by standardizing case and removing punctuation and, in addition, either (1) removing stop words, or (2) retaining stop words but replacing dates and numbers with standard tokens ("NUMBER" or "DATE"). The test condition in which the text consists of Findings is included for completeness, although it is not a good proxy for the case record because it contains conclusions about the facts.

For each selection of case sections and standardization method, the text was converted into n-gram frequency vectors for n=1–4, with only those n-grams retained that occur at least 8 times. The result was converted into sparse arff format,⁵ loaded into Weka, and evaluated in 10-fold cross-validation using Weka's implementation of a support vector machine (SVM) with sequential minimal optimization.

⁵ http://www.cs.waikato.ac.nz/ml/weka/arff.html
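A sketch of this experimental pipeline, using scikit-learn as a stand-in for Weka's SMO implementation; note that min_df (a document-frequency cutoff) below only approximates the paper's corpus-frequency cutoff of 8:

```python
# Outcome prediction from fact-statement text: 1- to 4-gram frequency
# features with a minimum-frequency cutoff, a linear SVM, and 10-fold
# cross-validation scored by f-measure.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate_sections(texts, outcomes):
    """texts: one string per case (e.g., the concatenated History,
    Background, and Contentions sections); outcomes: True for transfer,
    False for denial. Returns the mean f-measure over 10 folds."""
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 4), min_df=8),
        LinearSVC(),
    )
    return cross_val_score(model, texts, outcomes, cv=10, scoring="f1").mean()
```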
with the outcome of the case. One may view high information-gain
Table 2: Mean f-measure in ten-fold cross-validation with Sup- phrases as being similar to the factors in [2] with the difference that
port Vector Machine prediction of WIPO case outcomes. The they are induced automatically rather than being crafted manually.
text of each instance consists of the History (H), Background The SVM decision surface represents the set of tradeoffs among
(B), Contentions (C), all three (HBC) or Findings (F) section. these factors that is most consistent with the training data, in a man-
ner reminiscent of [6], but without the necessity of domain-specific
H B C HBC F hand-engineered factors.
stopwords 0.943 0.758 0.822 0.950 0.955 WIPO domain name dispute cases may be particularly conducive
nums/dates 0.902 0.750 0.813 0.948 0.960 to predictive modeling owing to their binary outcomes and relatively
stereotypical fact patterns. This experiment does not address the
differences between the case record and the facts as summarized in
the decision document, and the evaluation above artificially dimin-
5.2 Experimental Results ished the effect of class skew by subsampling to produce a balanced
As set forth in Table 2 and Figure 7, the greatest predictive accuracy test set. Nevertheless, the impressive accuracy of a predictive model
was achieved by the combination of the History, Background, and trained on raw text without any feature design or knowledge en-
Contentions sections of each case (HBC). The predictive accuracy gineering suggests that this approach may have great promise for
from these three sections, f-measure of roughly 0.95, was almost as increasing access to justice for pro se litigants and improving train-
high as the accuracy of prediction based on the text of the Findings ing and decision support for decision makers in domains with many
section. routine adjudications.
To understand why the HBC text is so predictive, it is helpful
to examine the terms with the highest mutual information with the 6 RELATED WORK
concept to be predicted, some of which are shown in Figure 8. This The history of applying text classification techniques to legal doc-
5 http://www.cs.waikato.ac.nz/ml/weka/arff.html uments dates back at least to the 1970s [4]. Text classification has
6 RELATED WORK
The history of applying text classification techniques to legal documents dates back at least to the 1970s [4]. Text classification has been recognized as of particular importance for electronic discovery [18]. Little prior work has addressed classification of docket entries other than that of Nallapati and Manning [14], which achieved an f-measure of 0.8967 in distinguishing Orders to Show Cause from other document types using a hand-engineered feature set.

There is extensive current activity in predictive models trained on factors unrelated to the merits of the case, such as the nature of suit, attorneys, forum, judge, and parties [19]. Recent startups marketing predictive models for litigation support based on non-merits-based factors include Lex Machina [11], LexPredict [12], and Premonition [17]. The insurance industry has a long history of developing decision prediction based on the merits of a claim, but these models are typically manually constructed, e.g., [15]. Outcome prediction based on the merits of the case as extracted directly from raw text is a relatively new research area, with little work outside of [1].

7 SUMMARY AND FUTURE WORK
Judicial document collections contain a rich trove of potential information, but analyzing these documents presents many challenges. This paper has demonstrated how many types of filing error detection can be formulated as text classification problems. The highest accuracy was obtained by combining lexical features that characterize the document itself with procedural context features that indicate the role that the document is intended to play. These results demonstrate the feasibility of automating portions of the process of auditing court submissions, which could significantly reduce a persistent drain on court resources.

The experiment with order/motion matching demonstrates that while term selection may improve accuracy for document classification, it can decrease accuracy for tasks that involve matching based on overall similarity rather than procedural similarity.

The demonstration of outcome prediction in WIPO decisions illustrates that for case corpora with a limited set of possible outcomes and relatively stereotypical fact patterns, decision models of impressive accuracy can be induced without hand-engineered features, simply from the fact descriptions. This approach may be particularly promising for decision support and improved access to justice in the simpler and more routine end of the judicial spectrum.

No single technology is applicable to all judicial documents, nor is any approach sufficient for all document analysis tasks. However, each addition to this suite of technologies adds to the capabilities available to the courts, government agencies, and citizens to exploit the deep well of information latent in judicial document corpora.
ACKNOWLEDGMENT
The MITRE Corporation is a not-for-profit Federally Funded Research and Development Center chartered in the public interest. This document is approved for Public Release, Distribution Unlimited, Case Number 17-0362. ©2017 The MITRE Corporation. All rights reserved.
REFERENCES
[1] N. Aletras, D. Tsarapatsanis, D. Preotiuc-Pietro, and V. Lampos. Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective. PeerJ Computer Science, October 24, 2016. https://peerj.com/articles/cs-93/.
[2] V. Aleven and K. Ashley. Doing things with factors. In Proceedings of the Third European Workshop on Case-Based Reasoning (EWCR-96), pages 76–90, Lausanne, Switzerland, November 1996.
[3] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, July 1994.
[4] J. Boreham and B. Niblett. Classification of legal texts by computer. Information Processing & Management, 12(2):125–132, 1976.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.
[6] S. Brüninghaus and K. Ashley. Generating legal arguments and predictions from case texts. In Proceedings of the Tenth International Conference on Artificial Intelligence & Law (ICAIL-05), pages 65–74, Bologna, Italy, June 6–11, 2005.
[7] Legal Information Institute, Cornell University Law School. The Federal Rules of Civil Procedure. https://www.law.cornell.edu/rules/FRCP.
[8] D. Gutfreund, Y. Katz, and N. Slonim. Automatic arguments construction–from search engine to research engine. In 2016 AAAI Fall Symposium Series, 2016.
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.
[10] S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.
[11] Lex Machina. https://lexmachina.com/ [Accessed: 27 November 2016].
[12] LexPredict. https://lexpredict.com/ [Accessed: 29 November 2016].
[13] R. E. Madsen, S. Sigurdsson, L. K. Hansen, and J. Larsen. Pruning the vocabulary for better context recognition. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 2, pages 483–488. IEEE, 2004.
[14] R. Nallapati and C. D. Manning. Legal docket-entry classification: Where machine learning stumbles. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 438–446, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
[15] M. Peterson and D. Waterman. Rule-based models of legal expertise. In C. Walters, editor, Computing Power and Legal Reasoning, pages 627–659. West Publishing Company, Minneapolis, Minnesota, 1985.
[16] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.
[17] Premonition. https://premonition.ai/ [Accessed: 27 November 2016].
[18] H. L. Roitblat, A. Kershaw, and P. Oot. Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.
[19] M. Surdeanu, R. Nallapati, G. Gregory, J. Walker, and C. Manning. Risk analysis for intellectual property litigation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Law, Pittsburgh, PA, June 6–10, 2011. ACM.
[20] Tesseract. https://en.wikipedia.org/wiki/Tesseract_(software).