Automating Judicial Document Analysis

L. Karl Branting
The MITRE Corporation
7515 Colshire Drive
McLean, VA 22102, USA
lbranting@mitre.org
In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK. Copyright © 2017 held by the authors. Copying permitted for private and academic purposes. Published at http://ceur-ws.org

ABSTRACT
Collections of documents filed in courts are potentially a rich source of information for citizens, attorneys, and courts, but courts typically lack the ability to interpret them automatically. This paper presents technical approaches to three applications of judicial document interpretation: detection of document filing errors; matching orders with the motions that they resolve; and predicting the outcome of routine cases. In empirical evaluations on filings from two representative large US District Courts, the highest accuracy in identifying filing errors was achieved by combining procedural context features with high information-gain lexical features; TF/IDF similarity was found to be an effective criterion for finding motions that correspond to orders; and induction over the texts of prior simple and routine decisions was found to produce a model capable of accurately predicting outcomes from case facts without any manually engineered features or factors.

1 INTRODUCTION
The transition from paper to electronic filing in national, local, and administrative courts, which began in the late 1990s, has transformed how courts operate and how judges, court staff, attorneys, and the public create, submit, and access court filings. However, despite many advances in judicial access and administration brought about by electronic filing, courts are typically unable to interpret the contents of court filings automatically. Instead, court filings are interpreted only when they are read by an attorney, judge, or court staff member.

Machine interpretation of court filings promises a rich source of information for improving court administration and case management, access to justice, and analysis of the judiciary. The development of large-scale text analytics makes such interpretation increasingly feasible, as collections of court documents are, in effect, annotated by the metadata generated when they are submitted, by corrections when they are audited, or, for those documents that are motions or claims, by the decisions of judges or other decision makers.

However, there are numerous challenges to automating the interpretation of case filings. Courts often accept documents in the form of PDFs created from scans. Scanned PDFs require optical character recognition (OCR) for text extraction, but this process introduces many errors and does not preserve the document layout, which contains important information about the relationships among text segments in the document. Moreover, the language of court filings is complex and specialized, and the function of a court filing depends not just on its text and format, but also on its procedural context. As a result, successful automation of court filings requires overcoming a combination of technical challenges.

This paper describes the nature of court dockets and databases, sets forth three classes of representative judicial document analysis tasks–docket error detection, order/motion matching, and decision prediction–proposes technical approaches to each of the tasks, and presents preliminary empirical evaluations of the effectiveness of each approach.

2 COURT DOCKETS AND DATABASES
A court docket is a register of document-triggered litigation events, where a litigation event consists of either (1) a pleading, motion, or letter from a litigant, (2) an order, judgment, or other action by a judge, or (3) a record of an administrative action (such as notifying an attorney of a filing error) by a member of the court staff. Each docket event in a typical electronic case management system includes (1) metadata generated at the time of filing, including both case-specific data (e.g., case number, parties, judge) and event-specific data (e.g., the attorney submitting the document, the intended document type) and (2) a text document in PDF format. Each of the two court databases in which the experiments described below were performed contained filings for over 400,000 cases involving over 1,000,000 litigants, attorneys, and judges, over 10,000,000 docket entries, and more than 4,000,000 documents.
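For concreteness, the sketch below renders one docket event as a small data structure combining the two kinds of metadata with the submitted document. The field names are illustrative assumptions for this paper's description, not the schema of any actual case management system.

    # A minimal sketch of a docket event, assuming hypothetical field names.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DocketEvent:
        case_number: str          # case-specific metadata
        parties: list[str]
        judge: str
        filer: str                # event-specific metadata: who submitted it
        asserted_event_type: str  # document type selected by the filer
        is_attachment: bool       # filed as attachment vs. main document
        pdf_path: str             # the submitted document itself
        ocr_text: Optional[str] = None  # extracted later, e.g., via Tesseract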
3 DOCKET ERROR DETECTION
There are many kinds of docket errors, including defects in a submitted document (e.g., missing signature, sensitive information in an unsealed document, missing case caption) and mismatches between the content of a document and the context of the case (e.g., wrong parties, case number, or judge; mismatch between the document title and the document type asserted by the user). Some errors consist of violations of a particular court's local rules and are therefore unique to that court. Other events, such as filing in a wrong case, constitute errors in any court. In either case, detection of defects at submission time could spare attorneys the embarrassment of submitting a defective document and the inconvenience and delays of refiling. For court staff, automated filing error detection could reduce the quality control (QC) auditing staff required for filing errors, a significant drain on resources in many courts.

3.1 Error Detection through Text Classification
In the court in which the first set of experiments was conducted, the QC staff review filings to detect a variety of docket errors, including the following four error types:
    • Event-type errors, i.e., specifying the wrong event type for a document, e.g., submitting a Motion for Summary Judgment as a Counterclaim. In experiments involving this court, there were 20 event types, such as complaint, transfer, notice, order, service, etc.
    • Main-vs-attachment errors, i.e., filing a document that should be filed as an attachment to another document (such as an exhibit) as a main document, or filing a document that should be filed as a main document (such as a Memorandum in Support of a Motion for Summary Judgment) as an attachment.
    • Show-cause order errors. In some courts, only judges are permitted to file show-cause orders; it is an error if an attorney does so.
    • Letter-motion errors. In some courts, certain routine motions can be filed as letters, but all other filings must have a formal caption. Recognizing these errors requires distinguishing letters from non-letters.

Event-type errors appear to be the most common docket errors in U.S. District courts.

Each of these filing errors can be detected by classifying a document with respect to the corresponding set of categories (event type, main vs. attachment, show-cause order vs. non-show-cause order, or letter vs. non-letter) and evaluating whether the category is consistent with the metadata generated in the docket system by the filer's selections. Event-type document classification is particularly challenging because document types are both numerous and skewed, having a roughly power-law frequency distribution in the test set.
The first set of experiments attempted to identify each of the four docket errors above by classifying document text and determining whether there is a conflict between the apparent text category and the document's metadata. Classification was performed with the LingPipe (http://alias-i.com/lingpipe/) LMClassifier, which performs joint probability-based classification of token sequences into non-overlapping categories based on language models for each category and a multivariate distribution over categories.
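The experiments used the LingPipe LMClassifier, a Java library; the sketch below substitutes a generic scikit-learn text classifier (an assumption, not the paper's implementation) to make the classify-then-compare error test concrete.

    # Sketch of error detection as classify-then-compare, assuming
    # scikit-learn in place of the LingPipe LMClassifier used in the paper.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def train_event_type_model(texts, event_types):
        """Train a text classifier over character n-grams (somewhat robust
        to OCR noise)."""
        model = make_pipeline(
            TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
            LinearSVC(),
        )
        model.fit(texts, event_types)
        return model

    def flag_filing_error(model, ocr_text, asserted_event_type):
        """Flag a potential docket error when the apparent category of the
        document text conflicts with the event type asserted by the filer."""
        predicted = model.predict([ocr_text])[0]
        return predicted != asserted_event_type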

3.1.1 Term Selection and Document Truncation. Court filings can be thought of as comprising four distinct sets of terms:

    • Procedural words, which describe the intended legal function of the document (e.g., "complaint," "amended," "counsel")
    • "Stop-words," which are non-content common words, such as "of" and "the"
    • Words unique to the case, such as names, and words expressing the narrative events giving rise to the case; and
    • Substantive (as opposed to procedural) legal terms (e.g., "reasonable care," "intent," "battery").

Terms in the first of these sets–procedural words–carry the most information about the type of the document. These words tend to be concentrated around the beginning of legal documents, often in the case caption, and at the end, where distinctive phrases like "so ordered" may occur.

We hypothesized that only procedural terms are relevant to the type of a document, so we explored approaches to filtering non-procedural terms. Elimination of irrelevant terms can not only speed execution, but in some cases has been shown to increase accuracy [13].

Three approaches to term selection were investigated: two ad hoc and domain-specific, and one general and domain-independent. The first approach was to eliminate all terms except non-stopwords that occur in the Federal Rules of Civil Procedure [7]. A related alternative approach was to remove all terms except non-stopwords occurring in the "event" (i.e., document) descriptions typed by filers when they submit into the docket system. The third approach was to select terms based on their mutual information with each particular text category [3]. The first lexical set, termed FRCP, contains 2,658 terms; the second, termed event, consists of 513 terms. Separate mutual-information sets were created for each classification task, reflecting the fact that the information gain from a term depends on the category distribution of the documents. For example, Figure 1 shows the 10 highest-information terms for three different classification tasks: event-type classification, distinguishing letters from non-letters, and show-cause order detection, illustrating that the most informative terms differ widely depending on the classification task.

Figure 1: The information gain of the 10 highest-information terms for 3 legal-document classification tasks.
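The mutual-information approach, together with the prefix truncation described below, can be sketched as follows. The sketch treats each term as a binary presence/absence feature per document, which is an assumption about the computation rather than a detail reported in the paper.

    # Sketch of information-gain term selection and prefix truncation.
    import math
    from collections import Counter

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c)

    def information_gain(docs, labels):
        """Score each term by the reduction in category entropy obtained
        by observing the term's presence or absence."""
        n = len(docs)
        base = entropy(list(Counter(labels).values()))
        doc_terms = [set(d.split()) for d in docs]
        gains = {}
        for term in set().union(*doc_terms):
            with_t = [lab for ts, lab in zip(doc_terms, labels) if term in ts]
            without_t = [lab for ts, lab in zip(doc_terms, labels)
                         if term not in ts]
            cond = (len(with_t) / n) * entropy(list(Counter(with_t).values())) \
                 + (len(without_t) / n) * entropy(list(Counter(without_t).values()))
            gains[term] = base - cond
        return gains

    def reduce_vocabulary(text, gains, threshold, prefix_len=50):
        """Truncate to a token prefix, then keep only terms whose
        information gain exceeds a task-specific threshold (cf. Table 1)."""
        tokens = text.split()[:prefix_len]
        return " ".join(t for t in tokens if gains.get(t, 0.0) > threshold)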
Table 1: Thresholds and sizes of the large and small high information-gain term sets (number of terms in parentheses).

                showcause       main_attch      types        letter
    ig_small    0.01 (135)      0.025 (262)     0.1 (221)    0.0005 (246)
    ig_large    0.0025 (406)    0.0125 (914)    0.05 (689)   0.00001 (390)
Figure 2 illustrates the reduction of full document text to just high information-gain terms, which typifies the vocabulary-reduction process.

Figure 2: Reduction of a full document to just high information-gain terms.

Several approaches to document truncation were explored as well. The first was to limit the text to the first l tokens of the document (i.e., excise the remainder of the document). If l is sufficiently large, this is equivalent to including the entire document. A second option is to include the last l tokens of the suffix as well as the prefix.

3.1.2 Evaluation of Alternative Term Reduction Approaches. Two different information-gain thresholds were tested for each classification type, intended to create one small set of very-high-information terms (ig_small) and a larger set created using a lower threshold (ig_large). The thresholds and sizes of the large and small high information-gain term sets are set forth in Table 1. The text of each document was obtained by OCR using the open-source program Tesseract [20]. Each text was normalized by removing non-ASCII characters and standardizing case prior to term selection, if any.

Figure 3 shows a comparison of four vocabulary alternatives on the four text classification tasks described above. These tests measured mean f-measure in 8-fold cross-validation using a 4-gram language model, 50-token prefix length, and no suffix. In the baseline vocabulary set, termed normalize, non-ASCII characters, numbers, and punctuation were removed and tokens were lower-cased. The results show that classification accuracy using an unreduced vocabulary was significantly lower than the best reduced-vocabulary performance for show-cause order detection and type classification. Term selection had little effect on accuracy for the letter and main-vs-attachment detection tasks. No reduced-vocabulary set consistently outperformed the others. This indicates that restricted term sets derived through information gain perform roughly as well as those produced using domain-specific information, suggesting that the reduced-vocabulary approach is appropriate for situations in which domain-specific term information is unavailable.

Figure 3: Classification accuracy as a function of reduced vocabulary (8-fold cross validation using a 4-gram language model, 50-token prefix length, and no suffix).

Summarizing over the tests, the highest mean f-measure based on text classification alone, and the particular combination of parameters that led to this accuracy for each classification task, were as follows:

    (1) Event type: 0.743 (prefix=50, 4-gram, ig_large vocabulary, 20 categories)
    (2) Main-vs-attachment: 0.871 (prefix=256, 6-gram, event vocabulary)
    (3) Show-cause order: 0.957 (prefix=50, 5-gram, ig_small vocabulary)
    (4) Letter-vs-non-letter: 0.889 (prefix=50, no suffix, 4-gram, ig_large vocabulary)

3.2 Incorporating Procedural Context Features
The accuracy of event-type detection (f-measure of roughly 0.743 under the best combinations of parameters) is sufficiently low that its utility for many auditing functions may be limited. An analysis of the classification errors produced by the event-type text classification model indicated that a document's event type depends not just on the text of the document but also on its procedural context. For example, motions and orders are sometimes extremely similar because judges grant a motion by adding and signing an order stamp on the motion. Since stamps and signatures are seldom accurately OCR'd, the motion and order may be indistinguishable by the text alone under these circumstances. However, orders can be issued only by a judge, and judges never file motions, so the two cases can be distinguished by knowing the filer. In addition, attachments have the same event type as the main document in CM/ECF. So, for example, a memorandum of law is ordinarily a main document, but in some courts a memorandum can be filed as an attachment, in which case its event type is the same as that of the main document to which it is attached.

Contextual information potentially relevant to a document's type includes: whether it was filed as a main document or as an attachment; the filer (e.g., attorney, clerk, judge); the type of the case (e.g., criminal, civil, multi-district); and the document length (e.g., memoranda are typically long; minute orders typically short). Combining these non-lexical features with text features requires a different classifier than the language-model classifier used in the first set of experiments.
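One way to combine the two feature types is sketched below with scikit-learn (the experiments themselves used WEKA, so the pipeline details here are assumptions, not the paper's configuration). Lexical n-grams from the document prefix and one-hot-encoded context features are concatenated, the highest mutual-information features are kept, and a linear SVM is trained on the result.

    # Sketch of combining lexical n-grams with procedural context features.
    from sklearn.compose import ColumnTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.svm import LinearSVC

    event_type_model = Pipeline([
        ("features", ColumnTransformer([
            # lexical features: n-grams over the normalized document prefix
            ("text", CountVectorizer(ngram_range=(1, 2), min_df=32),
             "prefix_text"),
            # procedural context: filer role, main vs. attachment, case type
            ("context", OneHotEncoder(handle_unknown="ignore"),
             ["filer", "is_attachment", "case_type"]),
        ])),
        # keep the k highest mutual-information features (4,000 was best in
        # Figure 4; k must not exceed the number of generated features)
        ("select", SelectKBest(mutual_info_classif, k=4000)),
        ("svm", LinearSVC()),
    ])
    # usage, where df is a pandas DataFrame with the columns named above:
    # event_type_model.fit(df[["prefix_text", "filer", "is_attachment",
    #                          "case_type"]], df["event_type"])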
We compared the performance of Support Vector Machine (SVM) learning (WEKA's implementation of Platt's algorithm for sequential minimal optimization [10, 16]) and Random Forests [5], both in the WEKA [9] implementation, on the task of filing-event classification. For each filing event, the document text was normalized by filtering stop words, normalizing dates and numbers to standard tokens, and replacing each instance of a party name with the role of that party (e.g., DFT, PTF). The result was combined with the contextual features and converted into a sparse n-gram frequency vector from which the {1,2,4,8} thousand highest information-gain features were selected (unsurprisingly, the contextual features always had higher information gain than any lexical feature). The training set consisted of 28,763 main documents having 43 distinct types, representing 2 months' filings in a large US District court.

Figure 4: Event type classification accuracy as a function of reduced vocabulary (10-fold cross validation using a 2-gram language model, normalization of dates, numbers, and parties, 100-token prefix length, and minimum token frequency of 32, with 43 event types).

As shown in Figure 4, the SVM was consistently more accurate, with little variation in accuracy as a function of the number of features, although accuracy was slightly higher at 4,000 features than at other feature-set sizes. By contrast, the accuracy of the random forest diminished with increasing numbers of features. The highest-accuracy SVM configuration, f-measure of 0.926, was much higher than the maximum observed with text-only classification (albeit in a different court). This suggests that including procedural context features is essential for accurate document filing-type identification in judicial databases, and that algorithms that can handle both textual and categorical features should be used for this task.

4 ORDER/MOTION MATCHING
In many federal courts, docket clerks are responsible for filing orders executed by judges into the docket system, a process that requires the clerk to identify all pending motions to which the order responds and to link the order to those motions. This entails reading all pending motions, a tedious task. If the motions corresponding to an order could be identified automatically, docket clerks would be relieved of this laborious task. Even ranking the motions by their likelihood of being resolved by a given order would decrease the burden on docket clerks. Moreover, order/motion matching is a subtask of a more general issue-chaining problem, which consists of identifying the sequence of preceding and subsequent documents relevant to a given document.

A straightforward approach to this task is to treat order/motion matching as an information-retrieval task, under the hypothesis that an order is likely to have a higher degree of similarity to its corresponding motions than to motions that it does not rule on. An obvious approach is to present pending motions to the clerk in rank order of their TF/IDF (Term Frequency/Inverse Document Frequency) weighted cosine similarity to the order.

The evaluation above showing that term selection improves document classification raises the question whether term selection might be beneficial for order/motion matching as well. A second question is whether the IDF model should be trained on an entire corpus of motions and orders or whether acceptable accuracy can be obtained by training just on the order and pending motions.
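The local variant of this ranking can be sketched as follows; the function name and use of scikit-learn are illustrative assumptions. Fitting the vectorizer on only the order and its pending motions corresponds to the local IDF condition evaluated below.

    # Sketch of order/motion matching by TF/IDF cosine similarity,
    # with IDF fit locally on just the order and its pending motions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def rank_pending_motions(order_text, motion_texts):
        """Return indices of pending motions, most similar to the order first."""
        vectorizer = TfidfVectorizer()  # local IDF: fit on this group only
        vectors = vectorizer.fit_transform([order_text] + motion_texts)
        sims = cosine_similarity(vectors[0], vectors[1:]).ravel()
        return sorted(range(len(motion_texts)), key=lambda i: -sims[i])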
To evaluate the effectiveness of this approach to order/motion matching, a subset of the document set described above was collected consisting of 3,356 groups, each comprising (1) an order, (2) a motion that the order rules on (a triggering motion), and (3) a non-empty set of all motions that were pending at the time of the order but not ruled on by the order (non-triggering motions). The mean number of motions per group was 5.87 (i.e., there were on average 4.87 non-triggering motions). For each group, all motions were ranked by similarity to the order under the given metric. The proportion of triggering motions that were ranked first and the mean rank of the triggering motion were calculated from each group's ranking.

These groups were evaluated using three term-selection approaches: the raw document text (which often contains many OCR errors); normalization, as described above; and event terms. The two alternative TF/IDF training models were applied to each of the three term-selection approaches, for a total of 6 combinations. For each combination, the mean rank of the triggering motion among all the motions was determined.

Figure 5: The proportion of groups for which the order is more similar to the triggering motion than to any other motion.

Figure 5 shows that the highest accuracy, as measured by the proportion of triggering motions that were ranked first among all pending motions, was achieved by normalizing the text without term selection. Intuitively, reduction to procedurally relevant terms improves the ability to determine what docket event a document performs, but can reduce the ability to discern the similarity between corresponding pairs of documents. TF/IDF training on just the order and pending motions (local) is at least as accurate as training over all orders and motions (all).

Figure 6: The mean rank of the triggering motion among all pending motions, zero-indexed (lower is better, zero is perfect).

Figure 6 shows the mean rank (zero-indexed) of the most similar motion under each of the six conditions. The best (lowest) mean rank was achieved with normalization and local TF/IDF training.

It is not unusual for a single order to rule on multiple pending motions. A more realistic assessment of the utility of pending-motion ranking is therefore to determine how many non-triggering motions a clerk would have to consider if the clerk read each motion in rank order until every motion ruled on by the order is found. One way to express this quantity is as mean precision at 100% recall. In the test set described above, using text normalization and local TF/IDF training, mean precision at 100% recall was 0.83, indicating that the number of motions that a clerk would have to read was significantly reduced.
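This metric can be computed as sketched below: for each group, walk down the ranking until every triggering motion has been found, and take precision at that depth; the function names are illustrative assumptions.

    # Sketch of the evaluation metric: precision at the ranking depth needed
    # to recover every motion the order actually rules on, averaged over groups.
    def precision_at_full_recall(ranked_ids, triggering_ids):
        """ranked_ids: motion ids in similarity order; triggering_ids: ids
        the order rules on. Precision at the depth of the last triggering hit."""
        remaining = set(triggering_ids)
        for depth, motion_id in enumerate(ranked_ids, start=1):
            remaining.discard(motion_id)
            if not remaining:
                return len(triggering_ids) / depth
        return 0.0  # a triggering motion was never ranked (degenerate case)

    def mean_precision_at_full_recall(groups):
        """groups: iterable of (ranked_ids, triggering_ids) pairs."""
        scores = [precision_at_full_recall(r, t) for r, t in groups]
        return sum(scores) / len(scores)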
5 DECISION PREDICTION
Predictive models of decision making could be useful to pro se litigants (for help in understanding the strength of a case), to attorneys (for help in making strategic litigation decisions), and for training and decision support for judges and other decision makers. Even if the accuracy of predictive models were only approximate, they could nevertheless be valuable for decision support by helping to identify the most relevant words, phrases, or other features of a case record and the most relevant previous decisions.

Highly accurate predictive models would require very detailed linguistic analysis of the text of case records and decisions, including argument structure, narrative analysis, etc. [8]. However, predictive models induced from simpler lexical features may be sufficiently accurate to be useful for the tasks listed above. Inducing such models can be cast as supervised concept learning over corpora of case records and decisions, where each decision is treated as a category label for the corresponding case record. This approach is feasible only for simple and routine cases for which it is possible to enumerate a small set of category labels, such as granting or denying a specific benefit or form of relief. However, such simple and routine cases are characteristic of many forms of administrative adjudication, such as immigration status and benefits entitlement.

Unfortunately, in many simple and routine administrative domains, only the decisions themselves, but not the underlying case records, are available. However, in such cases, the statement of facts in the decision can be used as a proxy for the contents of the corresponding case record. This approach was applied to decisions of the European Court of Human Rights in [1], which found that case outcomes could be predicted to some degree from statements of fact. The predictability of case outcomes from the statement of facts in the decision document doesn't conclusively demonstrate that the outcome would be equally predictable from the raw case record; the decision maker's description of the facts may have been tailored to fit the outcome. However, a demonstration that case outcomes can be predicted to some extent by models trained from fact statements alone may encourage courts and agencies to experiment with this approach to creating decision-support tools for pro se litigants and decision makers.

Accordingly, an experiment was performed to evaluate the feasibility of predicting decisions from the fact statements of cases in a representative domain: World Intellectual Property Organization (WIPO) domain name decisions (http://www.wipo.int/amc/en/domains/decisions.html). Domain name decisions resolve disputes between a domain name registrant and a third party under the Uniform Domain Name Dispute Resolution Policy (UDRP, https://www.icann.org/resources/pages/policy-2012-02-25-en). The UDRP Administrative Procedure applies to disputes concerning an alleged abusive registration of a domain name under the following criteria:

    • The domain name registered by the domain name registrant is identical or confusingly similar to a trademark or service mark in which the complainant (the person or entity bringing the complaint) has rights; and
    • The domain name registrant has no rights or legitimate interests in respect of the domain name in question; and
    • The domain name has been registered and is being used in bad faith.

WIPO decisions have a very consistent structure, including sections for History, Background, Contentions, Findings, and Decision. Just two distinct decisions are possible: transferring the domain name or denying the complaint. As a result, the decisions are well suited to the supervised concept-learning approach described above.

5.1 Experimental Design
Six thousand six hundred WIPO decisions were downloaded and parsed into the five sections described above. Each decision was labeled TRUE or FALSE based on whether the decision transferred the domain name (TRUE) or denied the claim (FALSE). The resulting set of cases had significant class skew, with 6,000 instances of TRUE but only 500 instances of FALSE. For a preliminary study, 500 TRUE instances were randomly subsampled to create a balanced test set with 500 instances of each category.
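The parsing and labeling steps might be sketched as follows. The heading strings below are taken from the paper's description of the section names; the exact surface forms in real WIPO decisions (and the keyword test for the outcome) are assumptions, not verified against the corpus.

    # Sketch of splitting a WIPO decision into sections and labeling it.
    import re

    SECTIONS = ["History", "Background", "Contentions", "Findings", "Decision"]

    def split_sections(decision_text):
        """Split a decision into its five sections, assuming each section
        starts on a line containing its (possibly numbered) heading."""
        names = "|".join(SECTIONS)
        parts = re.split(rf"^\s*(?:\d+\.\s*)?({names})\b.*$", decision_text,
                         flags=re.MULTILINE)
        # re.split yields [preamble, name1, body1, name2, body2, ...]
        return {name: body.strip()
                for name, body in zip(parts[1::2], parts[2::2])}

    def transfer_label(decision_text):
        """TRUE if the Decision section orders transfer of the domain name,
        FALSE if the complaint was denied (keyword test is an assumption)."""
        decision = split_sections(decision_text).get("Decision", "")
        return "transferred" in decision.lower()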
This balanced set of cases was converted into a series of test sets differing in which sections were included as the text of each instance. The sections tested were as follows:

    • History
    • Background
    • Contentions
    • The concatenation of History, Background, and Contentions
    • Findings

The text of each instance was normalized by standardizing case and removing punctuation and, in addition, either (1) removing stop words, or (2) retaining stop words but replacing dates and numbers with standard tokens ("NUMBER" or "DATE"). The test condition in which the text consists of Findings is included for completeness, although it is not a good proxy for the case record, as it contains conclusions about the facts.

For each selection of case sections and standardization method, the text was converted into n-gram frequency vectors for n=1–4, with only those n-grams retained that occur at least 8 times. The result was converted into sparse ARFF format (http://www.cs.waikato.ac.nz/ml/weka/arff.html), loaded into Weka, and evaluated in 10-fold cross-validation using Weka's implementation of a support vector machine (SVM) with sequential minimal optimization.
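The experimental pipeline can be sketched as follows, assuming scikit-learn in place of Weka (an assumption; the paper's runs used Weka's SMO implementation): 1–4-gram counts with a minimum frequency of 8, a linear SVM, and mean f-measure under 10-fold cross-validation.

    # Sketch of the outcome-prediction experiment with scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def evaluate(section_texts, outcomes):
        """section_texts: one string per case (e.g., the HBC concatenation);
        outcomes: True for transfer, False for denial."""
        model = make_pipeline(
            CountVectorizer(ngram_range=(1, 4), min_df=8),
            LinearSVC(),
        )
        return cross_val_score(model, section_texts, outcomes,
                               cv=10, scoring="f1").mean()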
                                                                          with the outcome of the case. One may view high information-gain
Table 2: Mean f-measure in ten-fold cross-validation with Sup-            phrases as being similar to the factors in [2] with the difference that
port Vector Machine prediction of WIPO case outcomes. The                 they are induced automatically rather than being crafted manually.
text of each instance consists of the History (H), Background             The SVM decision surface represents the set of tradeoffs among
(B), Contentions (C), all three (HBC) or Findings (F) section.            these factors that is most consistent with the training data, in a man-
                                                                          ner reminiscent of [6], but without the necessity of domain-specific
                          H          B            C       HBC     F       hand-engineered factors.
        stopwords         0.943      0.758        0.822   0.950   0.955      WIPO domain name dispute cases may be particularly conducive
        nums/dates        0.902      0.750        0.813   0.948   0.960   to predictive modeling owing to their binary outcomes and relatively
                                                                          stereotypical fact patterns. This experiment does not address the
                                                                          differences between the case record and the facts as summarized in
                                                                          the decision document, and the evaluation above artificially dimin-
5.2     Experimental Results                                              ished the effect of class skew by subsampling to produce a balanced
As set forth in Table 2 and Figure 7, the greatest predictive accuracy    test set. Nevertheless, the impressive accuracy of a predictive model
was achieved by the combination of the History, Background, and           trained on raw text without any feature design or knowledge en-
Contentions sections of each case (HBC). The predictive accuracy          gineering suggests that this approach may have great promise for
from these three sections, f-measure of roughly 0.95, was almost as       increasing access to justice for pro se litigants and improving train-
high as the accuracy of prediction based on the text of the Findings      ing and decision support for decision makers in domains with many
section.                                                                  routine adjudications.
   To understand why the HBC text is so predictive, it is helpful
to examine the terms with the highest mutual information with the         6   RELATED WORK
concept to be predicted, some of which are shown in Figure 8. This        The history of applying text classification techniques to legal doc-
5 http://www.cs.waikato.ac.nz/ml/weka/arff.html                           uments dates back at least to the 1970s [4]. Text classification has
6 RELATED WORK
The history of applying text classification techniques to legal documents dates back at least to the 1970s [4]. Text classification has been recognized as of particular importance for electronic discovery [18]. Little prior work has addressed classification of docket entries other than Nallapati and Manning [14], which achieved an f-measure of 0.8967 in distinguishing Orders to Show Cause from other document types using a hand-engineered feature set.

There is extensive current activity in predictive models trained on factors unrelated to the merits of the case, such as the nature of suit, attorneys, forum, judge, and parties [19]. Recent startups marketing predictive models for litigation support based on non-merits-based factors include Lex Machina [11], LexPredict [12], and Premonition [17]. The insurance industry has a long history of developing decision prediction based on the merits of a claim, but these models are typically manually constructed, e.g., [15]. Outcome prediction based on the merits of the case as extracted directly from raw text is a relatively new research area, with little work outside of [1].
7 SUMMARY AND FUTURE WORK
Judicial document collections contain a rich trove of potential information, but analyzing these documents presents many challenges. This paper has demonstrated how many types of filing-error detection can be formulated as text classification problems. The highest accuracy was obtained by combining lexical features that characterize the document itself with procedural context features that indicate the role that the document is intended to play. These results demonstrate the feasibility of automating portions of the process of auditing court submissions, which could significantly reduce a persistent drain on court resources.

The experiment with order/motion matching demonstrates that while term selection may improve accuracy for document classification, it can decrease accuracy for tasks that involve matching based on overall similarity rather than procedural similarity.

The demonstration of outcome prediction in WIPO decisions illustrates that for case corpora with a limited set of possible outcomes and relatively stereotypical fact patterns, decision models of impressive accuracy can be induced without hand-engineered features, simply from the fact descriptions. This approach may be particularly promising for decision support and improved access to justice in the simpler and more routine end of the judicial spectrum.

No single technology is applicable to all judicial documents, nor is any approach sufficient for all document analysis tasks. However, each addition to this suite of technologies adds to the capabilities available to the courts, government agencies, and citizens to exploit the deep well of information latent in judicial document corpora.

ACKNOWLEDGMENT
The MITRE Corporation is a not-for-profit Federally Funded Re-
search and Development Center chartered in the public interest. This
document is approved for Public Release, Distribution Unlimited,
Case Number 17-0362. ©2017 The MITRE Corporation. All rights
reserved.

REFERENCES
[1] N. Aletras, D. Tsarapatsanis, D. Preotiuc-Pietro, and V. Lampos. Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective. PeerJ Computer Science, October 24, 2016. https://peerj.com/articles/cs-93/.
[2] V. Aleven and K. Ashley. Doing things with factors. In Proceedings of the Third European Workshop on Case-Based Reasoning (EWCR-96), pages 76–90, Lausanne, Switzerland, November 1996.
[3] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, July 1994.
[4] J. Boreham and B. Niblett. Classification of legal texts by computer. Information Processing & Management, 12(2):125–132, 1976.
[5] L. Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001.
[6] S. Brüninghaus and K. Ashley. Generating legal arguments and predictions from case texts. In Proceedings of the Tenth International Conference on Artificial Intelligence & Law (ICAIL-05), pages 65–74, Bologna, Italy, June 6–11, 2005.
[7] Legal Information Institute, Cornell University Law School. The Federal Rules of Civil Procedure. https://www.law.cornell.edu/rules/FRCP.
[8] D. Gutfreund, Y. Katz, and N. Slonim. Automatic arguments construction–from search engine to research engine. In 2016 AAAI Fall Symposium Series, 2016.
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11(1), 2009.
[10] S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3):637–649, 2001.
[11] Lex Machina. https://lexmachina.com/ [Accessed: 27 November 2016].
[12] LexPredict. https://lexpredict.com/ [Accessed: 29 November 2016].
[13] R. E. Madsen, S. Sigurdsson, L. K. Hansen, and J. Larsen. Pruning the vocabulary for better context recognition. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), volume 2, pages 483–488. IEEE, 2004.
[14] R. Nallapati and C. D. Manning. Legal docket-entry classification: Where machine learning stumbles. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 438–446, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
[15] M. Peterson and D. Waterman. Rule-based models of legal expertise. In C. Walters, editor, Computing Power and Legal Reasoning, pages 627–659. West Publishing Company, Minneapolis, Minnesota, 1985.
[16] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.
[17] Premonition. https://premonition.ai/ [Accessed: 27 November 2016].
[18] H. L. Roitblat, A. Kershaw, and P. Oot. Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.
[19] M. Surdeanu, R. Nallapati, G. Gregory, J. Walker, and C. Manning. Risk analysis for intellectual property litigation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Law, Pittsburgh, PA, June 6–10, 2011. ACM.
[20] Tesseract. https://en.wikipedia.org/wiki/Tesseract_(software).