<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automating Judicial Document Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>L. Karl Branting</string-name>
          <aff>The MITRE Corporation, Colshire Drive, McLean, USA</aff>
          <email>lbranting@mitre.org</email>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>16</volume>
      <issue>2017</issue>
      <abstract>
        <p>Collections of documents filed in courts are potentially a rich source of information for citizens, attorneys, and courts, but courts typically lack the ability to interpret them automatically. This paper presents technical approaches to three applications of judicial document interpretation: detection of document filing errors; matching orders with the motions that they resolve; and predicting the outcome of routine cases. In empirical evaluations on filings from two representative large US District Courts, the highest accuracy in identifying filing errors was achieved by combining procedural context features with high information-gain lexical features; TF/IDF similarity was found to be an effective criterion for finding motions that correspond to orders; and induction over the texts of prior simple and routine decisions was found to produce a model capable of accurately predicting outcomes from case facts without any manually engineered features or factors.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The transition from paper to electronic filing in national, local, and
administrative courts, which began in the late 1990s, has transformed
how courts operate and how judges, court staff, attorneys, and the
public create, submit, and access court filings. However, despite
many advances in judicial access and administration brought about
by electronic filing, courts are typically unable to interpret the
contents of court filings automatically. Instead, court filings are
interpreted only when they are read by an attorney, judge, or court staff
member.</p>
      <p>Machine interpretation of court filings promises a rich source
of information for improving court administration and case
management, access to justice, and analysis of the judiciary. The
development of large-scale text analytics makes such interpretation
increasingly feasible, as collections of court documents are, in
effect, annotated by the metadata generated when they are submitted,
by corrections when they are audited, or, for those documents that
are motions or claims, by the decisions of judges or other decision
makers.</p>
      <p>However, there are numerous challenges to automating the
interpretation of case filings. Courts often accept documents in the
form of PDFs created from scans. Scanned PDFs require optical
character recognition (OCR) for text extraction, but this process
introduces many errors and does not preserve the document layout,
which contains important information about the relationships among
text segments in the document. Moreover, the language of court
filings is complex and specialized, and the function of a court filing
depends not just on its text and format, but also on its procedural
context. As a result, successful automation of court filings requires
overcoming a combination of technical challenges.</p>
      <p>This paper describes the nature of court dockets and databases,
sets forth three classes of representative judicial document analysis
tasks (docket error detection, order/motion matching, and decision
prediction), proposes technical approaches to each of the tasks, and
presents preliminary empirical evaluations of the effectiveness of
each approach.</p>
    </sec>
    <sec id="sec-2">
      <title>COURT DOCKETS AND DATABASES</title>
      <p>A court docket is a register of document-triggered litigation events,
where a litigation event consists of either (1) a pleading, motion, or
letter from a litigant, (2) an order, judgment, or other action by a
judge, or (3) a record of an administrative action (such as notifying an
attorney of a filing error) by a member of the court staff. Each docket
event in a typical electronic case management system includes (1)
metadata generated at the time of filing, including both case-specific
data (e.g., case number, parties, judge) and event-specific data (e.g.,
the attorney submitting the document, the intended document type)
and (2) a text document in PDF format. Each of the two court
databases in which the experiments described below were performed
contained filings for over 400,000 cases involving over 1,000,000
litigants, attorneys, and judges, over 10,000,000 docket entries, and
more than 4,000,000 documents.</p>
    </sec>
    <sec id="sec-3">
      <title>DOCKET ERROR DETECTION</title>
      <p>There are many kinds of docket errors, including defects in a
submitted document (e.g., missing signature, sensitive information in an
unsealed document, missing case caption) and mismatches between
the content of a document and the context of the case (e.g., wrong
parties, case number, or judge; mismatch between the document title
and the document type asserted by the user). Some errors consist of
violations of a particular court’s local rules and are therefore unique
to that court. Other events, such as filing in a wrong case, constitute
errors in any court. In either case, detection of defects at submission
time could spare attorneys the embarrassment of submitting a
defective document and the inconvenience and delays of refiling. For
court staff, automated filing error detection could reduce the quality
control (QC) auditing staff required for filing errors, a significant
drain on resources in many courts.</p>
    </sec>
    <sec id="sec-4">
      <title>Error Detection through Text Classification</title>
      <p>In the court in which the first set of experiments was conducted, the
QC staff review filings to detect a variety of docket errors, including
the following four error types:
• Event-type errors, i.e., specifying the wrong event type
for a document, e.g., submitting a Motion for Summary
Judgment as a Counterclaim. In experiments involving this
court, there were 20 event types, such as complaint, transfer,
notice, order, service, etc.
• Main-vs-attachment errors, i.e., filing as a main document
a document, such as an exhibit, that should be filed as an
attachment to another document, or filing as an attachment
a document, such as a Memorandum in Support of a Motion
for Summary Judgment, that should be filed as a main
document.
• Show-cause order errors. In some courts, only judges are
permitted to file show-cause orders; it is an error if an
attorney does so.
• Letter-motion errors. In some courts, certain routine
motions can be filed as letters, but all other filings must have
a formal caption. Recognizing these errors requires
distinguishing letters from non-letters.</p>
      <p>Event-type errors appear to be the most common docket errors in
U.S. District courts.</p>
      <p>Each of these filing errors can be detected by classifying a
document with respect to the corresponding set of categories (event
type, main vs. attachment, show-cause order vs. non-show-cause
order, or letter vs. non-letter) and evaluating whether the category is
consistent with the metadata generated in the docket system by the
filer’s selections. Event-type document classification is particularly
challenging because document types are both numerous and
skewed, having a roughly power-law frequency distribution in the
test set.</p>
      <p>The first set of experiments attempted to identify each of the four
docket errors above by classifying document text and determining
whether there is a conflict between the apparent text category and
the document’s metadata. Classification was performed with the
LingPipe<sup>1</sup> LMClassifier, which performs joint probability-based
classification of token sequences into non-overlapping categories based
on language models for each category and a multivariate distribution
over categories.</p>
      <p>3.1.1 Term Selection and Document Truncation. Court
filings can be thought of as comprising four distinct sets of terms:
• Procedural words, which describe the intended legal
function of the document (e.g., “complaint,” “amended,”
“counsel”)
• “Stop-words,” which are non-content common words, such
as “of” and “the”
• Words unique to the case, such as names, and words
expressing the narrative events giving rise to the case; and
• Substantive (as opposed to procedural) legal terms (e.g.,
“reasonable care,” “intent,” “battery”).</p>
      <p>Terms in the first of these sets, procedural words, carry the most
information about the type of the document. These words tend to
be concentrated around the beginning of legal documents, often in
the case caption, and at the end, where distinctive phrases like “so
ordered” may occur.</p>
      <p><sup>1</sup>http://alias-i.com/lingpipe/</p>
      <p>
        We hypothesized that only procedural terms are relevant to the
type of a document, so we explored approaches to filtering out
non-procedural terms. Eliminating irrelevant terms can not only speed
execution but in some cases has been shown to increase accuracy
[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Three approaches to term selection were investigated: two ad hoc
and domain-specific, and one general and domain-independent. The
first approach was to eliminate all terms except non-stopwords that
occur in the Federal Rules of Civil Procedure [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A related
alternative approach was to remove all terms except for non-stopwords
occurring in “event” (i.e., document) descriptions typed by filers
when they submit documents into the docket system. The third approach was to
select terms based on their mutual information with each particular
text category [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The first lexical set, termed FRCP, contains 2658
terms; the second, termed event, consists of 513 terms. Separate
mutual-information sets were created for each classification task,
reflecting the fact that the information gain from a term depends on
the category distribution of the documents. For example, Figure 1
shows the 10 highest-information terms for three different
classification tasks: event-type classification, distinguishing letters from
non-letters, and show-cause order detection, illustrating that the most
informative terms differ widely depending on the classification task.
      </p>
      <p>Figure 2 illustrates the reduction of full document text to just high
information gain terms, which typifies the vocabulary-reduction
process.</p>
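      <p>Information-gain term selection of the kind described above can be sketched in a few lines of Python. The sketch scores each term by how much knowing its presence or absence reduces the entropy of the category label, then keeps the terms at or above a threshold. The function names, the toy documents, and the threshold value are illustrative assumptions, not the code or settings used in the experiments.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs in a document."""
    base = entropy(Counter(labels).values())
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without = [lab for doc, lab in zip(docs, labels) if term not in doc]
    cond = 0.0
    for part in (with_term, without):
        if part:
            cond += len(part) / len(labels) * entropy(Counter(part).values())
    return base - cond

def select_terms(docs, labels, threshold):
    """Return the terms whose information gain is at least `threshold`."""
    vocab = set().union(*docs)
    return {t for t in vocab if information_gain(docs, labels, t) >= threshold}

# Toy corpus: each document is a set of lower-cased tokens (hypothetical data).
docs = [
    {"motion", "for", "summary", "judgment"},
    {"motion", "to", "dismiss"},
    {"it", "is", "so", "ordered"},
    {"order", "so", "ordered"},
]
labels = ["motion", "motion", "order", "order"]
selected = select_terms(docs, labels, threshold=0.9)
print(sorted(selected))
```
]]></preformat>
      <p>In this toy corpus only terms that perfectly separate the two categories survive the threshold; a lower threshold would admit a larger vocabulary, analogous to the difference between a small and a large high-information term set.</p>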
      <p>Several approaches to document truncation were explored as well.
The first was to limit the text to the first l tokens of the document
(i.e., excise the remainder of the document). If l is sufficiently large,
this is equivalent to including the entire document. A second option
is to include the last l tokens of the document (the suffix) as well as the prefix.</p>
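      <p>The two truncation options can be illustrated with a short Python sketch. The function name and its parameters are hypothetical; the sketch assumes the document has already been tokenized.</p>
      <preformat><![CDATA[
```python
def truncate_tokens(tokens, prefix_len, suffix_len=0):
    """Keep the first `prefix_len` tokens and, optionally, the last `suffix_len`."""
    if prefix_len + suffix_len >= len(tokens):
        # The window covers the whole document, so nothing is excised.
        return list(tokens)
    prefix = tokens[:prefix_len]
    suffix = tokens[len(tokens) - suffix_len:] if suffix_len > 0 else []
    return prefix + suffix

tokens = "united states district court motion to dismiss it is so ordered".split()
prefix_only = truncate_tokens(tokens, prefix_len=4)
prefix_and_suffix = truncate_tokens(tokens, prefix_len=4, suffix_len=2)
print(prefix_only)
print(prefix_and_suffix)
```
]]></preformat>
      <p>Keeping a suffix as well as a prefix retains closing phrases such as “so ordered” that would otherwise be cut off.</p>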
      <p>
        3.1.2 Evaluation of Alternative Term Reduction Approaches.
Two different information-gain thresholds were tested for each
classification type, intended to create one small set of very-high
information terms (ig_small) and a larger set created using a lower
threshold (ig_large). The thresholds and sizes of the large and small
high information-gain term sets are set forth in Table 1. The text of
each document was obtained by OCR using the open-source program
Tesseract [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Each text was normalized by removing non-ASCII
characters and standardizing case prior to term selection, if any.
      </p>
      <p>Figure 3 shows a comparison of four vocabulary alternatives on
the four text classification tasks described above. These tests
measured mean f-measure in 8-fold cross validation using a 4-gram
language model, 50-token prefix length, and no suffix. In the baseline
vocabulary set, normalize, non-ASCII characters, numbers, and
punctuation were removed and tokens were lower-cased. The results
show that classification accuracy using an unreduced vocabulary was
significantly lower than the best reduced vocabulary performance for
show-cause order detection and type classification. Term selection
had little effect on accuracy for the letter and main vs. attachment
detection tasks. No reduced-vocabulary set consistently outperformed
the others. This indicates that restricted term sets derived through
information gain perform roughly as well as those produced using
domain-specific information, suggesting that the reduced vocabulary
approach is appropriate for situations in which domain-specific term
information is unavailable.</p>
      <p>Summarizing over the tests, the highest mean f-measure based
on text classification alone and the particular combination of
parameters that led to this accuracy for each classification task were as
follows:
(1) Event type: 0.743 (prefix=50, 4-gram, ig_large vocabulary,
20 categories)
(2) Main-vs-attachment: 0.871 (prefix=256, 6-gram, event
vocabulary)
(3) Show-cause order: 0.957 (prefix=50, 5-gram, ig_small
vocabulary)
(4) Letter-vs-non-letter: 0.889 (prefix=50, no 4-gram, ig_large
vocabulary)</p>
    </sec>
    <sec id="sec-5">
      <title>Incorporating Procedural Context Features</title>
      <p>The accuracy of event-type detection (f-measure of roughly 0.743
under the best combinations of parameters) is sufficiently low that
its utility for many auditing functions may be limited. An analysis
of the classification errors produced by the event-type text
classification model indicated that a document’s event type depends not
just on the text of the document but also on its procedural context.
For example, motions and orders are sometimes extremely similar
because judges grant a motion by adding and signing an order stamp
to the motion. Since stamps and signatures are seldom accurately
OCR’d, the motion and order may be indistinguishable by the text
alone under these circumstances. However, orders can be issued only
by a judge, and judges never file motions, so the two cases can be
distinguished by knowing the filer. In addition, attachments have the
same event type as the main document in CM/ECF. So, for example,
a memorandum of law is ordinarily a main document, but in some
courts a memorandum can be filed as an attachment, in which case
its event type is the same as that of the main document to which it is
attached.</p>
      <p>Contextual information potentially relevant to a document’s type
includes: whether it was filed as a main document or as an
attachment; the filer (e.g., attorney, clerk, judge); the type of the case (e.g.,
criminal, civil, multi-district); and the document length (e.g.,
memoranda are typically long; minute orders typically short). Combining
these non-lexical features with text features requires a different
classifier than the language-model classifier used in the first set of
experiments.</p>
      <p>
        We compared the performance of Support Vector Machine (SVM)
learning (WEKA’s implementation of Platt’s algorithm for sequential
minimal optimization [
        <xref ref-type="bibr" rid="ref10 ref16">10, 16</xref>
        ]) and Random Forests [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], both in the
WEKA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] implementation, on the task of filing event classification.
For each filing event, the document text was normalized by filtering
stop words, normalizing dates and numbers to standard tokens, and
replacing each instance of a party name with the role of that party
(e.g., DFT, PTF). The result was combined with the contextual
features and converted into a sparse n-gram frequency vector from
which the {1,2,4,8} thousand highest information gain features were
selected (unsurprisingly, the contextual features always had higher
information gain than any lexical feature). The training set consisted
of 28,763 main documents having 43 distinct types representing two
months’ filings in a large US District court.
      </p>
      <p>As shown in Figure 4, the SVM was consistently more accurate,
with little variation in accuracy as a function of the number of
features, although accuracy was slightly higher at 4,000 features
than other feature set sizes. By contrast, the accuracy of the random
forest diminished with increasing numbers of features. The f-measure
of the most accurate SVM configuration, 0.926, was much higher
than the maximum observed with text-only classification (albeit in
a different court). This suggests that including procedural context
features is essential for accurate document filing type identification
in judicial databases, and that algorithms that can handle both textual
and categorical features should be used for this task.</p>
    </sec>
    <sec id="sec-6">
      <title>ORDER/MOTION MATCHING</title>
      <p>In many federal courts, docket clerks are responsible for filing orders
executed by judges into the docket system, a process that requires the
clerk to identify all pending motions to which the order responds and
to link the order to those motions. This entails reading all pending
motions, a tedious task. If the motions corresponding to an order
could be identified automatically, docket clerks would be relieved
of this laborious task. Even ranking the motions by their likelihood
of being resolved by a given order would decrease the burden on
docket clerks. Moreover, order/motion matching is a subtask of a
more general issue-chaining problem, which consists of identifying
the sequence of preceding and subsequent documents relevant to a
given document.</p>
      <p>A straightforward approach to this task is to treat order/motion
matching as an information-retrieval task, under the hypothesis that
an order is likely to have a higher degree of similarity to its
corresponding motions than to motions that it does not rule on. An
obvious approach is to present pending motions to the clerk in rank
order of their TF/IDF<sup>2</sup>-weighted cosine similarity to the order.</p>
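      <p>A minimal Python sketch of this ranking follows. It builds TF/IDF-weighted vectors from the order and its pending motions only, then sorts the motions by cosine similarity to the order. The smoothed IDF formula, the function names, and the toy texts are assumptions made for illustration; the source does not specify the exact weighting variant used.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Build TF/IDF-weighted sparse vectors (dicts) for a list of token lists."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF (an assumed variant): common terms keep a small weight.
        vectors.append({t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                        for t, c in counts.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_motions(order_tokens, motion_token_lists):
    """Return motion indices sorted by decreasing similarity to the order.

    IDF is estimated from the order and its pending motions only.
    """
    vectors = tfidf_vectors([order_tokens] + motion_token_lists)
    order_vec, motion_vecs = vectors[0], vectors[1:]
    scores = [cosine(order_vec, m) for m in motion_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical, heavily shortened texts.
order = "order granting motion to compel production of documents".split()
motions = [
    "motion for extension of time to answer".split(),
    "motion to compel production of documents".split(),
    "motion to seal exhibit".split(),
]
ranking = rank_motions(order, motions)
print(ranking)
```
]]></preformat>
      <p>The motion to compel is ranked first because it shares the most distinctive terms with the order. Estimating IDF from a larger corpus would only require passing additional documents to the hypothetical <monospace>tfidf_vectors</monospace> function.</p>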
      <p>The evaluation above showing that term selection improves
document classification raises the question whether term selection might
be beneficial for order/motion matching as well. A second question
is whether the IDF model should be trained on an entire corpus of
motions and orders or whether acceptable accuracy can be obtained
by training just on the order and pending motions.</p>
      <p>To evaluate the effectiveness of this approach to order/motion
matching, a subset of the document set described above was collected
consisting of 3,356 groups, each comprising (1) an order, (2) a
motion that the order rules on (a triggering motion), and (3) a
nonempty set of all motions that were pending at the time of the order but
not ruled on by the order (non-triggering motions). The mean number
of motions per group was 5.87 (i.e., there were on average 4.87
non-triggering motions). For each group, all motions were ranked
by similarity to the order under the given metric. The proportion
of triggering motions that were ranked first and mean rank of the
triggering motion were calculated from each group’s ranking.</p>
      <p>These groups were evaluated using three term selection approaches:
the raw document text (which often contains many OCR errors);
normalization, as described above; and event terms. The two
alternative TF/IDF training models were applied to each of the three
term selection approaches, for a total of 6 combinations. For each
combination, the mean rank of the triggering motion among all the
motions was determined.</p>
      <p>Figure 5 shows that the highest accuracy, as measured by the
proportion of triggering motions that were ranked first among all
pending motions, was achieved by normalizing the text without
term selection. Intuitively, reduction to procedurally relevant terms
improves the ability to determine what docket event a document
performs, but can reduce the ability to discern the similarity between
corresponding pairs of documents. TF/IDF training on just the order
and pending motions (local) is at least as accurate as training over
all orders and motions (all). Figure 6 shows the mean rank (zero
indexed) of the most similar motion under each of the six conditions.
The best (lowest) mean rank was achieved with normalization and
local TF/IDF training.</p>
      <p><sup>2</sup>Term Frequency/Inverse Document Frequency</p>
      <p>It is not unusual for a single order to rule on multiple pending
motions. A more realistic assessment of the utility of pending motion
ranking is therefore to determine how many non-triggering motions
a clerk would have to consider if the clerk read each motion in
rank order until every motion ruled on by the order is found. One
way to express this quantity is as mean precision at 100% recall.
In the test set described above, using text normalization and local
TF/IDF training, mean precision at 100% recall was 0.83, indicating
that the number of motions that a clerk would have to read was
significantly reduced.</p>
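      <p>Precision at 100% recall can be computed as in the following Python sketch: for each group, walk down the ranked motions until every motion ruled on by the order has been seen, then divide the number of relevant motions by the number of motions read. The function name and the two example groups are hypothetical.</p>
      <preformat><![CDATA[
```python
def precision_at_full_recall(ranked_ids, relevant_ids):
    """Precision at the rank where the last relevant item is retrieved."""
    relevant = set(relevant_ids)
    found = 0
    for position, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            found += 1
            if found == len(relevant):
                return found / position
    raise ValueError("ranking does not contain every relevant item")

# Hypothetical groups: (motions in ranked order, motions the order rules on).
groups = [
    (["m2", "m5", "m1", "m3"], {"m2", "m5"}),  # both found in the top two
    (["m7", "m4", "m9"], {"m4"}),              # one other motion is read first
]
scores = [precision_at_full_recall(ranked, rel) for ranked, rel in groups]
mean_precision = sum(scores) / len(scores)
print(scores, mean_precision)
```
]]></preformat>
      <p>A value of 1.0 for a group means the clerk reads no non-triggering motions before finding every motion the order resolves.</p>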
    </sec>
    <sec id="sec-7">
      <title>DECISION PREDICTION</title>
      <p>Predictive models of decision making could be useful to pro se
litigants (for help in understanding the strength of a case), to attorneys
(for help in making strategic litigation decisions), and for training
and decision support for judges and other decision makers. Even
if the accuracy of predictive models were only approximate, they
could nevertheless be valuable for decision support by helping to
identify the most relevant words, phrases, or other features of a case
record and the most relevant previous decisions.</p>
      <p>
        Highly accurate predictive models would require very detailed
linguistic analysis of the text of case records and decisions, including
argument structure, narrative analysis, etc. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, predictive
models induced from simpler lexical features may be sufficiently
accurate to be useful for the tasks listed above. Inducing such
models can be cast as supervised concept learning over corpora of case
records and decisions, where each decision is treated as a category
label for the corresponding case record. This approach is feasible only
for simple and routine cases for which it is possible to enumerate a
small set of category labels, such as granting or denying a specific
benefit or form of relief. However, such simple and routine cases are
characteristic of many forms of administrative adjudication, such as
immigration status and benefits entitlement.
      </p>
      <p>
        Unfortunately, in many simple and routine administrative
domains, only the decisions themselves, but not the underlying case
records, are available. However, in such cases, the statement of facts
in the decision can be used as a proxy for the contents of the
corresponding case record. This approach was applied to decisions of
the European Court of Human Rights in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which found that case
outcomes could be predicted to some degree from statements of fact.
The predictability of case outcomes from the statement of facts in
the decision document does not conclusively demonstrate that the
outcome would be equally predictable from the raw case record; the
decision maker’s description of the facts may have been tailored to
fit the outcome. However, a demonstration that case outcomes can
be predicted to some extent by models trained from fact statements
alone may encourage courts and agencies to experiment with this
approach to creating decision-support tools for pro se litigants and
decision makers.
      </p>
      <p>Accordingly, an experiment was performed to evaluate the
feasibility of predicting decisions from the fact statements of cases
in a representative domain: World Intellectual Property Organization
(WIPO) domain name decisions.<sup>3</sup> Domain name decisions resolve
disputes between a domain name registrant and a third party under
the Uniform Domain Name Dispute Resolution Policy (UDRP).<sup>4</sup>
The UDRP Administrative Procedure applies to disputes concerning
an alleged abusive registration of a domain name under the following
criteria:
• The domain name registered by the domain name registrant
is identical or confusingly similar to a trademark or
service mark in which the complainant (the person or entity
bringing the complaint) has rights; and
• The domain name registrant has no rights or legitimate
interests in respect of the domain name in question; and
• The domain name has been registered and is being used in
bad faith.</p>
      <p>WIPO decisions have a very consistent structure, including
sections for History, Background, Contentions, Findings, and Decision.
Just two distinct decisions are possible: transferring the domain
name or denying the complaint. As a result, the decisions are well
suited to the supervised concept-learning approach described above.</p>
    </sec>
    <sec id="sec-8">
      <title>Experimental Design</title>
      <p>Six thousand six hundred WIPO decisions were downloaded and
parsed into the five sections described above. Each decision was
labeled TRUE or FALSE based on whether the decision transferred
the domain name (TRUE) or denied the claim (FALSE). The
resulting set of cases had significant class skew, with 6,000 instances of
TRUE but only 500 instances of FALSE. For a preliminary study, 500
TRUE instances were randomly subsampled to create a balanced
test set with 500 instances of each category.</p>
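      <p>Subsampling the majority class to obtain a balanced set can be sketched as follows. The function name, the fixed random seed, and the toy identifiers are assumptions for illustration only; the sketch reduces every class to the size of the rarest one.</p>
      <preformat><![CDATA[
```python
import random

def balance_by_subsampling(instances, labels, seed=0):
    """Randomly subsample every class down to the size of the rarest class."""
    by_label = {}
    for inst, lab in zip(instances, labels):
        by_label.setdefault(lab, []).append(inst)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    balanced = []
    for lab, group in by_label.items():
        for inst in rng.sample(group, target):
            balanced.append((inst, lab))
    rng.shuffle(balanced)
    return balanced

# Toy data with a 12:1 skew (hypothetical identifiers, not real decisions).
instances = [f"case-{i}" for i in range(65)]
labels = [True] * 60 + [False] * 5
balanced = balance_by_subsampling(instances, labels)
print(len(balanced))
```
]]></preformat>
      <p>Balancing in this way makes accuracy figures easier to interpret but removes the class skew that a deployed model would face.</p>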
      <p>This balanced set of cases was converted into a series of test sets
differing in which sections were included as the text of each instance.
The sections tested were as follows:
• History
• Background
• Contentions
• The concatenation of History, Background, and Contentions
• Findings
The text of each instance was normalized by standardizing case and
removing punctuation and, in addition, either (1) removing stop
words, or (2) retaining stop words but replacing dates and numbers
with standard tokens (“NUMBER” or “DATE”). The test condition
in which the text consists of Findings is included for completeness,
although it is not a good proxy for the case record as it contains
conclusions about the facts.</p>
      <p><sup>3</sup>http://www.wipo.int/amc/en/domains/decisions.html</p>
      <p><sup>4</sup>https://www.icann.org/resources/pages/policy-2012-02-25-en</p>
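      <p>The two normalization variants can be sketched in Python as shown here. The stop-word list, the date patterns, and the function name are illustrative assumptions; the source does not list the stop words or the date formats that were recognized.</p>
      <preformat><![CDATA[
```python
import re

# Illustrative stop-word list and date patterns; both are assumptions.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "was"}
MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")
DATE_RE = re.compile(
    rf"\b(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}\b"  # e.g. "march 3, 2016"
    r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b"              # e.g. "3/14/2016"
)
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def normalize(text, remove_stop_words):
    """Lower-case, strip punctuation, then apply one of the two variants."""
    text = text.lower()
    if not remove_stop_words:
        # Variant 2: keep stop words but map dates and numbers to standard tokens.
        text = DATE_RE.sub(" DATE ", text)
        text = NUMBER_RE.sub(" NUMBER ", text)
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if remove_stop_words:
        # Variant 1: drop stop words and leave dates and numbers untouched.
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

sample = "The Complaint was filed on March 3, 2016, in respect of 2 domain names."
without_stop_words = normalize(sample, remove_stop_words=True)
with_standard_tokens = normalize(sample, remove_stop_words=False)
print(without_stop_words)
print(with_standard_tokens)
```
]]></preformat>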
      <p>For each selection of case sections and standardization method,
the text was converted into n-gram frequency vectors for n=1–4, with
only those n-grams retained that occur at least 8 times. The result was
converted into sparse ARFF format,<sup>5</sup> loaded into Weka, and evaluated
in 10-fold cross-validation using Weka’s implementation of a support
vector machine (SVM) with sequential minimal optimization.</p>
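      <p>The conversion of token sequences into sparse n-gram frequency vectors with a minimum-frequency cutoff might look like the following Python sketch. The function names are hypothetical, and the cutoff is applied here to total corpus frequency, which is only one possible reading of “occur at least 8 times”; a document-frequency cutoff would be an equally plausible interpretation.</p>
      <preformat><![CDATA[
```python
from collections import Counter

def ngrams(tokens, max_n=4):
    """Yield all contiguous n-grams of length 1..max_n as space-joined strings."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def ngram_vectors(token_lists, max_n=4, min_count=8):
    """Return (vocabulary, sparse count vectors), dropping rare n-grams.

    `min_count` is compared with total corpus frequency (an assumption).
    """
    per_doc = [Counter(ngrams(tokens, max_n)) for tokens in token_lists]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)
    kept = sorted(g for g, c in totals.items() if c >= min_count)
    vocab = {g: idx for idx, g in enumerate(kept)}
    vectors = [{vocab[g]: c for g, c in counts.items() if g in vocab}
               for counts in per_doc]
    return vocab, vectors

# Tiny hypothetical corpus; small max_n and min_count keep the output readable.
docs = ["respondent did not reply".split(),
        "respondent did not submit a response".split()]
vocab, vectors = ngram_vectors(docs, max_n=2, min_count=2)
print(sorted(vocab))
```
]]></preformat>
      <p>Each vector maps a feature index to a count, which corresponds closely to the index-value pairs of a sparse ARFF instance.</p>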
    </sec>
    <sec id="sec-9">
      <title>Experimental Results</title>
      <p>As set forth in Table 2 and Figure 7, the greatest predictive accuracy
was achieved by the combination of the History, Background, and
Contentions sections of each case (HBC). The predictive accuracy
from these three sections, f-measure of roughly 0.95, was almost as
high as the accuracy of prediction based on the text of the Findings
section.</p>
      <p>
        To understand why the HBC text is so predictive, it is helpful
to examine the terms with the highest mutual information with the
concept to be predicted, some of which are shown in Figure 8. This
excerpt shows that phrases concerning filing, failure to submit a
response, notification, and default are particularly strongly associated
with the outcome of the case. One may view high information-gain
phrases as being similar to the factors in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] with the difference that
they are induced automatically rather than being crafted manually.
The SVM decision surface represents the set of tradeoffs among
these factors that is most consistent with the training data, in a
manner reminiscent of [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but without the necessity of domain-specific
hand-engineered factors.
      </p>
      <p><sup>5</sup>http://www.cs.waikato.ac.nz/ml/weka/arff.html</p>
      <p>WIPO domain name dispute cases may be particularly conducive
to predictive modeling owing to their binary outcomes and relatively
stereotypical fact patterns. This experiment does not address the
differences between the case record and the facts as summarized in
the decision document, and the evaluation above artificially
diminished the effect of class skew by subsampling to produce a balanced
test set. Nevertheless, the impressive accuracy of a predictive model
trained on raw text without any feature design or knowledge
engineering suggests that this approach may have great promise for
increasing access to justice for pro se litigants and improving
training and decision support for decision makers in domains with many
routine adjudications.</p>
    </sec>
    <sec id="sec-10">
      <title>RELATED WORK</title>
      <p>
        The history of applying text classification techniques to legal
documents dates back at least to the 1970s [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Text classification has
been recognized as of particular importance for electronic
discovery [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Little prior work has addressed classification of docket
entries other than Nallapati and Manning [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which achieved an
f-measure of 0.8967 in distinguishing Orders to Show Cause from
other document types using a hand-engineered feature set.
      </p>
      <p>
        There is extensive current activity in predictive models trained
on factors unrelated to the merits of the case such as the nature
of suit, attorneys, forum, judge, and parties [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Recent startups
marketing predictive models for litigation support based on
non-merits-based factors include Lex Machina [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], LexPredict [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
and Premonition [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The insurance industry has a long history of
developing decision prediction based on the merits of a claim, but
these models are typically manually constructed, e.g., [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Outcome
prediction based on the merits of the case as extracted directly from raw
text is a relatively new research area, with little work outside of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-11">
      <title>SUMMARY AND FUTURE WORK</title>
      <p>Judicial document collections contain a rich trove of potential
information, but analyzing these documents presents many challenges.
This paper has demonstrated how many types of filing error detection
can be formulated as text classification problems. The highest
accuracy was obtained by combining lexical features that characterize
the document itself with procedural context features that indicate the
role that the document is intended to play. These results demonstrate
the feasibility of automating portions of the process of auditing court
submissions, which could significantly reduce a persistent drain on
court resources.</p>
      <p>The experiment with order/motion matching demonstrates that
while term selection may improve accuracy for document
classification, it can decrease accuracy for tasks that involve matching based
on overall similarity rather than procedural similarity.</p>
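      <p>As an illustration of matching by overall similarity, the following
minimal sketch ranks candidate motions against an order using TF/IDF-weighted
cosine similarity over the full vocabulary, with no term selection. The
tokenizer, the smoothed IDF formula, and the function names
(tokenize, tfidf_vectors, cosine, best_matching_motion) are assumptions made
for this example and are not taken from the system evaluated in this paper.</p>
      <preformat>
```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF/IDF vector (a dict of term weights) per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: number of documents in which each term occurs.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Smoothed IDF (an assumption) keeps every weight strictly positive.
        vectors.append({
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matching_motion(order_text, motion_texts):
    """Return (index of the most similar motion, list of all similarity scores).

    Expects at least one candidate motion.
    """
    vectors = tfidf_vectors([order_text] + list(motion_texts))
    order_vec, motion_vecs = vectors[0], vectors[1:]
    scores = [cosine(order_vec, m) for m in motion_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Hypothetical docket texts, invented for illustration only.
    motions = [
        "Motion to dismiss for lack of personal jurisdiction",
        "Motion for extension of time to file a reply brief",
        "Motion to compel production of documents",
    ]
    order = "Order granting the motion for extension of time to file the reply brief"
    index, scores = best_matching_motion(order, motions)
    print(index, [round(s, 3) for s in scores])
```
      </preformat>
      <p>In this sketch every term contributes to the similarity score, which
reflects the observation above that pruning the vocabulary can hurt matching
even when it helps classification. A practical system would presumably restrict
the candidate motions to those filed earlier in the same case.</p>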
      <p>The demonstration of outcome prediction in WIPO decisions
illustrates that for case corpora with a limited set of possible
outcomes and relatively stereotypical fact patterns, decision models
of impressive accuracy can be induced without hand-engineered
features, simply from the fact descriptions. This approach may be
particularly promising for decision support and improved access to
justice in the simpler and more routine end of the judicial spectrum.</p>
      <p>No single technology is applicable to all judicial documents, nor
is any approach sufficient for all document analysis tasks. However,
each addition to this suite of technologies adds to the capabilities
available to the courts, government agencies, and citizens to exploit
the deep well of information latent in judicial document corpora.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGMENT</title>
      <p>The MITRE Corporation is a not-for-profit Federally Funded
Research and Development Center chartered in the public interest. This
document is approved for Public Release, Distribution Unlimited,
Case Number 17-0362. ©2017 The MITRE Corporation. All rights
reserved.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsarapatsanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Preotiuc-Pietro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Lampos</surname>
          </string-name>
          .
          <article-title>Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective</article-title>
          .
          <source>PeerJ CompSci</source>
          , October 24,
          <year>2016</year>
          . https://peerj.com/articles/cs-93/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Aleven</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashley</surname>
          </string-name>
          .
          <article-title>Doing things with factors</article-title>
          .
          <source>In Proceedings of the Third European Workshop on Case-Based Reasoning (EWCBR-96)</source>
          , pages
          <fpage>76</fpage>
          -
          <lpage>90</lpage>
          , Lausanne, Switzerland,
          <year>November 1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Battiti</surname>
          </string-name>
          .
          <article-title>Using mutual information for selecting features in supervised neural net learning</article-title>
          .
          <source>IEEE Transactions on Neural Networks</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <fpage>537</fpage>
          -
          <lpage>550</lpage>
          ,
          <year>Jul 1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Boreham</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Niblett</surname>
          </string-name>
          .
          <article-title>Classification of legal texts by computer</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>12</volume>
          (
          <issue>2</issue>
          ):
          <fpage>125</fpage>
          -
          <lpage>132</lpage>
          ,
          <year>1976</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <article-title>Random forests</article-title>
          . Mach. Learn.,
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          , Oct.
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Brüninghaus</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashley</surname>
          </string-name>
          .
          <article-title>Generating legal arguments and predictions from case texts</article-title>
          .
          <source>In Proceedings of the Tenth International Conference on Artificial Intelligence &amp; Law (ICAIL-05)</source>
          , pages
          <fpage>65</fpage>
          -
          <lpage>74</lpage>
          , Bologna, Italy, June 6-11
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>L. I. I.</surname>
          </string-name>
          Cornell University Law School.
          <article-title>The federal rules of civil procedure</article-title>
          . https://www.law.cornell.edu/rules/FRCP.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gutfreund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Katz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Slonim</surname>
          </string-name>
          .
          <article-title>Automatic arguments construction-from search engine to research engine</article-title>
          .
          <source>In 2016 AAAI Fall Symposium Series</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          , E. Frank,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>The WEKA data mining software: An update</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ),
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Keerthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shevade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Murthy</surname>
          </string-name>
          .
          <article-title>Improvements to Platt's SMO algorithm for SVM classifier design</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>637</fpage>
          -
          <lpage>649</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>Lex Machina</article-title>
          . https://lexmachina.com/ [Accessed: 27
          <source>November</source>
          <year>2016</year>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>LexPredict</surname>
          </string-name>
          . https://lexpredict.com/ [Accessed: 29
          <source>November</source>
          <year>2016</year>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Madsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sigurdsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. K.</given-names>
            <surname>Hansen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Larsen</surname>
          </string-name>
          .
          <article-title>Pruning the vocabulary for better context recognition</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004)</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>483</fpage>
          -
          <lpage>488</lpage>
          . IEEE,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallapati</surname>
          </string-name>
          and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Legal docket-entry classification: Where machine learning stumbles</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08</source>
          , pages
          <fpage>438</fpage>
          -
          <lpage>446</lpage>
          , Stroudsburg, PA, USA,
          <year>2008</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peterson</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Waterman</surname>
          </string-name>
          .
          <article-title>Rule-based models of legal expertise</article-title>
          . In C. Walters, editor,
          <source>Computing Power and Legal Reasoning</source>
          , pages
          <fpage>627</fpage>
          -
          <lpage>659</lpage>
          . West Publishing Company, Minneapolis, Minnesota,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Platt</surname>
          </string-name>
          .
          <article-title>Fast training of support vector machines using sequential minimal optimization</article-title>
          . In
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          , editors,
          <source>Advances in Kernel Methods</source>
          , pages
          <fpage>185</fpage>
          -
          <lpage>208</lpage>
          . MIT Press, Cambridge, MA, USA,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Premonition</surname>
          </string-name>
          . https://premonition.ai/ [Accessed: 27
          <source>November</source>
          <year>2016</year>
          ].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Roitblat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kershaw</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Oot</surname>
          </string-name>
          .
          <article-title>Document categorization in legal electronic discovery: computer classification vs. manual review</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>61</volume>
          (
          <issue>1</issue>
          ):
          <fpage>70</fpage>
          -
          <lpage>80</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallapati</surname>
          </string-name>
          , G. Gregory,
          <string-name>
            <given-names>J.</given-names>
            <surname>Walker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Risk analysis for intellectual property litigation</article-title>
          .
          <source>In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Law</source>
          , Pittsburgh, PA, June 6-10
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Tesseract</surname>
          </string-name>
          . https://en.wikipedia.org/wiki/Tesseract_(software).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>