Automating Judicial Document Analysis

L. Karl Branting
The MITRE Corporation
7515 Colshire Drive
McLean, VA 22102, USA
lbranting@mitre.org

In: Proceedings of the Second Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2017), June 16, 2017, London, UK. Copyright © 2017 held by the authors. Copying permitted for private and academic purposes. Published at http://ceur-ws.org

ABSTRACT

Collections of documents filed in courts are potentially a rich source of information for citizens, attorneys, and courts, but courts typically lack the ability to interpret them automatically. This paper presents technical approaches to three applications of judicial document interpretation: detection of document filing errors; matching orders with the motions that they resolve; and predicting the outcome of routine cases. In empirical evaluations on filings from two representative large US District Courts, the highest accuracy in identifying filing errors was achieved by combining procedural context features with high information-gain lexical features; TF/IDF similarity was found to be an effective criterion for finding motions that correspond to orders; and induction over the texts of prior simple and routine decisions was found to produce a model capable of accurately predicting outcomes from case facts without any manually engineered features or factors.

1 INTRODUCTION

The transition from paper to electronic filing in national, local, and administrative courts, which began in the late 1990s, has transformed how courts operate and how judges, court staff, attorneys, and the public create, submit, and access court filings. However, despite many advances in judicial access and administration brought about by electronic filing, courts are typically unable to interpret the contents of court filings automatically. Instead, court filings are interpreted only when they are read by an attorney, judge, or court staff member.

Machine interpretation of court filings promises a rich source of information for improving court administration and case management, access to justice, and analysis of the judiciary. The development of large-scale text analytics makes such interpretation increasingly feasible, as collections of court documents are, in effect, annotated by the metadata generated when they are submitted, by corrections when they are audited, or, for those documents that are motions or claims, by the decisions of judges or other decision makers.

However, there are numerous challenges to automating the interpretation of case filings. Courts often accept documents in the form of PDFs created from scans. Scanned PDFs require optical character recognition (OCR) for text extraction, but this process introduces many errors and does not preserve the document layout, which contains important information about the relationships among text segments in the document. Moreover, the language of court filings is complex and specialized, and the function of a court filing depends not just on its text and format, but also on its procedural context. As a result, successful automation of court filings requires overcoming a combination of technical challenges.

This paper describes the nature of court dockets and databases, sets forth three classes of representative judicial document analysis tasks (docket error detection, order/motion matching, and decision prediction), proposes technical approaches to each of the tasks, and presents preliminary empirical evaluations of the effectiveness of each approach.

2 COURT DOCKETS AND DATABASES

A court docket is a register of document-triggered litigation events, where a litigation event consists of either (1) a pleading, motion, or letter from a litigant, (2) an order, judgment, or other action by a judge, or (3) a record of an administrative action (such as notifying an attorney of a filing error) by a member of the court staff. Each docket event in a typical electronic case management system includes (1) metadata generated at the time of filing, including both case-specific data (e.g., case number, parties, judge) and event-specific data (e.g., the attorney submitting the document, the intended document type) and (2) a text document in PDF format. Each of the two court databases in which the experiments described below were performed contained filings for over 400,000 cases involving over 1,000,000 litigants, attorneys, and judges, over 10,000,000 docket entries, and more than 4,000,000 documents.

3 DOCKET ERROR DETECTION

There are many kinds of docket errors, including defects in a submitted document (e.g., missing signature, sensitive information in an unsealed document, missing case caption) and mismatches between the content of a document and the context of the case (e.g., wrong parties, case number, or judge; mismatch between the document title and the document type asserted by the user). Some errors consist of violations of a particular court’s local rules and are therefore unique to that court. Other events, such as filing in a wrong case, constitute errors in any court. In either case, detection of defects at submission time could spare attorneys the embarrassment of submitting a defective document and the inconvenience and delays of refiling. For court staff, automated filing error detection could reduce the quality control (QC) auditing staff required for filing errors, a significant drain of resources in many courts.

3.1 Error Detection through Text Classification

In the court in which the first set of experiments was conducted, the QC staff review filings to detect a variety of docket errors, including the following four error types:

• Event-type errors, i.e., specifying the wrong event type for a document, e.g., submitting a Motion for Summary Judgment as a Counterclaim. In experiments involving this court, there were 20 event types, such as complaint, transfer, notice, order, service, etc.
• Main-vs-attachment errors, i.e., filing as a main document a document, such as an exhibit, that should be filed as an attachment to another document, or filing as an attachment a document, such as a Memorandum in Support of a Motion for Summary Judgment, that should be filed as a main document.
• Show-cause order errors. In some courts, only judges are permitted to file show-cause orders; it is an error if an attorney does so.
• Letter-motion errors. In some courts, certain routine motions can be filed as letters, but all other filings must have a formal caption. Recognizing these errors requires distinguishing letters from non-letters.

Event-type errors appear to be the most common docket errors in U.S. District courts.

Figure 1: The information gain of the 10 highest-information terms for 3 legal-document classification tasks.

Each of these filing errors can be detected by classifying a document with respect to the corresponding set of categories (event type, main vs. attachment, show-cause order vs. non-show-cause order, or letter vs. non-letter) and evaluating whether the category is consistent with the metadata generated in the docket system by the filer’s selections. Event-type document classification is particularly challenging because document types are both numerous and skewed, having a roughly power-law frequency distribution in the test set.
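The consistency check described above (classify the text, then compare the predicted category with the filer's metadata) can be sketched roughly as follows. This is only an illustration: the `Filing` record, `detect_event_type_error`, and the keyword-based `toy_classifier` are hypothetical names introduced here, and the toy classifier merely stands in for a trained model such as the language-model classifier used in the experiments.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Filing:
    text: str           # OCR'd document text
    asserted_type: str  # event type selected by the filer (docket metadata)


def detect_event_type_error(
    filing: Filing, classify: Callable[[str], str]
) -> Optional[str]:
    """Return a description of the mismatch, or None if text and metadata agree."""
    predicted = classify(filing.text)
    if predicted != filing.asserted_type:
        return f"asserted '{filing.asserted_type}' but text looks like '{predicted}'"
    return None


def toy_classifier(text: str) -> str:
    """Hypothetical stand-in for a trained classifier: simple keyword lookup."""
    lowered = text.lower()
    if "so ordered" in lowered:
        return "order"
    if "motion" in lowered:
        return "motion"
    return "notice"
```

The same pattern would apply to the other three error types by swapping in a classifier over the corresponding category set (main vs. attachment, show-cause vs. not, letter vs. non-letter).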
The first set of experiments attempted to identify each of the four docket errors above by classifying document text and determining whether there is a conflict between the apparent text category and the document’s metadata. Classification was performed with the lingpipe¹ LMClassifier, which performs joint probability-based classification of token sequences into non-overlapping categories based on language models for each category and a multivariate distribution over categories.

¹ http://alias-i.com/lingpipe/

3.1.1 Term Selection and Document Truncation. Court filings can be thought of as comprising four distinct sets of terms:

• Procedural words, which describe the intended legal function of the document (e.g., “complaint,” “amended,” “counsel”);
• “Stop-words,” which are non-content common words, such as “of” and “the”;
• Words unique to the case, such as names, and words expressing the narrative events giving rise to the case; and
• Substantive (as opposed to procedural) legal terms (e.g., “reasonable care,” “intent,” “battery”).

Terms in the first of these sets (procedural words) carry the most information about the type of the document. These words tend to be concentrated around the beginning of legal documents, often in the case caption, and at the end, where distinctive phrases like “so ordered” may occur.

We hypothesized that only procedure terms are relevant to the type of a document, so we explored approaches to filtering non-procedure terms. Elimination of irrelevant terms can not only speed execution, but in some cases has been shown to increase accuracy [13].

Three approaches to term selection were investigated: two ad hoc and domain-specific, and one general and domain-independent. The first approach was to eliminate all terms except non-stopwords that occur in the Federal Rules of Civil Procedure [7]. A related alternative approach was to remove all terms except for non-stopwords occurring in “event” (i.e., document) descriptions typed by filers when they submit documents into the docket system. The third approach was to select terms based on their mutual information with each particular text category [3]. The first lexical set, termed FRCP, contains 2658 terms; the second, termed event, consists of 513 terms. Separate mutual-information sets were created for each classification task, reflecting the fact that the information gain from a term depends on the category distribution of the documents. For example, Figure 1 shows the 10 highest information terms for three different classification tasks: event-type classification, distinguishing letters from non-letters, and show-cause order detection, illustrating that the most informative terms differ widely depending on the classification task. Figure 2 illustrates the reduction of full document text to just high information-gain terms, which typifies the vocabulary-reduction process.

Figure 2: Reduction of a full document to just high information-gain terms.

Several approaches to document truncation were explored as well. The first was to limit the text to the first l tokens of the document (i.e., excise the remainder of the document). If l is sufficiently large, this is equivalent to including the entire document. A second option is to include the last l tokens of the document (the suffix) as well as the prefix.

3.1.2 Evaluation of Alternative Term Reduction Approaches. Two different information-gain thresholds were tested for each classification type, intended to create one small set of very-high information terms (ig_small) and a larger set created using a lower threshold (ig_large). The thresholds and sizes of the large and small high information-gain term sets are set forth in Table 1. The text of each document was obtained by OCR using the open-source program Tesseract [20]. Each text was normalized by removing non-ASCII characters and standardizing case prior to term selection, if any.

Table 1: Thresholds and size of large and small high information-gain term sets. Each cell gives the threshold, with the size of the term set in parentheses.

            showcause      main_attch     types        letter
ig_small    0.01 (135)     0.025 (262)    0.1 (221)    0.0005 (246)
ig_large    0.0025 (406)   0.0125 (914)   0.05 (689)   0.00001 (390)

Figure 3 shows a comparison of four vocabulary alternatives on the four text classification tasks described above. These tests measured mean f-measure in 8-fold cross validation using a 4-gram language model, 50-token prefix length, and no suffix. In the baseline vocabulary set, normalize, non-ASCII characters, numbers, and punctuation are removed and tokens are lower-cased. The results show that classification accuracy using an unreduced vocabulary was significantly lower than the best reduced vocabulary performance for show-cause order detection and type classification. Term selection had little effect on accuracy for the letter and main vs. attachment detection tasks. No reduced-vocabulary set consistently outperformed the others. This indicates that restricted term sets derived through information gain perform roughly as well as those produced using domain-specific information, suggesting that the reduced vocabulary approach is appropriate for situations in which domain-specific term information is unavailable.

Figure 3: Classification accuracy as a function of reduced vocabulary (8-fold cross validation using a 4-gram language model, 50-token prefix length, and no suffix).

Summarizing over the tests, the highest mean f-measure based on text classification alone and the particular combination of parameters that led to this accuracy for each classification task were as follows:

(1) Event type: 0.743 (prefix=50, 4-gram, ig_large vocabulary, 20 categories)
(2) Main-vs-attachment: 0.871 (prefix=256, 6-gram, event vocabulary)
(3) Show-cause order: 0.957 (prefix=50, 5-gram, ig_small vocabulary)
(4) Letter-vs-non-letter: 0.889 (prefix=50, no 4-gram, ig_large vocabulary)

3.2 Incorporating Procedural Context Features

The accuracy of event-type detection (f-measure of roughly 0.743 under the best combinations of parameters) is sufficiently low that its utility for many auditing functions may be limited. An analysis of the classification errors produced by the event-type text classification model indicated that a document’s event type depends not just on the text of the document but also on its procedural context. For example, motions and orders are sometimes extremely similar because judges grant a motion by adding and signing an order stamp to the motion. Since stamps and signatures are seldom accurately OCR’d, the motion and order may be indistinguishable by the text alone under these circumstances. However, orders can be issued only by a judge, and judges never file motions, so the two cases can be distinguished by knowing the filer. In addition, attachments have the same event type as the main document in CM/ECF. So, for example, a memorandum of law is ordinarily a main document, but in some courts a memorandum can be filed as an attachment, in which case its event type is the same as that of the main document to which it is attached.

Contextual information potentially relevant to a document’s type includes: whether it was filed as a main document or as an attachment; the filer (e.g., attorney, clerk, judge); the type of the case (e.g., criminal, civil, multi-district); and the document length (e.g., memoranda are typically long; minute orders typically short). Combining these non-lexical features with text features requires a different classifier than the language-model classifier used in the first set of experiments.

We compared the performance of Support Vector Machine (SVM) learning (Platt’s algorithm for sequential minimal optimization [10, 16]) and Random Forests [5], both in the WEKA [9] implementation, on the task of filing event classification. For each filing event, the document text was normalized by filtering stop words, normalizing dates and numbers to standard tokens, and replacing each instance of a party name with the role of that party (e.g., DFT, PTF). The result was combined with the contextual features and converted into a sparse n-gram frequency vector from which the {1,2,4,8} thousand highest information gain features were selected (unsurprisingly, the contextual features always had higher information gain than any lexical feature). The training set consisted of 28,763 main documents having 43 distinct types, representing two months’ filings in a large US District court.

Figure 4: Event type classification accuracy as a function of reduced vocabulary (10-fold cross validation using a 2-gram language model, normalization of dates, numbers, and parties, 100-token prefix length, and minimum token frequency of 32, with 43 event types).

As shown in Figure 4, the SVM was consistently more accurate, with little variation in accuracy as a function of the number of features, although accuracy was slightly higher at 4,000 features than other feature set sizes. By contrast, the accuracy of the random forest diminished with increasing numbers of features. The highest accuracy SVM configuration, f-measure of 0.926, was much higher than the maximum observed with text-only classification (albeit in a different court). This suggests that including procedural context features is essential for accurate document filing type identification in judicial databases, and that algorithms that can handle both textual and categorical features should be used for this task.

4 ORDER/MOTION MATCHING

In many federal courts, docket clerks are responsible for filing orders executed by judges into the docket system, a process that requires the clerk to identify all pending motions to which the order responds and to link the order to those motions. This entails reading all pending motions, a tedious task. If the motions corresponding to an order could be identified automatically, docket clerks would be relieved of this laborious task. Even ranking the motions by their likelihood of being resolved by a given order would decrease the burden on docket clerks. Moreover, order/motion matching is a subtask of a more general issue-chaining problem, which consists of identifying the sequence of preceding and subsequent documents relevant to a given document.

A straightforward approach to this task is to treat order/motion matching as an information-retrieval task, under the hypothesis that an order is likely to have a higher degree of similarity to its corresponding motions than to motions that it does not rule on. An obvious approach is to present pending motions to the clerk in rank order of their TF/IDF²-weighted cosine similarity to the order.

² Term Frequency/Inverse Document Frequency

The evaluation above showing that term selection improves document classification raises the question whether term selection might be beneficial for order/motion matching as well. A second question is whether the IDF model should be trained on an entire corpus of motions and orders or whether acceptable accuracy can be obtained by training just on the order and pending motions.

To evaluate the effectiveness of this approach to order/motion matching, a subset of the document set described above was collected consisting of 3,356 groups, each comprising (1) an order, (2) a motion that the order rules on (a triggering motion), and (3) a non-empty set of all motions that were pending at the time of the order but not ruled on by the order (non-triggering motions). The mean number of motions per group was 5.87 (i.e., there were on average 4.87 non-triggering motions). For each group, all motions were ranked by similarity to the order under the given metric. The proportion of triggering motions that were ranked first and the mean rank of the triggering motion were calculated from each group’s ranking.

These groups were evaluated using three term selection approaches: the raw document text (which often contains many OCR errors); normalization, as described above; and event terms. The two alternative TF/IDF training models were applied to each of the three term selection approaches, for a total of 6 combinations. For each combination, the mean rank of the triggering motion among all the motions was determined.

Figure 5: The proportion of groups for which the order is more similar to the triggering motion than to any other motion.

Figure 5 shows that the highest accuracy, as measured by the proportion of triggering motions that were ranked first among all pending motions, was achieved by normalizing the text without term selection. Intuitively, reduction to procedurally relevant terms improves the ability to determine what docket event a document performs, but can reduce the ability to discern the similarity between corresponding pairs of documents. TF/IDF training on just the order and pending motions (local) is at least as accurate as training over all orders and motions (all). Figure 6 shows the mean rank (zero-indexed) of the triggering motion under each of the six conditions. The best (lowest) mean rank was achieved with normalization and local TF/IDF training.

Figure 6: The mean rank of the triggering motion among all pending motions, zero-indexed (lower is better, zero is perfect).

It is not unusual for a single order to rule on multiple pending motions. A more realistic assessment of the utility of pending motion ranking is therefore to determine how many non-triggering motions a clerk would have to consider if the clerk read each motion in rank order until every motion ruled on by the order is found. One way to express this quantity is as mean precision at 100% recall. In the test set described above, using text normalization and local TF/IDF training, mean precision at 100% recall was 0.83, indicating that the number of motions that a clerk would have to read was significantly reduced.

5 DECISION PREDICTION

Predictive models of decision making could be useful to pro se litigants (for help in understanding the strength of a case), to attorneys (for help in making strategic litigation decisions), and for training and decision support for judges and other decision makers. Even if the accuracy of predictive models were only approximate, they could nevertheless be valuable for decision support by helping to identify the most relevant words, phrases, or other features of a case record and the most relevant previous decisions.

Highly accurate predictive models would require very detailed linguistic analysis of the text of case records and decisions, including argument structure, narrative analysis, etc. [8]. However, predictive models induced from simpler lexical features may be sufficiently accurate to be useful for the tasks listed above. Inducing such models can be cast as supervised concept learning over corpora of case records and decisions, where each decision is treated as a category label for the corresponding case record. This approach is feasible only for simple and routine cases for which it is possible to enumerate a small set of category labels, such as granting or denying a specific benefit or form of relief. However, such simple and routine cases are characteristic of many forms of administrative adjudication, such as immigration status and benefits entitlement.

Unfortunately, in many simple and routine administrative domains, only the decisions themselves, but not the underlying case records, are available. However, in such cases, the statement of facts in the decision can be used as a proxy for the contents of the corresponding case record. This approach was applied to decisions of the European Court of Human Rights in [1], which found that case outcomes could be predicted to some degree from statements of fact. The predictability of case outcomes from the statement of facts in the decision document does not conclusively demonstrate that the outcome would be equally predictable from the raw case record; the decision maker’s description of the facts may have been tailored to fit the outcome. However, a demonstration that case outcomes can be predicted to some extent by models trained from fact statements alone may encourage courts and agencies to experiment with this approach to creating decision-support tools for pro se litigants and decision makers.

Accordingly, an experiment was performed to evaluate the feasibility of predicting decisions from the fact statements of cases in a representative domain: World Intellectual Property Organization (WIPO) domain name decisions.³ Domain name decisions resolve disputes between a domain name registrant and a third party under the Uniform Domain Name Dispute Resolution Policy (UDRP).⁴ The UDRP Administrative Procedure applies to disputes concerning an alleged abusive registration of a domain name under the following criteria:

• The domain name registered by the domain name registrant is identical or confusingly similar to a trademark or service mark in which the complainant (the person or entity bringing the complaint) has rights; and
• The domain name registrant has no rights or legitimate interests in respect of the domain name in question; and
• The domain name has been registered and is being used in bad faith.

³ http://www.wipo.int/amc/en/domains/decisions.html
⁴ https://www.icann.org/resources/pages/policy-2012-02-25-en

WIPO decisions have a very consistent structure, including sections for History, Background, Contentions, Findings, and Decision. Just two distinct decisions are possible: transferring the domain name or denying the complaint. As a result, the decisions are well suited to the supervised concept-learning approach described above.

5.1 Experimental Design

Six thousand six hundred WIPO decisions were downloaded and parsed into the five sections described above. Each decision was labeled TRUE or FALSE based on whether the decision transferred the domain name (TRUE) or denied the claim (FALSE). The resulting set of cases had significant class skew, with 6,000 instances of TRUE but only 500 instances of FALSE. For a preliminary study, 500 TRUE instances were randomly subsampled to create a balanced test set with 500 instances of each category.

This balanced set of cases was converted into a series of test sets differing in which sections were included as the text of each instance. The sections tested were as follows:

• History
• Background
• Contentions
• The concatenation of History, Background, and Contentions
• Findings

Figure 7: Mean f-measure in ten-fold cross-validation with Support Vector Machine prediction of WIPO case outcomes.

The text of each instance was normalized by standardizing case and removing punctuation and, in addition, either (1) removing stop words, or (2) retaining stop words but replacing dates and numbers with standard tokens (“NUMBER” or “DATE”). The test condition in which the text consists of Findings is included for completeness, although it is not a good proxy for the case record as it contains conclusions about the facts.
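The two normalization conditions just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the preprocessing actually used: the stop-word list, the date and number patterns, and the function name `normalize` are all hypothetical choices made here for the example.

```python
import re
import string
from typing import List

# Illustrative stop-word list only; the stop-word list actually used is not specified.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "was"}

# Assumed patterns: numeric dates such as 12/05/2016 or 2016-12-05, and plain numbers.
DATE_RE = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text: str, mode: str) -> List[str]:
    """Lower-case and strip punctuation, then apply one of the two conditions.

    mode="stopwords":  remove stop words.
    mode="nums/dates": keep stop words, replace dates and numbers with tokens.
    """
    text = text.lower()
    if mode == "nums/dates":
        # Substitute before stripping punctuation so date separators are still present.
        text = DATE_RE.sub(" DATE ", text)
        text = NUMBER_RE.sub(" NUMBER ", text)
    tokens = text.translate(PUNCT_TABLE).split()
    if mode == "stopwords":
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens
```

Lower-casing first means the upper-case placeholder tokens inserted afterwards cannot collide with ordinary words in the text.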
For each selection of case sections and standardization method, the text was converted into n-gram frequency vectors for n=1–4, with only those n-grams retained that occur at least 8 times. The result was converted into sparse arff format,⁵ loaded in Weka, and evaluated in 10-fold cross-validation using Weka’s implementation of a support vector machine (SVM) with sequential minimal optimization.

⁵ http://www.cs.waikato.ac.nz/ml/weka/arff.html

Table 2: Mean f-measure in ten-fold cross-validation with Support Vector Machine prediction of WIPO case outcomes. The text of each instance consists of the History (H), Background (B), Contentions (C), all three (HBC) or Findings (F) section.

              H       B       C       HBC     F
stopwords     0.943   0.758   0.822   0.950   0.955
nums/dates    0.902   0.750   0.813   0.948   0.960

5.2 Experimental Results

As set forth in Table 2 and Figure 7, the greatest predictive accuracy apart from the Findings section was achieved by the combination of the History, Background, and Contentions sections of each case (HBC). The predictive accuracy from these three sections, f-measure of roughly 0.95, was almost as high as the accuracy of prediction based on the text of the Findings section.

To understand why the HBC text is so predictive, it is helpful to examine the terms with the highest mutual information with the concept to be predicted, some of which are shown in Figure 8. This excerpt shows that phrases concerning filing, failure to submit a response, notification, and default are particularly strongly associated with the outcome of the case. One may view high information-gain phrases as being similar to the factors in [2] with the difference that they are induced automatically rather than being crafted manually. The SVM decision surface represents the set of tradeoffs among these factors that is most consistent with the training data, in a manner reminiscent of [6], but without the necessity of domain-specific hand-engineered factors.

Figure 8: A subset of high information-gain terms in WIPO for History/Background/Contentions instances.

WIPO domain name dispute cases may be particularly conducive to predictive modeling owing to their binary outcomes and relatively stereotypical fact patterns. This experiment does not address the differences between the case record and the facts as summarized in the decision document, and the evaluation above artificially diminished the effect of class skew by subsampling to produce a balanced test set. Nevertheless, the impressive accuracy of a predictive model trained on raw text without any feature design or knowledge engineering suggests that this approach may have great promise for increasing access to justice for pro se litigants and improving training and decision support for decision makers in domains with many routine adjudications.

6 RELATED WORK

The history of applying text classification techniques to legal documents dates back at least to the 1970s [4]. Text classification has been recognized as of particular importance for electronic discovery [18]. Little prior work has addressed classification of docket entries other than Nallapati and Manning [14], which achieved an f-measure of 0.8967 in distinguishing Orders to Show Cause from other document types using a hand-engineered feature set.

There is extensive current activity in predictive models trained on factors unrelated to the merits of the case, such as the nature of suit, attorneys, forum, judge, and parties [19]. Recent startups marketing predictive models for litigation support based on non-merits-based factors include Lex Machina [11], LexPredict [12], and Premonition [17]. The insurance industry has a long history of developing decision prediction based on the merits of a claim, but these models are typically manually constructed, e.g., [15]. Outcome prediction based on the merits of the case as extracted directly from raw text is a relatively new research area, with little work outside of [1].

7 SUMMARY AND FUTURE WORK

Judicial document collections contain a rich trove of potential information, but analyzing these documents presents many challenges. This paper has demonstrated how many types of filing error detection can be formulated as text classification problems. The highest accuracy was obtained by combining lexical features that characterize the document itself with procedural context features that indicate the role that the document is intended to play. These results demonstrate the feasibility of automating portions of the process of auditing court submissions, which could significantly reduce a persistent drain on court resources.

The experiment with order/motion matching demonstrates that while term selection may improve accuracy for document classification, it can decrease accuracy for tasks that involve matching based on overall similarity rather than procedural similarity.

The demonstration of outcome prediction in WIPO decisions illustrates that for case corpora with a limited set of possible outcomes and relatively stereotypical fact patterns, decision models of impressive accuracy can be induced without hand-engineered features, simply from the fact descriptions. This approach may be particularly promising for decision support and improved access to justice in the simpler and more routine end of the judicial spectrum.

No single technology is applicable to all judicial documents, nor is any approach sufficient for all document analysis tasks. However, each addition to this suite of technologies adds to the capabilities available to the courts, government agencies, and citizens to exploit the deep well of information latent in judicial document corpora.

ACKNOWLEDGMENT

The MITRE Corporation is a not-for-profit Federally Funded Research and Development Center chartered in the public interest. This document is approved for Public Release, Distribution Unlimited, Case Number 17-0362. ©2017 The MITRE Corporation. All rights reserved.

REFERENCES

[1] N. Aletras, D. Tsarapatsanis, D. Preotiuc-Pietro, and V. Lampos. Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective. PeerJ CompSci, October 24 2016. https://peerj.com/articles/cs-93/.
[2] V. Aleven and K. Ashley. Doing things with factors. In Proceedings of the Third European Workshop on Case-Based Reasoning (EWCR-96), pages 76–90, Lausanne, Switzerland, November 1996.
[3] R. Battiti. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks, 5(4):537–550, Jul 1994.
[4] J. Boreham and B. Niblett. Classification of legal texts by computer. Information Processing & Management, 12(2):125–132, 1976.
[5] L. Breiman. Random forests. Mach. Learn., 45(1):5–32, Oct. 2001.
[6] S. Brüninghaus and K. Ashley. Generating legal arguments and predictions from case texts. In Proceedings of the Tenth International Conference on Artificial Intelligence & Law (ICAIL-05), pages 65–74, Bologna, Italy, June 6–11 2005.
[7] L. I. I. Cornell University Law School. The federal rules of civil procedure. https://www.law.cornell.edu/rules/FRCP.
[8] D. Gutfreund, Y. Katz, and N. Slonim. Automatic arguments construction: from search engine to research engine. In 2016 AAAI Fall Symposium Series, 2016.
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The weka data mining software: An update. SIGKDD Explorations, 11(1), 2009.
[10] S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy. Improvements to platt’s smo algorithm for svm classifier design. Neural Computation, 13(3):637–649, 2001.
[11] Lex machina. https://lexmachina.com/ [Accessed: 27 November 2016].
[12] Lexpredict. https://lexpredict.com/ [Accessed: 29 November 2016].
[13] R. E. Madsen, S. Sigurdsson, L. K. Hansen, and J. Larsen. Pruning the vocabulary for better context recognition. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 2, pages 483–488. IEEE, 2004.
[14] R. Nallapati and C. D. Manning. Legal docket-entry classification: Where machine learning stumbles. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 438–446, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
[15] M. Peterson and D. Waterman. Rule-based models of legal expertise. In C. Walters, editor, Computing Power and Legal Reasoning, pages 627–659. West Publishing Company, Minneapolis, Minnesota, 1985.
[16] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA, 1999.
[17] Premonition. https://premonition.ai/ [Accessed: 27 November 2016].
[18] H. L. Roitblat, A. Kershaw, and P. Oot. Document categorization in legal electronic discovery: computer classification vs. manual review. Journal of the American Society for Information Science and Technology, 61(1):70–80, 2010.
[19] M. Surdeanu, R. Nallapati, G. Gregory, J. Walker, and C. Manning. Risk analysis for intellectual property litigation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Law, Pittsburgh, PA, June 6–10 2011. ACM.
[20] Tesseract. https://en.wikipedia.org/wiki/Tesseract_(software).