Document Segmentation Labeling Techniques for Court Filings

Alex Lyte, Karl Branting
The MITRE Corporation
{alyte,lbranting}@mitre.org

ABSTRACT
Arguments, motions, and decisions in courts of the United States of America are recorded in PDF documents filed in each court's docket. Utilization of these documents as data requires accurate and efficient information extraction methods. We take a supervised machine learning approach to a portion of this task, predicting metadata labels in court filings. On a dataset of about 2500 annotated scanned PDF images with 21 labels, we found that traditional classifiers such as MaxEnt achieved an average F1-score of 0.44 (micro-averaged across labels), with the highest label (Body) at 0.88. However, a 1-dimensional model of sequences in the text, Mallet's CRF implementation, achieved an average F1-score of 0.6 across all labels, with some labels as high as 0.91. These results demonstrate the value of using sequence models over traditional classifiers in labeling the types of information in court filings.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada.
© 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org.
Approved for public release; distribution unlimited. Public Release Case Number 9-1137. © 2019 The MITRE Corporation. All Rights Reserved.

1. INTRODUCTION
A court filing is a legal document submitted to a court that triggers an event in a legal proceeding. Court filings indicate what the case is about, why it should be in that court, and the grounds for the legal dispute. The legal effect of a filing depends critically on the event that it is intended to trigger (e.g., dismissal, answer to complaint, substitution of counsel), the role of the filer (plaintiff, defendant, court, intervener, etc.), the context of the filing (e.g., the case, the previous filing, if any, that it is intended to respond to), whether it has been properly signed, and other document characteristics. Any process for automated analysis of court filings must determine the contents of these fields, which we refer to as "metadata", to distinguish it from the content of the body of a document.

A simple example of the importance of automated metadata extraction is automated document quality control; that is, detection of discrepancies between the document metadata (such as the case number) and the metadata specified by the filer (e.g., the number of the case that the document was filed into). The shift to electronic filing systems, such as the US Federal Judiciary's CM/ECF system, by increasing numbers of courts means that filings are no longer inspected for errors by an intake clerk. Instead, this function is often performed by quality-control staff. Automating this process would free limited court resources for more productive purposes (Branting 2017).

However, automated extraction of document metadata requires identifying the type and location of the fields in the case caption and footer. This process could be assisted by machine transcription, but there are several challenges. For one, many documents are first printed on paper and then scanned into PDF form. Thus, a common format for these documents is an image, rather than plain text or XML. Moreover, recovering the layout of native PDF documents can itself be challenging, as described below.

There are tools available for image analysis, as well as for converting documents to plain text or XML, such as Apache Tika. But further challenges arise in how the information is laid out on the page. There is some structure in the layout of a court filing: the court is at the top of the page, with the parties below it, and the document number to the right of the parties. However, the actual physical position of this information can vary based on the amount of text and the conventions of the court. Many courts have small variations in how the information is presented, such as right-justifying vs. centering the court, or putting the document number at the top of the page.

Since there is no fixed location of information on each page, and rarely any indicative metadata, it becomes very difficult to automatically determine which piece of text is the court, the parties, and the document number. Additionally, things like stamps and signatures are often placed arbitrarily on the page, introducing noise in any image-to-text conversion.

When a document image is converted into XML via a conversion tool like Apache Tika, there are a number of features that can be taken from the new structure. In this paper, we attempt to assign a label to each word using both lexical and positional features. Positional features include the x and y position of each word, the quadrant of the page it is in, and the distance from other words around it. Lexical features include the word itself, the word case, the word type, and indicators of the word matching typical words in each type.

In our analysis, we find that positional features alone are not sufficient to classify most words, but reasonable performance can be obtained by including both lexical and positional features.

2. RELATED WORK
Several research communities have been active in document analysis, including historians, librarians, scientists, legal technologists, and those in government. Each community comes with a different set of data and goals, but all follow a similar processing framework.

There are several ways to approach information extraction from documents.
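As a concrete illustration of the positional features described in the introduction (x/y position and page quadrant), the following minimal sketch derives them from an hOCR-style pixel bounding box. The function and feature names are hypothetical illustrations, not the paper's actual toolchain.

```python
def positional_features(bbox, page_w, page_h):
    """Derive simple positional features for one OCR token.

    bbox is an hOCR-style pixel box (x0, y0, x1, y1); page_w and page_h
    are the page dimensions in pixels.  All names are illustrative.
    """
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2.0            # token center
    cy = (y0 + y1) / 2.0
    pct_x = cx / page_w             # horizontal distance from the origin
    pct_y = cy / page_h             # vertical distance from the origin
    quadrant = (1 if pct_x < 0.5 else 2) + (0 if pct_y < 0.5 else 2)
    return {
        "pct_x_from_origin": round(pct_x, 3),
        "pct_y_from_origin": round(pct_y, 3),
        "quadrant": quadrant,       # 1=top-left, 2=top-right, 3=bottom-left, 4=bottom-right
    }
```

For example, a word box (100, 100, 300, 140) on a 2550x3300-pixel page (a 300-dpi letter scan) falls in quadrant 1, near the origin, which is where a court name would typically appear.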
One of the first tasks is separating the elements of the page, a process called segmentation. (Mao, Rosenfeld, & Kanungo, 2001) distinguish between physical and logical layout segmentation. Physical segmentation includes identifying the lines, spaces, blocks, and other elements on the page. Logical segmentation seeks to categorize these elements by their function (e.g., headers, footers, content trees). Methods of logical segmentation include rule-based approaches, comparison against knowledge-bases, and unsupervised learning.

More recently, researchers have approached the problem by converting the elements on the page into vectors and using supervised machine-learning models to classify the logical function of each element on the page. (Souafi-Bensafi, Parizeau, Lebourgeois, & Emptoz, 2001), for example, identified a hierarchy of geometric text blocks in various publications and, along with typographical information, constructed a vector representation for each word. They then used a Bayesian network classifier to label the logical function of each word.

Standard classifiers, such as SVMs, Bayes nets, and random forests, can be considered 0-dimensional models, in that they only consider the features of each token, but not the sequence of tokens around it. Sequence learning algorithms, such as Conditional Random Fields (CRF), can be considered 1-dimensional classifiers, in that they consider the features of the elements before and after each token. (Trompper & Winkels) used a CRF model to classify header types in Dutch court documents from XML and found that CRFs outperformed a deterministic tagger.

Two-dimensional sequence learners can consider sequences of tokens in multiple directions and can thus exploit horizontal and vertical relationships between elements in documents. In '2D Conditional Random Fields for Web Information Extraction', (Zhu, Nie, Wen, Zhang, & Ma) successfully used a 2D CRF to classify sections of web pages.

In this paper, we focus on assigning logical labels to words in each court filing. We converted each scanned PDF into hierarchical OCR (XML) using Apache Tika and developed positional and linguistic features for each word token. We then compared 0-, 1-, and 2-dimensional models to identify the relevant sections of the page.

3. APPROACH
In this work, a labeled dataset was constructed from scanned PDFs of court filings. This was done using an annotation tool called the MITRE Annotation Tool (MAT), developed by The MITRE Corporation. This tool contains resources for creating, maintaining, and scoring annotated corpora of page images. The tool contains a set of annotation guidelines which we settled on after a number of rounds of pilot annotation. These guidelines focus on the first and last pages of court filings and legal letters. The annotator is asked to locate the major, non-nested sections of these pages (signatures, caption, court, body, etc.), as well as non-text stamps (such as received stamps), which are annotated for future reference. The annotation tool is Web-based and provides a graphical tool for identifying blocks and labeling them. In comparison mode, the tool can compare two annotators' efforts to each other.

The tool exploits a position-aware OCR output format known as hOCR, which presents each word along with its pixel-level location block on the page from which it was extracted. This position awareness allows us to score annotator blocks against each other, by determining which words are within each annotator block and how many of the words are in common between blocks. This allows the scorer to ignore slight variations in the actual x/y locations of the blocks and focus on how much content is in common.

Once the documents were annotated and converted into XML with labels, a toolchain was constructed to build models for automated inference of the textual (non-stamp) blocks given the hOCR output.

The fundamental problem with standard text-based approaches is that the text on these pages is not running text, but rather in blocks, so serializing the blocks in a standard line-oriented way may obscure the structure of the document and lead to problems applying standard structural techniques. Our hypothesis has been that using a graphical modeling inference strategy, allowing us to create much more structurally sophisticated contextual dependencies among elements, including 2-dimensional geometry, would enhance our ability to learn the location of these blocks.

Our strategy is an enhancement of the standard classification approach. Our goal has been to be able to compare multiple strategies to each other, including those strategies which build on these sophisticated contextual dependencies. Therefore, we've built a general-purpose experimentation harness for this family of classifiers.

First, from the hOCR output for a given page, the tool constructs a set of features for each token in the document. These features can be atomic features, string-valued features, or float-valued features. These features include:

• case features, related to the capitalization pattern of the token
• digit and garbage features, related to the distribution of digits and non-alphabetic characters in the token
• word and ngram features, related to the character sequence of the token
• tag features, derived from applying the Stanford toolkit named entity tagger to the linearized text (these features are not likely to do much work for us, given the known problems with simply serializing this text line-by-line)
• similarity features, which identify the best reasonably close match between the token and some of the case metadata for the document (e.g., the names of the parties or attorneys)
• 2-dimensional location features, which indicate the position of the token on the page (what quadrant it's in, and what percentage from the origin it is)
• margin features, indicating words on the margin and whether they're indented

There can also be features on links between tokens, e.g., whether two tokens are farther apart than the average or median distance between tokens in the horizontal direction, whether two tokens are more than one line apart in the vertical direction, or whether two tokens are on the same line.

This array of features, then, provides two levels of position sensitivity: first, on the token level, with the 2-dimensional location features, and second, with links between the tokens, for engines which recognize such features.

We explored three classes of algorithms:

• 0-dimensional token classifiers, represented by a maximum-entropy algorithm, implemented separately by the MALLET¹ engine and by the Mandolin² engine.
• 1-dimensional linear CRF, also implemented with both the MALLET and Mandolin engines.
• 2-dimensional CRF, where the dimension here refers not to geometric dimensions but to abstract properties of the engine. Our goal, however, has been to use these properties to encode context dependencies in two dimensions. This was implemented only with Mandolin; a MALLET-equivalent (GRMM) implementation was attempted but unsuccessful.

Only the Mandolin engine explicitly represents links between tokens. We model our 2 geometric dimensions by computing unobstructed overlap between tokens in the vertical direction, as well as using line adjacency in the horizontal direction. Only the 2-dimensional model captures feature information in the vertical direction in our approach.

Figure 2: Examples of a case caption and footer with labeled fields. Each court document contains the name of the court, the parties in the case, the case number, and the document title.

Each word in the document is extracted, and positional and lexical features are determined from the words and their context.
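The link features described above can be sketched as follows. This is an illustrative reimplementation under assumed inputs (token boxes in reading order, in pixels), not the MALLET/Mandolin feature code itself; all names are hypothetical.

```python
from statistics import median

def same_line(box_a, box_b, tol=0.5):
    """True if two token boxes (x0, y0, x1, y1) overlap vertically by at
    least `tol` of the shorter box's height, i.e., they share a text line."""
    overlap = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    shorter = min(box_a[3] - box_a[1], box_b[3] - box_b[1])
    return overlap >= tol * shorter

def link_features(boxes):
    """For each adjacent pair of token boxes, emit link features: a
    same-line indicator, and whether the horizontal gap between the pair
    exceeds the median gap on the page."""
    gaps = [b[0] - a[2] for a, b in zip(boxes, boxes[1:])]
    med = median(gaps) if gaps else 0
    feats = []
    for (a, b), gap in zip(zip(boxes, boxes[1:]), gaps):
        feats.append({
            "same_line": same_line(a, b),
            "wide_gap": gap > med,   # farther apart than the median distance
        })
    return feats
```

An unusually wide gap between same-line neighbors is the kind of cue that separates, say, a party name from the document number printed to its right.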
Several machine learning algorithms were then used to construct models to predict the labels based on the training data. The data was separated into batches, with each batch containing about 150 documents. Overall, about 22 batches were used for training, and 2 batches were used for testing. Within each batch, each document was divided into words, with features assigned to each word based on its positional and lexical elements.

4. DATA
Our corpus consists of the first and last pages drawn from approximately 2500 court filings, PDFs typified by Figure 1, amounting to about 3500 annotated pages (some documents are only one page long).

Figure 1: Examples of the varied structure of court filings

The number of words with each label varies, with the body containing the most words on average, and the caption a distant second, as illustrated in Figure 3.

Figure 3: Number of occurrences of each type of word

Labels tend to occupy certain regions consistently, though their actual position can vary greatly. As an illustration of this, we plotted the X and Y coordinates of a sample of words, colored by label, in Figure 4.

Figure 4: Sample locations of words, colored by label

¹ http://mallet.cs.umass.edu/
² http://project-mandolin.github.io/mandolin/index.html

5. FEATURE EXTRACTION
Around 25 features of the data were identified and extracted for each token, characterizing its positional, linguistic, and contextual information.

Using Weka's 'Information Gain' evaluator, the features were ranked according to the predictive value they provide. The most highly ranked features, pct_y_from_origin and pct_x_from_origin, represent the position of the token on the page. After that, entryType, stanford_lemma, and uncacheableAtomicFeatures deal with the linguistic properties of the data. Finally, otherWText and right_indent deal with the relative positional information of the token to other tokens in the text.

Info Gain  Type  Feature                    Description
0.44       Pos   pct_y_from_origin          Vertical Distance from Origin
0.41       Pos   pct_x_from_origin          Horizontal Distance from Origin
0.41       Lex   entryType                  Named Entity Type
0.41       Lex   otherWText                 Word following token
0.40       Lex   stanford_lemma             Lemmatized token
0.37       Lex   uncacheableAtomicFeatures  n-grams of token text
0.22       Pos   right_indent               Indentation from Right Margin
0.16       Pos   v_half                     Top or Bottom of Document
0.13       Lex   wText                      Token Text
0.12       Lex   stanford_pos               Part-of-Speech

Figure 6: Information Gain metrics for top 10 features

Each type of feature (positional, lexical, and contextual) helps the model determine the role of the text on the page. While that doesn't necessarily mean that is the information humans use to make the determination, it is a relatively intuitive result: the position, type of word, and relation to the words around it all indicate the function of each word in the text.

6. EXPERIMENTS
Our hypothesis is that a combination of positional, lexical, and contextual information can be used to determine the function of each word on the page. To test this, a dataset of case metadata was developed, with features extracted about each token, including positional and lexical information. Each token was then assigned to the metadata label of the field where it occurred, and a machine learning algorithm was trained to predict the label based on the features of the token. This was treated as a multi-label training task in which the F1 score was calculated separately for all tokens occurring in each document field, i.e., for each label (document title, case caption, etc.). This may enable future researchers who are interested in only a subset of the metadata to get a baseline for the difficulty of extraction.

An ablation study was performed in which each model was evaluated with each type of feature, lexical or positional, present or absent, in order to determine its relative contribution to classification accuracy. Finally, the predictive accuracies of several alternative predictive models were compared, including standard classifiers along with 1D and 2D sequence models. The results of each of these experiments, in terms of mean F1-score across all labels, are shown in Figures 7-9.

7. RESULTS
As Figure 7 shows, some metadata labels are reliably predictable using a combination of positional and lexical features. The degree of accuracy on some labels, as high as 90%, could be useful in many extraction tasks. Further, results were significantly improved by adding both positional and lexical features, and by using models that consider sequences, such as CRFs.

Label         F1    Label         F1
body          0.91  doc_title     0.68
court         0.90  case_type     0.67
date_addr     0.84  typed_sig     0.64
valediction   0.82  form_number   0.63
salutation    0.79  venue         0.62
caption       0.78  signer_info   0.59
recip_info    0.72  date          0.55
case_number   0.72  cc_info       0.46
performative  0.71  notary_block  0.46
case_title    0.71  letterhead    0.42

Figure 7: Top 20 metadata labels by max F1 score

In particular, when the word value of a token (i.e., Token Text) was the sole feature, non-sequence models classified most tokens as 'Body,' with a small proportion tagged as 'Court'. This appears to be due to the fact that 'Court' words are a very small and specific set, including 'UNITED', 'STATES', 'DISTRICT', and 'COURT'. Curiously, adding positional features to a standard classifier did little to improve the results. However, when other lexical information was included, the F1-measure increased greatly for most other labels. Including both lexical and positional features improves the results even more, as shown in Figure 8. This is consistent across each of the model types and shows that while the token's position on the page is important, the lexical properties of that token also play a significant role in identifying its label.

Lexical/Positional (F1)  No Lex  Lex
No Pos                   0.23    0.089
Pos                      0.45    0.52

Figure 8: Average model F1 scores across all labels and models, with and without lexical and positional features

In comparing the types of classifiers, the CRFs outperformed standard classifiers in all cases. We used two different implementations of CRFs: Mallet CRF and Mandolin CRF. We chose to compare Mallet to Mandolin because Mandolin could be used for standard classification, 1D, and 2D analysis, while Mallet only included the standard classifier and 1D CRF. However, the Mallet CRF has been around for quite some time, and likely benefitted from significant tuning. Consistent with this surmise, Mallet outperforms Mandolin's 1D and 2D CRFs, as shown in Figure 9.

Models/Dimensions (F1)  MaxEnt  CRF   2D CRF
Mallet                  0.27    0.44  -
Mandolin                0.28    0.37  0.39

Figure 9: Average model F1 scores across all labels and features, organized by algorithm and dimensionality

However, comparing Mandolin's 1D to its 2D CRFs, we see that most labels had an improved F-measure with the 2D. That leads us to believe that with further tuning, the 2D CRF could do quite well, but the Mallet 1D CRF had the best results overall in these experiments.

8. CONCLUSIONS
In these experiments, we found that the metadata labels (i.e., fields) of case captions and footers in US Federal court filings can be predicted using a combination of positional and lexical information. Accuracy was much higher for some fields, such as body, case type, and court, than for others, such as sender and signer info, which are harder to identify. The best performance was observed from the Mallet CRF, indicating that sequence-learning techniques perform better than standard classifiers in the domain of court filings. While the utility of 2D sequence models has an intuitive appeal, we did not find that they increased accuracy over the 1D sequence model.

9. FUTURE WORK
There are several areas for improvement in this task. In general, some rigorous error analysis could be performed to identify major classes of errors. Further tuning of the models may also improve results, and additional training data may allow for other models such as neural nets. Additionally, a more mature 2D CRF implementation, such as GRMM, might improve performance.

Finally, while the initial work aims to label each word in a document, using these labels to predict the label of the 'block' that the text is in is the longer-term objective. This would facilitate information extraction from the entire block.

10. ACKNOWLEDGMENTS
Thanks to Ben Wellner for providing the Mandolin models and support, and to Stacy Petersen, Grace Sullivan, and Ariana Kellogg for annotating the court filings. Special thanks to Sam Bayer for developing the training and testing framework.

11. REFERENCES
Apache Tika - a content analysis toolkit. https://tika.apache.org/.
Clausner, C., Pletschacher, S., & Antonacopoulos, A. (2014). Document Representation Refinement for Precise Region Description. DaTeCH. Madrid, Spain: ACM.
Eskenazi, S., Gomez, P., & Jean-Ogier, M. (2017). A comprehensive survey of mostly textual document segmentation algorithms since 2008. Pattern Recognition, 64, 1-14.
Gabdulkhakova, A., & Hassan, T. (2012). Document Understanding of Graphical Content in Natively Digital PDF Documents. DocEng'12 (pp. 137-140). Paris, France: ACM.
Klampfl, S., & Kern, R. (2015). Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications. SemWebEval 2015, 105-116.
Klampfl, S., Granitzer, M., Jack, K., & Kern, R. (2014). Unsupervised document structure analysis of digital scientific articles. Int. J. Digit. Libr., 14(3-4), 83-99.
Konstas, I., & Lapata, M. (2013). Inducing Document Plans for Concept-to-text Generation. Proceedings of EMNLP 2013, 1503-1514. Seattle, Washington: Association for Computational Linguistics.
Lebourgeois, F. (1996). Localisation de textes dans une image à niveaux de gris. Colloque National sur l'Écrit et le Document.
Mao, S., Rosenfeld, A., & Kanungo, T. (2003). Document Structure Analysis Algorithms: A Literature Survey. SPIE Electronic Imaging, 5010, 197-207.
Mencia, E. L. (2009). Segmentation of Legal Documents. ICAIL'09, 88-97. Barcelona, Spain: Association for Computing Machinery.
O'Gorman, L. (1993). The Document Spectrum for Page Layout Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 15, No. 11.
Ramakrishnan, C., Patnia, A., Hovy, E., & Burns, G. A. (2012). Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine, 7:7.
Souafi-Bensafi, S., Parizeau, M., Lebourgeois, F., & Emptoz, H. (2001). Logical Labeling using Bayesian Networks. 6th Int. Conf. on Doc. Anal.
Trompper, M., & Winkels, R. (2016). Automatic Assignment of Section Structure to Texts of Dutch Court Judgements. JURIX 2016, 167-172.
Zhu, J., Nie, Z., Wen, J.-R., Zhang, B., & Ma, W.-Y. (2005). 2D Conditional Random Fields for Web Information Extraction. Proceedings of the 22nd International Conference on Machine Learning. Bonn, Germany.