Document Segmentation Labeling Techniques for Court Filings

Alex Lyte, Karl Branting
The MITRE Corporation
{alyte,lbranting}@mitre.org

ABSTRACT
Arguments, motions, and decisions in courts of the United States of America are recorded in PDF documents filed in each court's docket. Utilization of these documents as data requires accurate and efficient information extraction methods. We take a supervised machine learning approach to a portion of this task, predicting metadata labels in court filings. On a dataset of about 2500 annotated scanned PDF images with 21 labels, we found that traditional classifiers such as MaxEnt achieved an average F1-score of 0.44 (micro-averaged across labels), with the highest label (Body) at 0.88. However, a 1-dimensional model of sequences in the text, Mallet's CRF implementation, achieved an average F1-score of 0.6 across all labels, with some labels as high as 0.91. These results demonstrate the value of using sequence models over traditional classifiers in labeling the types of information in court filings.

In: Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), June 21, 2019, Montreal, QC, Canada.
© 2019 Copyright held by the owner/author(s). Copying permitted for private and academic purposes. Published at http://ceur-ws.org.
Approved for public release; distribution unlimited. Public Release Case Number 9-1137. © 2019 The MITRE Corporation. All Rights Reserved.

1. INTRODUCTION
A court filing is a legal document submitted to a court that triggers an event in a legal proceeding. Court filings indicate what the case is about, why it should be in that court, and the grounds for the legal dispute. The legal effect of a filing depends critically on the event that it is intended to trigger (e.g., dismissal, answer to complaint, substitution of counsel), the role of the filer (plaintiff, defendant, court, intervener, etc.), the context of the filing (e.g., the case, the previous filing, if any, that it is intended to respond to), whether it has been properly signed, and other document characteristics. Any process for automated analysis of court filings must determine the contents of these fields, which we refer to as "metadata", to distinguish it from the content of the body of a document.

A simple example of the importance of automated metadata extraction is automated document quality control; that is, detection of discrepancies between the document metadata (such as the case number) and the metadata specified by the filer (e.g., the number of the case that the document was filed into). The shift to electronic filing systems, such as the US Federal Judiciary's CM/ECF system, by increasing numbers of courts means that filings are no longer inspected for errors by an intake clerk. Instead, this function is often performed by quality-control staff. Automating this process would free limited court resources for more productive purposes (Branting 2017).

However, automated extraction of document metadata requires identifying the type and location of the fields in the case caption and footer. This process could be assisted by machine transcription, but there are several challenges. For one, many documents are first printed on paper and then scanned into PDF form. Thus, a common format for these documents is an image, rather than plain text or XML. Moreover, recovering the layout of native PDF documents can itself be challenging, as described below.

There are tools available for image analysis, as well as for converting documents to plain text or XML, such as Apache Tika. But further challenges arise in how the information is laid out on the page. There is some structure in the layout of a court filing: the court is at the top of the page, with the parties below it, and the document number to the right of the parties. However, the actual physical position of this information can vary based on the amount of text and the conventions of the court. Many courts have small variations in how the information is presented, such as right-justifying vs. centering the court, or putting the document number at the top of the page.

Since there is no fixed location of information on each page, and rarely any indicative metadata, it becomes very difficult to automatically determine which piece of text is the court, the parties, and the document number. Additionally, things like stamps and signatures are often placed arbitrarily on the page, introducing noise in any image-to-text conversion.

When a document image is converted into XML via a conversion tool like Apache Tika, there are a number of features that can be taken from the new structure. In this paper, we attempt to assign a label to each word using both lexical and positional features. Positional features include the x and y position of each word, the quadrant of the page it is in, and the distance from other words around it. Lexical features include the word itself, the word case, the word type, and indicators of the word matching typical words in each type.

In our analysis, we find that positional features alone are not sufficient to classify most words, but reasonable performance can be obtained by including both lexical and positional features.

2. RELATED WORK
Several research communities have been active in document analysis, including historians, librarians, scientists, legal technologists, and those in government. Each community comes with a different set of data and goals, but all follow a similar processing framework.

There are several ways to approach information extraction from documents.
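As a concrete illustration of the positional features described in the introduction (x/y position and page quadrant), the following minimal sketch derives them from an hOCR-style pixel bounding box. The function and feature names are hypothetical illustrations, not the paper's actual toolchain.

```python
def positional_features(bbox, page_w, page_h):
    """Derive simple positional features for one OCR token.

    bbox is an hOCR-style pixel box (x0, y0, x1, y1); page_w and page_h
    are the page dimensions in pixels.  All names are illustrative.
    """
    x0, y0, x1, y1 = bbox
    cx = (x0 + x1) / 2.0            # token center
    cy = (y0 + y1) / 2.0
    pct_x = cx / page_w             # horizontal distance from the origin
    pct_y = cy / page_h             # vertical distance from the origin
    quadrant = (1 if pct_x < 0.5 else 2) + (0 if pct_y < 0.5 else 2)
    return {
        "pct_x_from_origin": round(pct_x, 3),
        "pct_y_from_origin": round(pct_y, 3),
        "quadrant": quadrant,       # 1=top-left, 2=top-right, 3=bottom-left, 4=bottom-right
    }
```

For example, a word box (100, 100, 300, 140) on a 2550x3300-pixel page (a 300-dpi letter scan) falls in quadrant 1, near the origin, which is where a court name would typically appear.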
One of the first tasks is separating the elements of the page, a process called segmentation. (Mao, Rosenfeld, & Kanungo, 2001) distinguish between physical and logical layout segmentation. Physical segmentation includes identifying the lines, spaces, blocks, and other elements on the page. Logical segmentation seeks to categorize these elements by their function (e.g., headers, footers, content trees). Methods of logical segmentation include rule-based approaches, comparison against knowledge-bases, and unsupervised learning.

More recently, researchers have approached the problem by converting the elements on the page into vectors and using supervised machine-learning models to classify the logical function of each element on the page. (Souafi-Bensafi, Parizeau, Lebourgeois, & Emptoz, 2001), for example, identified a hierarchy of geometric text blocks in various publications and, along with typographical information, constructed a vector representation for each word. They then used a Bayesian network classifier to label the logical function of each word.

Standard classifiers, such as SVMs, Bayes nets, and random forests, can be considered 0-dimensional models, in that they only consider the features of each token, but not the sequence of tokens around it. Sequence learning algorithms, such as Conditional Random Fields (CRF), can be considered 1-dimensional classifiers, in that they consider the features of the elements before and after each token. (Trompper & Winkels) used a CRF model to classify header types in Dutch court documents from XML and found that CRFs outperformed a deterministic tagger.

Two-dimensional sequence learners can consider sequences of tokens in multiple directions and can thus exploit horizontal and vertical relationships between elements in documents. In '2D Conditional Random Fields for Web Information Extraction', (Zhu, Nie, Wen, Zhang, & Ma) successfully used a 2D CRF to classify sections of web pages.

In this paper, we focus on assigning logical labels to words in each court filing. We converted each scanned PDF into hierarchical OCR (XML) using Apache Tika and developed positional and linguistic features for each word token. We then compared 0-, 1-, and 2-dimensional models to identify the relevant sections of the page.

3. APPROACH
In this work, a labeled dataset was constructed from scanned PDFs of court filings. This was done using an annotation tool called the MITRE Annotation Tool (MAT), developed by The MITRE Corporation. This tool contains resources for creating, maintaining, and scoring annotated corpora of page images. The tool contains a set of annotation guidelines which we settled on after a number of rounds of pilot annotation. These guidelines focus on the first and last pages of court filings and legal letters. The annotator is asked to locate the major, non-nested sections of these pages (signatures, caption, court, body, etc.), as well as non-text stamps (such as received stamps), which are annotated for future reference. The annotation tool is Web-based and provides a graphical tool for identifying blocks and labeling them. In comparison mode, the tool can compare two annotators' efforts to each other.

The tool exploits a position-aware OCR output format known as hOCR, which presents each word along with its pixel-level location block on the page from which it was extracted. This position awareness allows us to score annotator blocks against each other, by determining which words are within each annotator block and how many of the words are in common between blocks. This allows the scorer to ignore slight variations in the actual x/y locations of the blocks and focus on how much content is in common.

Once the documents were annotated and converted into XML with labels, a toolchain was constructed to build models for automated inference of the textual (non-stamp) blocks given the hOCR output.

The fundamental problem with standard text-based approaches is that the text on these pages is not running text, but rather in blocks, so serializing the blocks in a standard line-oriented way may obscure the structure of the document and lead to problems applying standard structural techniques. Our hypothesis has been that using a graphical modeling inference strategy, allowing us to create much more structurally sophisticated contextual dependencies among elements, including 2-dimensional geometry, would enhance our ability to learn the location of these blocks.

Our strategy is an enhancement of the standard classification approach. Our goal has been to be able to compare multiple strategies to each other, including those strategies which build on these sophisticated contextual dependencies. Therefore, we've built a general-purpose experimentation harness for this family of classifiers.

First, from the hOCR output for a given page, the tool constructs a set of features for each token in the document. These features can be atomic features, string-valued features, or float-valued features. These features include:

• case features, related to the capitalization pattern of the token
• digit and garbage features, related to the distribution of digits and non-alphabetic characters in the token
• word and ngram features, related to the character sequence of the token
• tag features, derived from applying the Stanford toolkit named entity tagger to the linearized text (these features are not likely to do much work for us, given the known problems with simply serializing this text line-by-line)
• similarity features, which identify the best reasonably close match between the token and some of the case metadata for the document (e.g., the names of the parties or attorneys)
• 2-dimensional location features, which indicate the position of the token on the page (what quadrant it's in, and what percentage from the origin it is)
• margin features, indicating words on the margin and whether they're indented

There can also be features on links between tokens, e.g., whether two tokens are farther apart than the average or median distance between tokens in the horizontal direction, whether two tokens are more than one line apart in the vertical direction, or whether two tokens are on the same line.

This array of features, then, provides two levels of position sensitivity: first, on the token level, with the 2-dimensional location features, and second, with links between the tokens, for engines which recognize such features.

We explored three classes of algorithms:

• 0-dimensional token classifiers, represented by a maximum-entropy algorithm, implemented separately by the MALLET¹ engine and by the Mandolin² engine.
• 1-dimensional linear CRF, also implemented with both the MALLET and Mandolin engines.
• 2-dimensional CRF, where the dimension here refers not to geometric dimensions but to abstract properties of the engine. Our goal, however, has been to use these properties to encode context dependencies in two dimensions. This was implemented only with Mandolin; a MALLET-equivalent (GRMM) implementation was attempted but unsuccessful.

Only the Mandolin engine explicitly represents links between tokens. We model our 2 geometric dimensions by computing unobstructed overlap between tokens in the vertical direction, as well as using line adjacency in the horizontal direction. Only the 2-dimensional model captures feature information in the vertical direction in our approach.

Figure 2: Examples of a case caption and footer with labeled fields. Each court document contains the name of the court, the parties in the case, the case number, and the document title.

Each word in the document is extracted, and positional and lexical features are determined from the words and their context.
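The link features described above can be sketched as follows. This is an illustrative reimplementation under assumed inputs (token boxes in reading order, in pixels), not the MALLET/Mandolin feature code itself; all names are hypothetical.

```python
from statistics import median

def same_line(box_a, box_b, tol=0.5):
    """True if two token boxes (x0, y0, x1, y1) overlap vertically by at
    least `tol` of the shorter box's height, i.e., they share a text line."""
    overlap = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    shorter = min(box_a[3] - box_a[1], box_b[3] - box_b[1])
    return overlap >= tol * shorter

def link_features(boxes):
    """For each adjacent pair of token boxes, emit link features: a
    same-line indicator, and whether the horizontal gap between the pair
    exceeds the median gap on the page."""
    gaps = [b[0] - a[2] for a, b in zip(boxes, boxes[1:])]
    med = median(gaps) if gaps else 0
    feats = []
    for (a, b), gap in zip(zip(boxes, boxes[1:]), gaps):
        feats.append({
            "same_line": same_line(a, b),
            "wide_gap": gap > med,   # farther apart than the median distance
        })
    return feats
```

An unusually wide gap between same-line neighbors is the kind of cue that separates, say, a party name from the document number printed to its right.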
Several machine learning algorithms were then used to construct models to predict the labels based on the training data. The data was separated into batches, with each batch containing about 150 documents. Overall, about 22 batches were used for training, and 2 batches were used for testing. Within each batch, each document was divided into words, with features assigned to each word based on its positional and lexical elements.

4. DATA
Our corpus consists of the first and last pages drawn from approximately 2500 court filings, PDFs typified by Figure 1, amounting to about 3500 annotated pages (some documents are only one page long).

Figure 1: Examples of the varied structure of court filings

The number of words with each label varies, with the body containing the most words on average, and the caption a distant second, as illustrated in Figure 3.

Figure 3: Number of occurrences of each type of word

Labels tend to occupy certain regions consistently, though their actual position can vary greatly. As an illustration of this, we plotted the X and Y coordinates of a sample of words, colored by label, in Figure 4.

Figure 4: Sample locations of words, colored by label

¹ http://mallet.cs.umass.edu/
² http://project-mandolin.github.io/mandolin/index.html

5. FEATURE EXTRACTION
Around 25 features of the data were identified and extracted for each token, characterizing its positional, linguistic, and contextual information.

Using Weka's 'Information Gain' evaluator, the features were ranked according to the predictive value they provide. The most highly ranked features, pct_y_from_origin and pct_x_from_origin, represent the position of the token on the page. After that, entryType, stanford_lemma, and uncacheableAtomicFeatures deal with the linguistic properties of the data. Finally, otherWText and right_indent deal with the relative positional information of the token to other tokens in the text.

Info Gain  Type  Feature                    Description
0.44       Pos   pct_y_from_origin          Vertical Distance from Origin
0.41       Pos   pct_x_from_origin          Horizontal Distance from Origin
0.41       Lex   entryType                  Named Entity Type
0.41       Lex   otherWText                 Word following token
0.40       Lex   stanford_lemma             Lemmatized token
0.37       Lex   uncacheableAtomicFeatures  n-grams of token text
0.22       Pos   right_indent               Indentation from Right Margin
0.16       Pos   v_half                     Top or Bottom of Document
0.13       Lex   wText                      Token Text
0.12       Lex   stanford_pos               Part-of-Speech

Figure 6: Information Gain metrics for top 10 features

Each type of feature (positional, lexical, and contextual) helps the model determine the role of the text on the page. While that doesn't necessarily mean that is the information humans use to make the determination, it is a relatively intuitive result: the position, type of word, and relation to the words around it all indicate the function of each word in the text.

6. EXPERIMENTS
Our hypothesis is that a combination of positional, lexical, and contextual information can be used to determine the function of each word on the page. To test this, a dataset of case metadata was developed, with features extracted about each token, including positional and lexical information. Each token was then assigned to the metadata label of the field where it occurred, and a machine learning algorithm was trained to predict the label based on the features of the token. This was treated as a multi-label training task in which the F1 score was calculated separately for all tokens occurring in each document field, i.e., for each label (document title, case caption, etc.). This may enable future researchers who are interested in only a subset of the metadata to get a baseline for the difficulty of extraction.

An ablation study was performed in which each model was evaluated with each type of feature, lexical or positional, present or absent, in order to determine its relative contribution to classification accuracy. Finally, the predictive accuracies of several alternative predictive models were compared, including standard classifiers along with 1D and 2D sequence models. The results of each of these experiments, in terms of mean F1-score across all labels, are shown in Figures 7-9.

7. RESULTS
As Figure 7 shows, some metadata labels are reliably predictable using a combination of positional and lexical features. The degree of accuracy on some labels, as high as 90%, could be useful in many extraction tasks. Further, results were significantly improved by adding both positional and lexical features, and by using models that consider sequences, such as CRFs.

Label         F1    Label         F1
body          0.91  doc_title     0.68
court         0.90  case_type     0.67
date_addr     0.84  typed_sig     0.64
valediction   0.82  form_number   0.63
salutation    0.79  venue         0.62
caption       0.78  signer_info   0.59
recip_info    0.72  date          0.55
case_number   0.72  cc_info       0.46
performative  0.71  notary_block  0.46
case_title    0.71  letterhead    0.42

Figure 7: Top 20 metadata labels by max F1 score

In particular, when the word value of a token (i.e., Token Text) was the sole feature, non-sequence models classified most tokens as 'Body,' with a small proportion tagged as 'Court'. This appears to be due to the fact that 'Court' words are a very small and specific set, including 'UNITED', 'STATES', 'DISTRICT', and 'COURT'. Curiously, adding positional features to a standard classifier did little to improve the results. However, when other lexical information was included, the F1-measure increased greatly for most other labels. Including both lexical and positional features improves the results even more, as shown in Figure 8. This is consistent across each of the model types and shows that while the token's position on the page is important, the lexical properties of that token also play a significant role in identifying its label.

Lexical/Positional (F1)  No Lex  Lex
No Pos                   0.23    0.089
Pos                      0.45    0.52

Figure 8: Average model F1 scores across all labels and models, with and without lexical and positional features

In comparing the types of classifiers, the CRFs outperformed standard classifiers in all cases. We used two different implementations of CRFs: Mallet CRF and Mandolin CRF. We chose to compare Mallet to Mandolin because Mandolin could be used for standard classification, 1D, and 2D analysis, while Mallet only included the standard classifier and 1D CRF. However, the Mallet CRF has been around for quite some time, and likely benefitted from significant tuning. Consistent with this surmise, Mallet outperforms Mandolin's 1D and 2D CRFs, as shown in Figure 9.

Models/Dimensions (F1)  MaxEnt  CRF   2D CRF
Mallet                  0.27    0.44  -
Mandolin                0.28    0.37  0.39

Figure 9: Average model F1 scores across all labels and features, organized by algorithm and dimensionality

However, comparing Mandolin's 1D to its 2D CRFs, we see that most labels had an improved F-measure with the 2D. That leads us to believe that with further tuning, the 2D CRF could do quite well, but the Mallet 1D CRF had the best results overall in these experiments.

8. CONCLUSIONS
In these experiments, we found that the metadata labels (i.e., fields) of case captions and footers in US Federal court filings can be predicted using a combination of positional and lexical information. Accuracy was much higher for some fields, such as body, case type, and court, than for others, such as sender and signer info, which are harder to identify. The best performance was observed from the Mallet CRF, indicating that sequence-learning techniques perform better than standard classifiers in the domain of court filings. While the utility of 2D sequence models has an intuitive appeal, we did not find that they increased accuracy over the 1D sequence model.

9. FUTURE WORK
There are several areas for improvement in this task. In general, some rigorous error analysis could be performed to identify major classes of errors. Further tuning of the models may also improve results, and additional training data may allow for other models such as neural nets. Additionally, a more mature 2D CRF implementation, such as GRMM, might improve performance.

Finally, while the initial work aims to label each word in a document, using these labels to predict the label of the 'block' that the text is in is the longer-term objective. This would facilitate information extraction from the entire block.

10. ACKNOWLEDGMENTS
Thanks to Ben Wellner for providing the Mandolin models and support, and to Stacy Petersen, Grace Sullivan, and Ariana Kellogg for annotating the court filings. Special thanks to Sam Bayer for developing the training and testing framework.

11. REFERENCES
Apache Tika - a content analysis toolkit. https://tika.apache.org/.
Clausner, C., Pletschacher, S., & Antonacopoulos, A. (2014). Document Representation Refinement for Precise Region Description. DaTeCH. Madrid, Spain: ACM.
Eskenazi, S., Gomez, P., & Jean-Ogier, M. (2017). A comprehensive survey of mostly textual document segmentation algorithms since 2008. Pattern Recognition, 64, 1-14.
Gabdulkhakova, A., & Hassan, T. (2012). Document Understanding of Graphical Content in Natively Digital PDF Documents. DocEng'12 (pp. 137-140). Paris, France: ACM.
Klampfl, S., & Kern, R. (2015). Machine Learning Techniques for Automatically Extracting Contextual Information from Scientific Publications. SemWebEval 2015, 105-116.
Klampfl, S., Granitzer, M., Jack, K., & Kern, R. (2014). Unsupervised document structure analysis of digital scientific articles. Int. J. Digit. Libr., 14(3-4), 83-99.
Konstas, I., & Lapata, M. (2013). Inducing Document Plans for Concept-to-text Generation. Proceedings of EMNLP 2013, 1503-1514. Seattle, Washington: Association for Computational Linguistics.
Lebourgeois, F. (1996). Localisation de textes dans une image à niveaux de gris. Colloque National sur l'Écrit et le Document.
Mao, S., Rosenfeld, A., & Kanungo, T. (2003). Document Structure Analysis Algorithms: A Literature Survey. SPIE Electronic Imaging, 5010, 197-207.
Mencia, E. L. (2009). Segmentation of Legal Documents. ICAIL'09, 88-97. Barcelona, Spain: Association for Computing Machinery.
O'Gorman, L. (1993). The Document Spectrum for Page Layout Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 15, No. 11.
Ramakrishnan, C., Patnia, A., Hovy, E., & Burns, G. A. (2012). Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine, 7:7.
Souafi-Bensafi, S., Parizeau, M., Lebourgeois, F., & Emptoz, H. (2001). Logical Labeling using Bayesian Networks. 6th Int. Conf. on Doc. Anal.
Trompper, M., & Winkels, R. (2016). Automatic Assignment of Section Structure to Texts of Dutch Court Judgements. JURIX 2016, 167-172.
Zhu, J., Nie, Z., Wen, J.-R., Zhang, B., & Ma, W.-Y. (2005). 2D Conditional Random Fields for Web Information Extraction. Proceedings of the 22nd International Conference on Machine Learning. Bonn, Germany.