=Paper=
{{Paper
|id=Vol-2482/paper31
|storemode=property
|title=Structural Analysis of Contract Renewals
|pdfUrl=https://ceur-ws.org/Vol-2482/paper31.pdf
|volume=Vol-2482
|authors=Frieda Josi,Christian Wartena
|dblpUrl=https://dblp.org/rec/conf/cikm/JosiW18
}}
==Structural Analysis of Contract Renewals==
<pdf width="1500px">https://ceur-ws.org/Vol-2482/paper31.pdf</pdf>
<pre>
                  Structural Analysis of Contract Renewals

                      Frieda Josi                                        Christian Wartena
             frieda.josi@hs-hannover.de                          christian.wartena@hs-hannover.de
                                      University of Applied Sciences and Arts Hanover
                                         Expo Plaza 12, 30539 Hanover, Germany


                                                                  intelligent support for the legal domain, we present an
                                                                  approach in which we convert contracts, based on PDF
                         Abstract                                 documents, into a structured XML format in order to
                                                                  efficiently find the changed, added or deleted clauses
     In the present paper we sketch an automated                  in the new contract version.
     procedure to compare different versions of a                     For all changed clauses we will predict the impact of
     contract. The contract texts used for this pur-              the change, or at least determine whether the change is
     pose are structurally differently composed PDF               only a stylistic or linguistic improvement or correction
     files that are converted into structured XML                 or whether the interpretation of the clause is touched.
     files by identifying and classifying text boxes.             Furthermore, for all changed and new clauses we will
     A classifier trained on manually annotated con-              check whether the clause is part of a collection of stan-
     tracts achieves an accuracy of 87% on this                   dard clauses or was used in another contract before. In
     task. We align contract versions and classify                the present paper, we demonstrate a first version of the
     aligned text fragments into different similarity             detection of changes in the contracts. Our procedure
     classes that enhance the manual comparison                   was developed and evaluated with German contract
     of changes in document versions. The main                    texts, but the method is language agnostic and can be
     challenges are to deal with OCR errors and                   applied to contracts in other languages as well.
     different layout of identical or similar texts.
                                                                      For the development of the methods we got access
    We demonstrate the procedure using some                       to a collection of 100,000 contracts of an insurance
    freely available contracts from the City of Ham-              company. Since the contracts cannot be made available
    burg written in German. The methods, how-                     publicly, we used a small set of freely available contracts
    ever, are language agnostic and can be applied                for the present study.
    to other contracts as well.                                       Our approach basically consists of four steps: first
                                                                  we extract rectangular text areas from the PDF doc-
1    Introduction                                                 ument. In the second step we classify all text areas
Most contracts between insurance and reinsurance                  into structural classes like header, footer, heading, etc.
companies are updated annually. This results in many              and merge some adjacent areas of the same type. On
versions of a contract which are structurally and                 the basis of this structure two documents are aligned.
content-wise similar, but which must be completely checked        Finally, the aligned text areas are compared in more
again for a new contract approval. A main obstacle                detail. An overview of the process flow of our structure
for efficient comparison of old and new versions of the           analysis of versions of legal texts is shown in Figure 1.
contracts is the fact that the entire approval process is
paper based. Insurance companies might send paper                     In the following we describe related work, a detailed
versions of the contracts to several reinsurance                  description of the approach and an evaluation of the
companies, each of which puts stamps and signatures on the        classifier trained for classification of the text areas.
contract.
   Of course all contracts are scanned and stored                 2    Related Work
electronically, but the paper version takes precedence. As        Gao et al. (2011) use a method similar to
Copyright © CIKM 2018 for the individual papers by the papers'    ours for PDF files to analyze the structure of books.
authors. Copyright © CIKM 2018 for the volume as a collection     After converting the PDF, the content is extracted into
by its editors. This volume and its papers are published under    a physical and logical structure, the text modules are
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).
                                                            as an indicator for a text element in the header or footer
                                                            and the length of the text. With the coordinates of the
                                                            text blocks in the PDF files a structural sorting per
                                                            page is possible. The recognition and merging of con-
                                                            tiguous text blocks from extracted PDF files is e.g. used
                                                            by Ramakrishnan et al. (2012). There is some work
                                                            dealing with extracting named entities (such as compa-
                                                            nies, persons, places, etc.) from legal texts or finding
                                                            references to laws (Dozier et al., 2010; Schweighofer,
                                                            2010; Nanda et al., 2017). In (Nanda et al., 2017) the
                                                            vocabulary IATE (Inter-Active Terminology for Eu-
                                                            rope) is used to create an annotated corpus of named
                                                            entities and to use it for the NER for European and
                                                            British legal documents. Chalkidis et al. (2017) use a
                                                            combination of state-of-the-art methods (such as word
                                                            embeddings, and part-of-speech tag embeddings) to
                                                            extract typical contract elements from contract texts.
                                                            The conversion of content from the layout format of
                                                            a PDF file to the structured format of an XML file
                                                            with a small amount of human interaction is done as
                                                            described by Paick and Zhang (2004). The similarity of
                                                            the contract versions is compared with the text blocks
                                                            of the XML output. The word overlap is used as a
                                                            measure for the agreement between two text blocks of
                                                            the contract changes. This approach is described by
                                                            Klampfl et al. (2014).

                                                            3     Legal text structure analysis
                                                            This section describes our approach to analyzing the PDF
                                                            structure and finding the differences between contract
                                                            versions.
                                                               A simple line-by-line comparison of documents makes
                                                            no sense, since the addition of a single word can already
                                                            change the position of line or page breaks. Furthermore,
Figure 1: Procedure of structure analysis of versions of legal texts.
                                                            contracts are usually highly structured texts with lists
                                                            of definitions, figures, headers and footers on each page.
parsed and displayed. However, since these are books,       Figure 2 gives an example page of one of the contracts
Gao et al. could assume that all pages have the             we used. A simple extraction of all text will disturb
same layout. This enabled the definition of global ty-      the natural text flow and insert header and footer text
pographies. The authors divided the logical structure       at arbitrary points in the contract text. Thus, we
into a page level and a document level. The page level      prefer to extract blocks of texts, align the blocks of two
contains the hierarchical order of the text elements, the   documents and compare the document block by block.
header, figures, tables and footnotes. The document
level included the writers’ chapter structure and meta-     3.1    Document collection
data. For the extraction of the logical structure at page   For training a classifier we use 4 non-public insurance
level, the texts and individual letters were extracted      documents and 3 publicly available contracts. These
from these text blocks to obtain additional characteris-    contracts are part of the open data strategy of the City
tics such as boldface for a heading. For example, the       Administration Hamburg1 . These 7 PDF documents
extraction of the logical structure at document level       consist in total of 198 pages.
contained the title of the book. For header and footer         From these pages we extracted 4046 text boxes using
recognition we use a layout-based approach similar to       PDFMiner2 and classified them by hand. Figure 3
that of Déjean and Meunier (2006).                            1 Transparenzportal Hamburg:       http://transparenz.
   This approach is based on the use of geometric coor-     hamburg.de/
dinates. In addition, they use the occurrence of digits        2 PDFMiner: https://pypi.org/project/pdfminer/
                                                               Figure 3: Example of Manual Classification
                                                         3.2   Detection of structural elements
                                                         The 4046 text boxes from the contract texts were clas-
                                                         sified with the following classes: header, heading, enu-
                                                         meration, body text and footer. The twenty features
                                                         extracted or calculated for each text box are:

                                                           • the coordinates of the lower left and upper right
                                                             corners of the text box (x1, y1, x2, y2)
Figure 2: Example page from the contract texts for the     • the free margin on each side (m1, m2, m3, m4)
prediction model.                                          • the fact whether there is a neighboring text box
shows an example.                                            on each side (nb1, nb2, nb3, nb4)
   The insurance contracts are written in English, the     • the font styles bold and upper (bold, upper)
contracts from Hamburg in German. Since our ap-            • enumeration elements in text box (enum)
proach is completely language agnostic, the documents
                                                           • the size of the text box (area)
can be mixed for training without any problem.
   For the evaluation of the alignment and comparison      • height and width of the text box (height, width)
of contract versions we used 5 documents from the          • number of letters in text box (length)
City Administration Hamburg for which at least two
                                                           • fraction of special characters in text box (spec)
versions are available. In the process, care was taken
to ensure that there were different degrees of change.      We obtain the coordinates of the text boxes, the font
The selected contract versions were:                     (bold and upper) and the text of each box from the
                                                         parse of PDFMiner. The other features were calculated
  • HH1a/HH1b: version with additions                    based on this information. The feature ”enumeration”
                                                         indicates whether the text of the box matches the
  • HH2a/HH2b: very different (by many handwrit-         following regular expression (in Perl Syntax):
    ten notes)
                                                         ”\(?([0−9]+|[A−Za−z])(\.([0−9]+|[A−Za−z]))∗\)?$”
  • HH3a/HH3b: very similar contracts with differ-
    ent contractual partners                                The distances to the adjacent elements were calcu-
                                                         lated from their horizontal and vertical overlapping of
  • HH4a/HH4b: same, but scanned at a different          the coordinates and their distances to the right and
    angle                                                left element. The distance to the margins and the size
                                                         of the text field were also computed. The features for
  • HH5a/HH5b: year variants                             bold and for uppercase indicate whether all characters
                                                         in a text box are typeset in the respective way. Since
   The exact names and URLs of all test documents        headings are often written in this way, we expect this
used are given in the Appendix.                          to be a useful feature. From the text we calculate also
the fraction of special (non alpha-numeric) characters.
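
The purely text-based features described above (enum, length, spec) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the regular expression is applied to the whole stripped text of a box, whitespace is not counted as a special character, and length is taken as the total number of characters.

```python
import re

# Pattern as quoted in the paper. Anchoring it at the start via re.match and
# applying it to the stripped text is an assumption of this sketch.
ENUM_PATTERN = re.compile(r"\(?([0-9]+|[A-Za-z])(\.([0-9]+|[A-Za-z]))*\)?$")


def is_enumeration(text: str) -> bool:
    """True if the whole text box looks like an enumeration label."""
    return ENUM_PATTERN.match(text.strip()) is not None


def special_fraction(text: str) -> float:
    """Fraction of characters that are neither alphanumeric nor whitespace.

    Excluding whitespace from the special characters is an assumption.
    """
    if not text:
        return 0.0
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return special / len(text)


def text_features(text: str) -> dict:
    """Hypothetical helper collecting the text-based features of one box."""
    return {
        "enum": is_enumeration(text),
        "length": len(text),  # assumption: all characters are counted
        "spec": special_fraction(text),
    }
```

Labels such as "3.2", "(a)" or "1.2.3)" are accepted by this pattern, while ordinary running text is rejected.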
Finally, we have calculated the size, width and height
of the individual text fields.
   An SVM (Support Vector Machine) classifier with
RBF Kernel was trained with this data set. The
parameters used for this are γ = 0.1 · 10^-5 and penalty
parameter C = 10. In addition we have calculated a
logistic regression model. The performance results of
SVM and logistic regression were almost identical. The
forecast values of the logistic regression are shown in
the ”Evaluation and Results” section.

Table 1: Confusion Matrix from logistic regression
(rows: real class, columns: predicted class)

  Real \ Pred.   Header   Heading   Enum.   Text   Footer
  Header         701      4         3       8      0
  Heading        12       359       8       217    1
  Enum.          5        13        429     39     0
  Text           7        161       40      1891   5
  Footer         0        0         3       7      133

Table 2: Per class results from logistic regression

  Class     Precision   Recall   f1-score
  Header    0.97        0.98     0.97
  Heading   0.67        0.60     0.63
  Enum      0.89        0.88     0.89
  Text      0.87        0.90     0.89
  Footer    0.96        0.93     0.94
  Overall   0.87        0.87     0.87

3.3   Alignment

For layout-based structure analysis, we have sorted
the text elements on each page from top to bottom
and from left to right if the elements are placed next      on words (i.e. the minimum number of words that
to each other. Adjacent elements that have the same         have to be inserted, deleted or changed to obtain the
class and have a margin between the areas that is           new version from the old one). Then we compute the
smaller than the height of a text line are merged. Thus,    character based edit distance. If the character based
we correct a number of anomalies introduced by the          edit distance is at most 2.5 times larger than the word
detection of text areas. E.g., in many cases the last       based edit distance, all changes in the words are just
line of a paragraph is detected as a separate area, if it   small changes, replacing 2 or 3 characters. In this case
has only one or two words.                                  we assume that all changes are due to OCR errors.
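
The sorting and merging of text areas described in Section 3.3 could look roughly as follows. This is a minimal sketch, not the authors' code: the Box class, the function name and the line_height parameter are hypothetical, coordinates follow the PDF convention with the origin at the lower left, and only vertically stacked, non-overlapping boxes are considered for merging.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # (x1, y1) is the lower left corner, (x2, y2) the upper right corner.
    x1: float
    y1: float
    x2: float
    y2: float
    label: str
    text: str


def merge_adjacent(boxes, line_height):
    """Merge vertically adjacent boxes of the same class on one page."""
    # Reading order: top to bottom (larger y first), then left to right.
    ordered = sorted(boxes, key=lambda b: (-b.y2, b.x1))
    merged = []
    for box in ordered:
        prev = merged[-1] if merged else None
        if prev is not None and prev.label == box.label:
            gap = prev.y1 - box.y2  # vertical margin between the two areas
            if 0 <= gap < line_height:
                # Replace the previous box by the bounding box of both.
                merged[-1] = Box(
                    min(prev.x1, box.x1),
                    min(prev.y1, box.y1),
                    max(prev.x2, box.x2),
                    max(prev.y2, box.y2),
                    prev.label,
                    prev.text + " " + box.text,
                )
                continue
        merged.append(box)
    return merged
```

With such a rule, a short last line that was detected as a separate area is attached again to the paragraph directly above it.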
   For the alignment of the text boxes we consider          However, we did not (yet) determine an optimal value
insertions, deletions and substitutions. For insertions     for this threshold.
and deletions we assign a penalty of 1. The penalty for
a substitution of text t_1 with t_2 is defined as           4     Evaluation and Results
                                                            4.1     Classifier
   D(t_1, t_2) = 1 - \frac{|v(t_1) \cap v(t_2)|}{|v(t_1) \cup v(t_2)|}
                                                            Using 10-fold cross validation the accuracy of the classifier
where v(t) denotes the set of words, excluding stop         (logistic regression) is 87%. The accuracy of the
words, of t. Using dynamic programming we find the          majority classifier, that assigns each element to the
alignment with the minimum sum of penalties.                class body text, is 52%.
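
The alignment of Section 3.3 can be illustrated with a small dynamic program. The sketch below assumes whitespace tokenization, lowercasing and a tiny illustrative stop word list. None of these details, nor the function names, are given in the paper. Aligned pairs are reported as "match" regardless of their substitution penalty.

```python
# Illustrative stop words only; the list actually used is not given.
STOP_WORDS = frozenset({"der", "die", "das", "und", "the", "of", "and"})


def vocabulary(text):
    """Set of lowercased words of a text block, excluding stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def substitution_penalty(t1, t2):
    """One minus the word overlap (Jaccard index) of the two blocks."""
    v1, v2 = vocabulary(t1), vocabulary(t2)
    union = v1 | v2
    if not union:
        return 0.0
    return 1.0 - len(v1 & v2) / len(union)


def align(old, new):
    """Return (total penalty, operations) of a minimum-penalty alignment."""
    n, m = len(old), len(new)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = float(i)  # delete all remaining old blocks
    for j in range(1, m + 1):
        cost[0][j] = float(j)  # insert all remaining new blocks
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1,  # deletion
                cost[i][j - 1] + 1,  # insertion
                cost[i - 1][j - 1] + substitution_penalty(old[i - 1], new[j - 1]),
            )
    # Trace back one optimal path from the lower right corner.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == (
            cost[i - 1][j - 1] + substitution_penalty(old[i - 1], new[j - 1])
        ):
            ops.append(("match", old[i - 1], new[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", old[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, new[j - 1]))
            j -= 1
    ops.reverse()
    return cost[n][m], ops
```

Because a substitution never costs more than 1 while a deletion followed by an insertion costs 2, two blocks are paired even when they share no words, and the later comparison step then decides how different they are.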
   For the 10 test documents we find on average 24              As we can see from the confusion matrix (Table
text blocks per page after merging adjacent blocks.         1) and per class results (Table 2) the best results are
                                                            achieved for the most important classes: the header
3.4   Version Comparison                                    and footer. These classes contain text that is not part
                                                            of the contract text and has to be separated clearly.
Once two texts are aligned, we can start comparing          Most problems arise from confusion between headings
the documents. At the moment we do not analyze              and body text.
insertions and deletions. With a simple heuristic we            The contribution of each feature for the logistic
try to classify pairs of aligned text fragments. We         regression model is given in Figure 4. The boolean value
distinguish between:                                        for an enumeration, the features indicating whether
  • Identical: Texts are identical up to white spaces       there is a text element above and below (nb1+nb2) and
                                                            the fraction of special characters in a text element (spec)
  • OCR Errors: Texts are identical, but there are          are used most strongly. Interestingly, the position on
    differences due to OCR errors                           the page and the margins around a text box are hardly
  • Small Differences: At most 5 words inserted,            used.
    deleted or substituted
                                                            4.2     Comparison
  • Different: More than 5 words are changed
                                                            We use the logical structure of the contracts (heading,
To decide whether there are real differences or OCR         enumeration, body text) converted into an XML format
differences we align the texts two times. First we          for the comparison of contract renewals. The results of
tokenize the text and compute the edit distance based       the comparison for the test data can be seen in Table
         Figure 4: Relative Feature Importance

        Table 3: Evaluation version comparison

                      HH1a/   HH2a/   HH3a/   HH4a/   HH5a/
                      HH1b    HH2b    HH3b    HH4b    HH5b
 Inserted             5       50      79      8       20
 Different            4       65      228     43      104
 Identical            15      75      186     58      24
 OCR Diff.            3       39      43      49      8
 Deleted              1       16      25      14      41
 Total text boxes     28      245     561     172     197
 Fraction identical   0.26    0.16    0.31    0.17    0.012

Table 4: Examples version comparison for HH2a vs.
HH2b. Differences are marked in the text.

 Class         Aligned text fragments

 Identical     für die Leistungen nach 3.2 die Kostenschätzung.
               für die Leistungen nach 3.2 die Kostenschätzung.

 OCR Errors    6.1.3 (1) Grundlagenermittlung
               6.1. 3 (1) GrundlageAermittlung ·

 Small Diff.   Fertigstellung der leistungen dieses Vertrages bis
               Ende Juli 2012
               Fertigstellung der Leistungen dieses Vertrages bis
               Ende Oktober 2013

 Different     2.4 ··i Die Baumaßnahme untCFliegt dem
               ZustimFF1uF1gS’1cffelhreF1 Flach § 84 HBauO.
               Die für die veranhvortliene Leitung zuständige
               Person wird der bzw. dem AN sehriftlieh be nannt.
               2.4 ¿ Die Baumaßnahme uAterliegt dem
               Zustimmungsvcrfahren nach § 64 HBauO. Die
               für eli9e verantwoftliehe Leitung zuständige
               Person wird der bzw. dem AN schriftlich be
               ft8flflt:

 Different     § 8 - Ergänzende Vereinbarungen
               und anderen fachlich Beteiligten
                                                           scribed e.g. in (Nanda et al., 2017). Furthermore, the
3. The extracted text boxes are compared as described      text structure can be subdivided in more detail and fur-
in section 3.4. As we can see here, for the text pair      ther structural elements such as text boxes containing
HH3a/HH3b, e.g., our method found 186 identical text       handwritten notes can be included. We will improve
boxes with a text length (measured in characters) of       our approach by carrying out further tests with a larger
30% of the contract. These two contracts consist of a      training corpus, making further parameter settings and
very similar structure but with different contractual      adding additional features such as font size. During
partners. This means that the underwriters no longer       the course of the project, the existing XML structure
have to check these passages in the text of the contract   also will be transformed into a standardized legal XML
for consistency, thus making their work more efficient.    structure, as proposed by ”OASIS LegalXML Elec-
                                                           tronic Court Filing TC”.3 On this basis we plan the
    As we can see in Table 3 there are many text
                                                           clause analysis in the contract texts. The recognized
boxes that have received the comparison degree ”Differ-
                                                           clauses will be checked against a collection of model
ent”. Again, these are often OCR errors, but they are
                                                           clauses and the occurrence of the same or almost same
too numerous to be classified as ”OCR errors” (see the
                                                           clause in other contracts will be checked. We plan to
first example in Table 4 class ”Different”). The second
                                                           visualize the status of each clause, like unchanged, found
example in the class ”Different” shows that errors in
                                                           in another contract, etc.
segmentation and hierarchical sorting also lead to the
classification ”Different”. Another problem is that the       With the visualization of the changes in the contract
text boxes recognized by PDFMiner are not always the       renewals, a tool can then be implemented that provides
same in the two versions and merging does not entirely     valuable support for underwriters and other legal en-
compensate for this, e.g. because one of the elements      tities in their daily work and simplifies and improves
was classified incorrectly.                                their daily work in the long term.
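
The heuristic of Section 3.4 for labelling a pair of aligned text fragments can be sketched as follows. The ratio 2.5 and the limit of 5 words are taken from the text, and the paper notes that the ratio has not been tuned yet. The order of the checks, the whitespace tokenization and all function names are assumptions of this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (x != y),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def classify_pair(old, new, ocr_ratio=2.5, max_small=5):
    """Assign one of the four similarity classes to two aligned fragments."""
    old_tokens, new_tokens = old.split(), new.split()
    if old_tokens == new_tokens:
        return "Identical"  # identical up to white space
    word_dist = edit_distance(old_tokens, new_tokens)
    # Character distance on whitespace-normalized text (an assumption).
    char_dist = edit_distance(" ".join(old_tokens), " ".join(new_tokens))
    if char_dist <= ocr_ratio * word_dist:
        return "OCR Errors"  # every changed word differs by few characters
    if word_dist <= max_small:
        return "Small Differences"
    return "Different"
```

Applied to the first three example pairs of Table 4, this sketch returns the classes listed there. It contains no upper limit on the number of OCR-like changes, so it would not reproduce the cases discussed above in which fragments with very many OCR errors end up in the class "Different".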


5    Discussion and Future Work
                                                           Acknowledgements
In this paper we have shown that modifications in
contract renewals can be identified and analyzed using     The authors would like to thank Fabian Schmieder
supervised learning and text alignment.                    for many helpful discussions and pointing us to the
   We want to continue this approach in further work       publicly available contracts of the City of Hamburg.
and improve the classification of the classes heading,
body text and enumeration. In addition, we want to            3 OASIS    LegalXML:     https://www.oasis-open.org/
implement the recognition of named entities, as de-        committees/tc_home.php?wg_abbrev=legalxml-courtfiling
References

Chalkidis, I., I. Androutsopoulos, and A. Michos (2017).
  Extracting contract elements.

Déjean, H. and J.-L. Meunier (2006). A system for
  converting PDF documents into structured XML
  format. In Document Analysis Systems VII, Lecture
  Notes in Computer Science, pp. 129–140. Springer,
  Berlin, Heidelberg.

Dozier, C., R. Kondadadi, M. Light, A. Vachher,
  S. Veeramachaneni, and R. Wudali (2010). Named
  entity recognition and resolution in legal text. In
  Semantic Processing of Legal Texts, Lecture Notes
  in Computer Science, pp. 27–43. Springer, Berlin,
  Heidelberg.

Gao, L., Z. Tang, X. Lin, Y. Liu, R. Qiu, and Y. Wang
  (2011). Structure extraction from PDF-based book
  documents. In Proceedings of the 11th Annual
  International ACM/IEEE Joint Conference on Digital
  Libraries, JCDL ’11, pp. 11–20. ACM.

Klampfl, S., M. Granitzer, K. Jack, and R. Kern (2014).
  Unsupervised document structure analysis of digital
  scientific articles.

Nanda, R., G. Siragusa, L. Di Caro, M. Theobald,
  G. Boella, L. Robaldo, and F. Costamagna (2017).
  Concept recognition in European and national law.
  In A. Z. Wyner and G. Casini (Eds.), Legal Knowledge
  and Information Systems - JURIX 2017: The
  Thirtieth Annual Conference, Luxembourg, 13-15
  December 2017, Frontiers in Artificial Intelligence and
  Applications, pp. 193. IOS Press.

Paick, Y. Y. K. and Y. P. Y. Zhang (2004). PDF2xml:
  Converting PDF to XML.

Ramakrishnan, C., A. Patnia, E. Hovy, and G. A.
  Burns (2012). Layout-aware text extraction from
  full-text PDF of scientific articles.

Schweighofer, E. (2010). Semantic indexing of legal
  documents. In Semantic Processing of Legal Texts,
  Lecture Notes in Computer Science, pp. 157–169.
  Springer, Berlin, Heidelberg.
Appendix: Used Contracts
                                                     Training Documents
Reference       File name                                       URL
HHTrain1        Akte 611.10-13(1).pdf                           http://suche.transparenz.hamburg.de/dataset/oeffentlich-
                                                                rechtlicher-vertrag-gehrecht-bebauungsplan-harburg-59-theodor-
                                                                york-strasse?forceWeb=true
HHTrain2        Akte FB2a.809.13-25 4(1).pdf                    http://suche.transparenz.hamburg.de/dataset/aenderungsverfahren-
                                                                fuer-vertrag-6328-zuvex-weitere-schritte-zur-anbindung-externer-
                                                                nutzer?forceWeb=true
HHTrain3        Akte FB2a.800.01-2 3(1).pdf                     http://suche.transparenz.hamburg.de/dataset/v6921-
                                                                unterstuetzungsleistung-mobility-vertrag?forceWeb=true
Train data1-4   4 non public reinsurance contracts
                                                      Test Documents
Reference       File name                                      URL
HH1a            Aenderungsbescheid.pdf                         http://suche.transparenz.hamburg.de/dataset/3-planen-zur-
                                                               temporaeren-anbringung-an-einem-baugeruest-zur-bewerbung-
                                                               von-mietwohnungen?forceWeb=true
HH1b            Befristete Genehmigung nach HBauO.pdf          http://suche.transparenz.hamburg.de/dataset/3-planen-zur-
                                                               temporaeren-anbringung-an-einem-baugeruest-zur-bewerbung-
                                                               von-mietwohnungen1?forceWeb=true
HH2a            Akte 000.00-04.pdf                             http://suche.transparenz.hamburg.de/dataset/vertrag-spielplatz-
                                                               voigtstrasse-ii?forceWeb=true
HH2b            Akte 000.00-04(1).pdf                          http://suche.transparenz.hamburg.de/dataset/vertrag-spielplatz-
                                                               voigtstrasse?forceWeb=true
HH3a            Akte FB63.51-06(1).pdf                         http://suche.transparenz.hamburg.de/dataset/bezirk-
                                                               eimsbuettel-vereinbarung-ueber-die-erstmalige-endgueltige-
                                                               herstellung-von-erschl-02-2014?forceWeb=true
HH3b            Akte FB63.51-06(3).pdf                         http://suche.transparenz.hamburg.de/dataset/bezirk-hamburg-
                                                               nord-vereinbarung-ueber-die-erstmalige-endgueltige-herstellung-
                                                               von-ersch-02-2014?forceWeb=true
HH4a            Akte G103-36.01 06-10-.pdf                     http://suche.transparenz.hamburg.de/dataset/aenderungsvertrag-
                                                               zum-vertrag-zwischen-der-freien-und-hansestadt-hamburg-fhh-
                                                               und-dem-ha-12-20161?forceWeb=true
HH4b            Akte G103-36.01 06-10-(1).pdf                  http://suche.transparenz.hamburg.de/dataset/aenderungsvertrag-
                                                               zum-vertrag-zwischen-der-freien-und-hansestadt-hamburg-fhh-
                                                               und-dem-hamburger-?forceWeb=true
HH5a            entwurf-eines-gesetzes-zu-dem-abkommen-zur- http://www.buergerschaft-hh.de/ParlDok/dokument/53849/entwurf-
                dritten-änderung-des-abkommens-über-das-     eines-gesetzes-zu-dem-abkommen-zur-dritten-%c3%a4nderung-
                deutsche-institut-für-bautechnik.pdf          des-abkommens-%c3%bcber-das-deutsche-institut-f%c3%bcr-
                                                               bautechnik.pdf
HH5b            entwurf-eines-gesetzes-zu-dem-abkommen-zur- http://www.buergerschaft-hh.de/ParlDok/dokument/37131/entwurf-
                zweiten-änderung-des-abkommens-über-das-     eines-gesetzes-zu-dem-abkommen-zur-zweiten-%c3%a4nderung-
                deutsche-institut-für-bautechnik-und-zum-     des-abkommens-%c3%bcber-das-deutsche-institut-f%c3%bcr-
                erlass-des-bauprodukte-mar.pdf                 bautechnik-und-zum-erlass-des-bauprodukte-mar.pdf

</pre>