=Paper=
{{Paper
|id=Vol-2482/paper31
|storemode=property
|title=Structural Analysis of Contract Renewals
|pdfUrl=https://ceur-ws.org/Vol-2482/paper31.pdf
|volume=Vol-2482
|authors=Frieda Josi,Christian Wartena
|dblpUrl=https://dblp.org/rec/conf/cikm/JosiW18
}}
==Structural Analysis of Contract Renewals==
Frieda Josi (frieda.josi@hs-hannover.de), Christian Wartena (christian.wartena@hs-hannover.de)

University of Applied Sciences and Arts Hanover, Expo Plaza 12, 30539 Hanover, Germany

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
In the present paper we sketch an automated procedure to compare different versions of a contract. The contract texts used for this purpose are PDF files with differing structure that are converted into structured XML files by identifying and classifying text boxes. A classifier trained on manually annotated contracts achieves an accuracy of 87% on this task. We align contract versions and classify aligned text fragments into different similarity classes that support the manual comparison of changes in document versions. The main challenges are dealing with OCR errors and with different layouts of identical or similar texts. We demonstrate the procedure using some freely available contracts from the City of Hamburg written in German. The methods, however, are language agnostic and can be applied to other contracts as well.

===1 Introduction===
Most contracts between insurance and reinsurance companies are updated annually. This results in many versions of a contract which are similar in structure and content, but which must be completely checked again for a new contract approval. A main obstacle to the efficient comparison of old and new versions of the contracts is the fact that the entire approval process is paper based. Insurance companies might send paper versions of the contracts to several reinsurance companies, each of which puts stamps and signatures on the contract.

Of course all contracts are scanned and stored electronically, but the paper version is in the lead. As intelligent support for the legal domain, we present an approach in which we convert contracts, based on PDF documents, into a structured XML format in order to efficiently find the changed, added or deleted clauses in the new contract version.

For all changed clauses we will predict the impact of the change, or at least determine whether the change is only a stylistic or linguistic improvement or correction or whether the interpretation of the clause is affected. Furthermore, for all changed and new clauses we will check whether the clause is part of a collection of standard clauses or was used in another contract before. In the present paper, we demonstrate a first version of the detection of changes in the contracts. Our procedure was developed and evaluated with German contract texts, but the method is language agnostic and can be applied to contracts in other languages as well.

For the development of the methods we got access to a collection of 100,000 contracts of an insurance company. Since the contracts cannot be made available publicly, we used a small set of freely available contracts for the present study.

Our approach basically consists of four steps: first we extract rectangular text areas from the PDF document. In the second step we classify all text areas into structural classes like header, footer, heading, etc. and merge some adjacent areas of the same type. On the basis of this structure two documents are aligned. Finally, the aligned text areas are compared in more detail. An overview of the process flow of our structure analysis of versions of legal texts is shown in Figure 1.

Figure 1: Procedure of structure analysis of versions of legal texts.

In the following we describe related work, give a detailed description of the approach and present an evaluation of the classifier trained for classification of the text areas.

===2 Related Work===
Gao et al. (2011) use a method similar to ours for PDF files to analyze the structure of books. After converting the PDF, the content is extracted into a physical and logical structure, and the text modules are parsed and displayed. However, since these are books, Gao et al. could assume that all pages have the same layout. This enabled the definition of global typographies. The authors divided the logical structure into a page level and a document level. The page level contains the hierarchical order of the text elements, the header, figures, tables and footnotes. The document level included the writers' chapter structure and metadata. For the extraction of the logical structure at page level, the texts and individual letters were extracted from these text blocks to obtain additional characteristics such as boldface for a heading. For example, the extraction of the logical structure at document level contained the title of the book.

For header and footer recognition we use a layout-based approach similar to that of Déjean and Meunier (2006). This approach is based on the use of geometric coordinates. In addition, they use the occurrence of digits as an indicator for a text element in the header or footer and the length of the text. With the coordinates of the text blocks in the PDF files a structural sorting per page is possible. The recognition and merging of contiguous text blocks from extracted PDF files is used, e.g., by Ramakrishnan et al. (2012).

There is some work dealing with extracting named entities (such as companies, persons, places, etc.) from legal texts or finding references to laws (Dozier et al., 2010; Schweighofer, 2010; Nanda et al., 2017). In (Nanda et al., 2017) the vocabulary IATE (Inter-Active Terminology for Europe) is used to create an annotated corpus of named entities and to use it for NER in European and British legal documents. Chalkidis et al. (2017) use a combination of state-of-the-art methods (such as word embeddings and part-of-speech tag embeddings) to extract typical contract elements from contract texts.

The conversion of content from the layout format of a PDF file to the structured format of an XML file with a small amount of human interaction is done as described by Paick and Zhang (2004). The similarity of the contract versions is compared with the text blocks of the XML output. The word overlap is used as a measure for the agreement between two text blocks of the contract changes. This approach is described by Klampfl et al. (2014).

===3 Legal text structure analysis===
This section describes our approach to analyzing the PDF structure and finding the differences between contract versions.

A simple line-by-line comparison of documents makes no sense, since the addition of a single word can already change the position of line or page breaks. Furthermore, contracts are usually highly structured texts with lists of definitions, figures, headers and footers on each page. Figure 2 gives an example page of one of the contracts we used. A simple extraction of all text will disturb the natural text flow and insert header and footer text at arbitrary points in the contract text. Thus, we prefer to extract blocks of text, align the blocks of two documents and compare the documents block by block.

Figure 2: Example page from the contract texts for the prediction model.

====3.1 Document collection====
For training a classifier we use 4 non-public insurance documents and 3 publicly available contracts. These contracts are part of the open data strategy of the City Administration Hamburg (Transparenzportal Hamburg: http://transparenz.hamburg.de/). These 7 PDF documents consist in total of 198 pages.

From these pages we extracted 4046 text boxes using PDFMiner (https://pypi.org/project/pdfminer/) and classified them by hand. Figure 3 shows an example.

Figure 3: Example of Manual Classification

The insurance contracts are written in English, the contracts from Hamburg in German. Since our approach is completely language agnostic, the documents can be mixed for training without any problem.

For the evaluation of the alignment and comparison of contract versions we used 5 documents from the City Administration Hamburg for which at least two versions are available. In the process, care was taken to ensure that there were different degrees of change. The selected contract versions were:
* HH1a/HH1b: version with additions
* HH2a/HH2b: very different (by many handwritten notes)
* HH3a/HH3b: very similar contracts with different contractual partners
* HH4a/HH4b: same contract, but scanned differently (at an angle)
* HH5a/HH5b: year variants

The exact names and URLs of all test documents used are given in the Appendix.

====3.2 Detection of structural elements====
The 4046 text boxes from the contract texts were classified with the following classes: header, heading, enumeration, body text and footer. The twenty features extracted or calculated for each text box are:
* the coordinates of the lower left and upper right corners of the text box (x1, y1, x2, y2)
* the free margin on each side (m1, m2, m3, m4)
* whether there is a neighboring text box on each side (nb1, nb2, nb3, nb4)
* the font styles bold and uppercase (bold, upper)
* enumeration elements in the text box (enum)
* the size of the text box (area)
* height and width of the text box (height, width)
* number of letters in the text box (length)
* fraction of special characters in the text box (spec)

We obtain the coordinates of the text boxes, the font (bold and upper) and the text of each box from the parse of PDFMiner. The other features were calculated based on this information. The feature "enumeration" indicates whether the text of the box matches the following regular expression (in Perl syntax):

<code><nowiki>\(?([0-9]+|[A-Za-z])(\.([0-9]+|[A-Za-z]))*\)?$</nowiki></code>

The distances to the adjacent elements were calculated from the horizontal and vertical overlap of their coordinates and their distances to the right and left element. The distance to the margins and the size of the text field were also computed. The features for bold and for uppercase indicate whether all characters in a text box are typeset in the respective way. Since headings are often written in this way, we expect this to be a useful feature. From the text we also calculate the fraction of special (non-alphanumeric) characters. Finally, we have calculated the size, width and height of the individual text fields.

An SVM (Support Vector Machine) classifier with RBF kernel was trained with this data set. The parameters used for this are <math>\gamma = 0.1 \cdot 10^{-5}</math> and penalty parameter <math>C = 10</math>. In addition we have calculated a logistic regression model. The performance results of SVM and logistic regression were almost identical. The results of the logistic regression are shown in the "Evaluation and Results" section.

====3.3 Alignment====
For layout-based structure analysis, we have sorted the text elements on each page from top to bottom, and from left to right if the elements are placed next to each other. Adjacent elements that have the same class and are separated by a margin smaller than the height of a text line are merged. Thus, we correct a number of anomalies introduced by the detection of text areas. E.g., in many cases the last line of a paragraph is detected as a separate area if it has only one or two words.

For the alignment of the text boxes we consider insertions, deletions and substitutions. For insertions and deletions we assign a penalty of 1. The penalty for a substitution of text <math>t_1</math> with <math>t_2</math> is defined as

:<math>D(t_1, t_2) = 1 - \frac{|v(t_1) \cap v(t_2)|}{|v(t_1) \cup v(t_2)|}</math>

where <math>v(t)</math> denotes the set of words, excluding stop words, of <math>t</math>. Using dynamic programming we find the alignment with the minimum sum of penalties.

For the 10 test documents we find on average 24 text blocks per page after merging adjacent blocks.

====3.4 Version Comparison====
Once two texts are aligned, we can start comparing the documents. At the moment we do not analyze insertions and deletions. With a simple heuristic we try to classify pairs of aligned text fragments. We distinguish between:
* Identical: Texts are identical up to white spaces
* OCR Errors: Texts are identical, but there are differences due to OCR errors
* Small Differences: At most 5 words inserted, deleted or substituted
* Different: More than 5 words are changed

To decide whether there are real differences or OCR differences we align the texts two times. First we tokenize the text and compute the edit distance based on words (i.e. the minimum number of words that have to be inserted, deleted or changed to obtain the new version from the old one). Then we compute the character-based edit distance. If the character-based edit distance is at most 2.5 times larger than the word-based edit distance, all changes in the words are just small changes, replacing 2 or 3 characters. In this case we assume that all changes are due to OCR errors. However, we have not yet determined an optimal value for this threshold.

===4 Evaluation and Results===
====4.1 Classifier====
Using 10-fold cross validation the accuracy of the classifier (logistic regression) is 87%. The accuracy of the majority classifier, which assigns each element to the class body text, is 52%.

{| class="wikitable"
|+ Table 1: Confusion matrix from logistic regression (rows: real class, columns: predicted class)
! Real \ Pred. !! Header !! Heading !! Enum. !! Text !! Footer
|-
! Header
| 701 || 4 || 3 || 8 || 0
|-
! Heading
| 12 || 359 || 8 || 217 || 1
|-
! Enum.
| 5 || 13 || 429 || 39 || 0
|-
! Text
| 7 || 161 || 40 || 1891 || 5
|-
! Footer
| 0 || 0 || 3 || 7 || 133
|}

{| class="wikitable"
|+ Table 2: Per class results from logistic regression
! Class !! Precision !! Recall !! f1-score
|-
| Header || 0.97 || 0.98 || 0.97
|-
| Heading || 0.67 || 0.60 || 0.63
|-
| Enum || 0.89 || 0.88 || 0.89
|-
| Text || 0.87 || 0.90 || 0.89
|-
| Footer || 0.96 || 0.93 || 0.94
|-
| Overall || 0.87 || 0.87 || 0.87
|}

As we can see from the confusion matrix (Table 1) and the per class results (Table 2), the best results are achieved for the most important classes: the header and footer. These classes contain text that is not part of the contract text and has to be separated clearly. Most problems arise from confusion between headings and body text.

The contribution of each feature to the logistic regression model is given in Figure 4. The boolean value for an enumeration, the features indicating whether there is a text element above and below (nb1+nb2) and the fraction of special characters in a text element (spec) are used most strongly. Interestingly, the position on the page and the margins around a text box are hardly used.

Figure 4: Relative Feature Importance

====4.2 Comparison====
We use the logical structure of the contracts (heading, enumeration, body text) converted into an XML format for the comparison of contract renewals. The results of the comparison for the test data can be seen in Table 3. The extracted text boxes are compared as described in section 3.4. As we can see here, for the text pair HH3a/HH3b, e.g., our method found 186 identical text boxes with a text length (measured in characters) of 30% of the contract. These two contracts have a very similar structure but different contractual partners. This means that the underwriters no longer have to check these passages in the text of the contract for consistency, thus making their work more efficient.

{| class="wikitable"
|+ Table 3: Evaluation version comparison
! !! HH1a/HH1b !! HH2a/HH2b !! HH3a/HH3b !! HH4a/HH4b !! HH5a/HH5b
|-
| Inserted || 5 || 50 || 79 || 8 || 20
|-
| Different || 4 || 65 || 228 || 43 || 104
|-
| Identical || 15 || 75 || 186 || 58 || 24
|-
| OCR Diff. || 3 || 39 || 43 || 49 || 8
|-
| Deleted || 1 || 16 || 25 || 14 || 41
|-
| Total text boxes || 28 || 245 || 561 || 172 || 197
|-
| Fraction identical || 0.26 || 0.16 || 0.31 || 0.17 || 0.012
|}

As we can see in Table 3, there are many text boxes that have received the comparison class "Different". Again, these are often OCR errors, but they are too numerous to be classified as "OCR errors" (see the first example in Table 4, class "Different"). The second example in the class "Different" shows that errors in segmentation and hierarchical sorting also lead to the classification "Different". Another problem is that the text boxes recognized by PDFMiner are not always the same in the two versions and merging does not entirely compensate for this, e.g. because one of the elements was classified incorrectly.

{| class="wikitable"
|+ Table 4: Examples version comparison for HH2a vs. HH2b. Differences are marked in the text.
! Text 1 !! Text 2 !! Class
|-
| für die Leistungen nach 3.2 die Kostenschätzung.
| für die Leistungen nach 3.2 die Kostenschätzung.
| Identical
|-
| 6.1.3 (1) Grundlagenermittlung
| 6.1. 3 (1) GrundlageAermittlung
| OCR Errors
|-
| · Fertigstellung der leistungen dieses Vertrages bis Ende Juli 2012
| Fertigstellung der Leistungen dieses Vertrages bis Ende Oktober 2013
| Small Diff.
|-
| 2.4 ··i Die Baumaßnahme untCFliegt dem ZustimFF1uF1gS’1cffelhreF1 Flach § 84 HBauO. Die für die veranhvortliene Leitung zuständige Person wird der bzw. dem AN sehriftlieh be nannt.
| 2.4 ¿ Die Baumaßnahme uAterliegt dem Zustimmungsvcrfahren nach § 64 HBauO. Die für eli9e verantwoftliehe Leitung zuständige Person wird der bzw. dem AN schriftlich be ft8flflt:
| Different
|-
| § 8 - Ergänzende Vereinbarungen
| und anderen fachlich Beteiligten
| Different
|}

===5 Discussion and Future Work===
In this paper we have shown that modifications in contract renewals can be identified and analyzed using supervised learning and text alignment.

We want to continue this approach in further work and improve the classification of the classes heading, body text and enumeration. In addition, we want to implement the recognition of named entities, as described e.g. in (Nanda et al., 2017). Furthermore, the text structure can be subdivided in more detail and further structural elements such as text boxes containing handwritten notes can be included. We will improve our approach by carrying out further tests with a larger training corpus, tuning further parameter settings and adding additional features such as font size. During the course of the project, the existing XML structure will also be transformed into a standardized legal XML structure, as proposed by the "OASIS LegalXML Electronic Court Filing TC" (OASIS LegalXML: https://www.oasis-open.org/committees/tc_home.php?wg_abbrev=legalxml-courtfiling). On this basis we plan the clause analysis in the contract texts. The recognized clauses will be checked against a collection of model clauses, and the occurrence of the same or almost the same clause in other contracts will be checked. We plan to visualize the status of each clause, like unchanged, found in another contract, etc.

With the visualization of the changes in the contract renewals, a tool can then be implemented that provides valuable support for underwriters and other legal entities and simplifies and improves their daily work in the long term.

===Acknowledgements===
The authors would like to thank Fabian Schmieder for many helpful discussions and for pointing us to the publicly available contracts of the City of Hamburg.

===References===
* Chalkidis, I., I. Androutsopoulos, and A. Michos (2017). Extracting contract elements.
* Déjean, H. and J.-L. Meunier (2006). A system for converting PDF documents into structured XML format. In Document Analysis Systems VII, Lecture Notes in Computer Science, pp. 129–140. Springer, Berlin, Heidelberg.
* Dozier, C., R. Kondadadi, M. Light, A. Vachher, S. Veeramachaneni, and R. Wudali (2010). Named entity recognition and resolution in legal text. In Semantic Processing of Legal Texts, Lecture Notes in Computer Science, pp. 27–43. Springer, Berlin, Heidelberg.
* Gao, L., Z. Tang, X. Lin, Y. Liu, R. Qiu, and Y. Wang (2011). Structure extraction from PDF-based book documents. In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL ’11, pp. 11–20. ACM.
* Klampfl, S., M. Granitzer, K. Jack, and R. Kern (2014). Unsupervised document structure analysis of digital scientific articles.
* Nanda, R., G. Siragusa, L. Di Caro, M. Theobald, G. Boella, L. Robaldo, and F. Costamagna (2017). Concept recognition in european and national law. In A. Z. Wyner and G. Casini (Eds.), Legal Knowledge and Information Systems - JURIX 2017: The Thirtieth Annual Conference, Luxembourg, 13-15 December 2017, Frontiers in Artificial Intelligence and Applications, pp. 193. IOS Press.
* Paick, Y. Y. K. and Y. P. Y. Zhang (2004). PDF2xml: Converting PDF to XML.
* Ramakrishnan, C., A. Patnia, E. Hovy, and G. A. Burns (2012). Layout-aware text extraction from full-text PDF of scientific articles.
* Schweighofer, E. (2010). Semantic indexing of legal documents. In Semantic Processing of Legal Texts, Lecture Notes in Computer Science, pp. 157–169. Springer, Berlin, Heidelberg.
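The enumeration feature of Section 3.2 can be illustrated with a short sketch. It is not the authors' implementation: the function name `enum_feature` is hypothetical, the pattern is transcribed from the text using ASCII characters, and it is assumed here that the whole (stripped) text of a box has to match, which the paper does not state explicitly.

```python
import re

# Pattern from Section 3.2, written with ASCII '-' and '*'.
ENUM_PATTERN = re.compile(r"\(?([0-9]+|[A-Za-z])(\.([0-9]+|[A-Za-z]))*\)?$")


def enum_feature(text: str) -> bool:
    """Return True if the whole text box looks like an enumeration label.

    Assumption: the complete stripped text must match (re.fullmatch);
    the paper only says that the text "matches" the expression.
    """
    return ENUM_PATTERN.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    for sample in ["3.1.4", "(a)", "12", "Vertrag", "6.1.3 (1) Grundlagenermittlung"]:
        print(repr(sample), enum_feature(sample))
```

Under this reading, labels such as `3.1.4` or `(a)` set the feature, while a box that contains a label followed by running text does not.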
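The alignment of Section 3.3 can be sketched as a standard dynamic program over two lists of text blocks. The sketch below rests on several assumptions that are not in the paper: words are obtained by lowercasing and splitting on white space, the stop word list is a tiny illustrative one, and the names `vocab`, `substitution_penalty` and `align` are hypothetical.

```python
from typing import List, Optional, Tuple

# Illustrative stop word list only; the paper does not specify the list used.
STOP_WORDS = frozenset({"der", "die", "das", "des", "und", "the", "of", "and"})


def vocab(text: str) -> set:
    """Set of lowercased words of a text block, excluding stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def substitution_penalty(t1: str, t2: str) -> float:
    """D(t1, t2) = 1 - |v(t1) & v(t2)| / |v(t1) | v(t2)| (Jaccard distance)."""
    v1, v2 = vocab(t1), vocab(t2)
    union = v1 | v2
    if not union:  # both blocks contain only stop words (assumed identical)
        return 0.0
    return 1.0 - len(v1 & v2) / len(union)


def align(old: List[str], new: List[str]) -> Tuple[float, List[Tuple[Optional[int], Optional[int]]]]:
    """Return the minimum total penalty and the aligned index pairs.

    A pair (i, None) is a deletion, (None, j) an insertion, (i, j) a substitution.
    """
    n, m = len(old), len(new)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = float(i)  # delete all remaining old blocks
    for j in range(1, m + 1):
        cost[0][j] = float(j)  # insert all remaining new blocks
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1.0,
                cost[i][j - 1] + 1.0,
                cost[i - 1][j - 1] + substitution_penalty(old[i - 1], new[j - 1]),
            )
    # Trace back one optimal path from the bottom-right corner.
    pairs: List[Tuple[Optional[int], Optional[int]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + substitution_penalty(old[i - 1], new[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1.0:
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs


if __name__ == "__main__":
    old = ["§ 1 Gegenstand des Vertrages", "Leistungen des Auftragnehmers", "Schlussbestimmungen"]
    new = ["§ 1 Gegenstand des Vertrages", "Haftung und Gewährleistung",
           "Leistungen des Auftragnehmers", "Schlussbestimmungen"]
    print(align(old, new))
```

Because a substitution costs at most 1 while a deletion plus an insertion costs 2, two completely different blocks at the same position are paired as a substitution rather than reported as one deleted and one inserted block.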
===Appendix: Used Contracts===
{| class="wikitable"
|+ Training Documents
! Reference !! File name !! URL
|-
| HHTrain1 || Akte 611.10-13(1).pdf || http://suche.transparenz.hamburg.de/dataset/oeffentlich-rechtlicher-vertrag-gehrecht-bebauungsplan-harburg-59-theodor-york-strasse?forceWeb=true
|-
| HHTrain2 || Akte FB2a.809.13-25 4(1).pdf || http://suche.transparenz.hamburg.de/dataset/aenderungsverfahren-fuer-vertrag-6328-zuvex-weitere-schritte-zur-anbindung-externer-nutzer?forceWeb=true
|-
| HHTrain3 || Akte FB2a.800.01-2 3(1).pdf || http://suche.transparenz.hamburg.de/dataset/v6921-unterstuetzungsleistung-mobility-vertrag?forceWeb=true
|-
| Train data1-4 || colspan="2" | 4 non public reinsurance contracts
|}

{| class="wikitable"
|+ Test Documents
! Reference !! File name !! URL
|-
| HH1a || Aenderungsbescheid.pdf || http://suche.transparenz.hamburg.de/dataset/3-planen-zur-temporaeren-anbringung-an-einem-baugeruest-zur-bewerbung-von-mietwohnungen?forceWeb=true
|-
| HH1b || Befristete Genehmigung nach HBauO.pdf || http://suche.transparenz.hamburg.de/dataset/3-planen-zur-temporaeren-anbringung-an-einem-baugeruest-zur-bewerbung-von-mietwohnungen1?forceWeb=true
|-
| HH2a || Akte 000.00-04.pdf || http://suche.transparenz.hamburg.de/dataset/vertrag-spielplatz-voigtstrasse-ii?forceWeb=true
|-
| HH2b || Akte 000.00-04(1).pdf || http://suche.transparenz.hamburg.de/dataset/vertrag-spielplatz-voigtstrasse?forceWeb=true
|-
| HH3a || Akte FB63.51-06(1).pdf || http://suche.transparenz.hamburg.de/dataset/bezirk-eimsbuettel-vereinbarung-ueber-die-erstmalige-endgueltige-herstellung-von-erschl-02-2014?forceWeb=true
|-
| HH3b || Akte FB63.51-06(3).pdf || http://suche.transparenz.hamburg.de/dataset/bezirk-hamburg-nord-vereinbarung-ueber-die-erstmalige-endgueltige-herstellung-von-ersch-02-2014?forceWeb=true
|-
| HH4a || Akte G103-36.01 06-10-.pdf || http://suche.transparenz.hamburg.de/dataset/aenderungsvertrag-zum-vertrag-zwischen-der-freien-und-hansestadt-hamburg-fhh-und-dem-ha-12-20161?forceWeb=true
|-
| HH4b || Akte G103-36.01 06-10-(1).pdf || http://suche.transparenz.hamburg.de/dataset/aenderungsvertrag-zum-vertrag-zwischen-der-freien-und-hansestadt-hamburg-fhh-und-dem-hamburger-?forceWeb=true
|-
| HH5a || entwurf-eines-gesetzes-zu-dem-abkommen-zur-dritten-änderung-des-abkommens-über-das-deutsche-institut-für-bautechnik.pdf || http://www.buergerschaft-hh.de/ParlDok/dokument/53849/entwurf-eines-gesetzes-zu-dem-abkommen-zur-dritten-%c3%a4nderung-des-abkommens-%c3%bcber-das-deutsche-institut-f%c3%bcr-bautechnik.pdf
|-
| HH5b || entwurf-eines-gesetzes-zu-dem-abkommen-zur-zweiten-änderung-des-abkommens-über-das-deutsche-institut-für-bautechnik-und-zum-erlass-des-bauprodukte-mar.pdf || http://www.buergerschaft-hh.de/ParlDok/dokument/37131/entwurf-eines-gesetzes-zu-dem-abkommen-zur-zweiten-%c3%a4nderung-des-abkommens-%c3%bcber-das-deutsche-institut-f%c3%bcr-bautechnik-und-zum-erlass-des-bauprodukte-mar.pdf
|}
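The heuristic of Section 3.4 for separating OCR errors from real changes can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: tokenization by white space, the order in which the classes are tested, and the names `edit_distance` and `classify_pair` are assumptions; the ratio 2.5 and the limit of 5 words are taken from the text.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or keep
        prev = cur
    return prev[-1]


def classify_pair(t1: str, t2: str, ocr_ratio: float = 2.5, small_limit: int = 5) -> str:
    """Classify an aligned pair of text blocks into one of the four classes."""
    w1, w2 = t1.split(), t2.split()
    if w1 == w2:  # identical up to white space
        return "Identical"
    word_dist = edit_distance(w1, w2)
    char_dist = edit_distance(" ".join(w1), " ".join(w2))
    # Assumption: the OCR test is applied before the word-count test.
    if char_dist <= ocr_ratio * word_dist:
        return "OCR Errors"
    if word_dist <= small_limit:
        return "Small Differences"
    return "Different"


if __name__ == "__main__":
    print(classify_pair("Grundlagenermittlung", "GrundlageAermittlung"))
    print(classify_pair("bis Ende Juli", "bis Ende Oktober"))
```

A single misread character changes one word and one character, so the ratio stays well below 2.5, whereas replacing a whole word by an unrelated one raises the character distance far more than the word distance.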
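The figures reported in Section 4.1 can be re-derived from the confusion matrix in Table 1. The following sketch only recomputes accuracy, the majority baseline, precision and recall from the published counts; the function names `accuracy` and `per_class_metrics` are hypothetical.

```python
LABELS = ["Header", "Heading", "Enum.", "Text", "Footer"]

# Table 1: rows are the real classes, columns the predicted classes.
CONFUSION = [
    [701, 4, 3, 8, 0],
    [12, 359, 8, 217, 1],
    [5, 13, 429, 39, 0],
    [7, 161, 40, 1891, 5],
    [0, 0, 3, 7, 133],
]


def accuracy(matrix):
    """Fraction of text boxes on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / total


def per_class_metrics(matrix, labels):
    """Map each label to a (precision, recall) pair."""
    n = len(labels)
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]  # predicted counts
    row_sums = [sum(row) for row in matrix]                             # real counts
    return {labels[k]: (matrix[k][k] / col_sums[k], matrix[k][k] / row_sums[k])
            for k in range(n)}


if __name__ == "__main__":
    total = sum(sum(row) for row in CONFUSION)
    print("boxes:", total)
    print("accuracy:", round(accuracy(CONFUSION), 2))
    print("majority baseline:", round(sum(CONFUSION[3]) / total, 2))
    for label, (p, r) in per_class_metrics(CONFUSION, LABELS).items():
        print(label, round(p, 2), round(r, 2))
```

The counts add up to the 4046 annotated text boxes, and the rounded values agree with the 87% accuracy, the 52% majority baseline and the precision and recall columns of Table 2.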