<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Structural Analysis of Contract Renewals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Applied Sciences and Arts Hanover</institution>
          ,
          <addr-line>Expo Plaza 12, 30539 Hanover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>06</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>In the present paper we sketch an automated procedure to compare different versions of a contract. The contract texts used for this purpose are structurally differently composed PDF files that are converted into structured XML files by identifying and classifying text boxes. A classifier trained on manually annotated contracts achieves an accuracy of 87% on this task. We align contract versions and classify aligned text fragments into different similarity classes that enhance the manual comparison of changes in document versions. The main challenges are to deal with OCR errors and different layouts of identical or similar texts. We demonstrate the procedure using some freely available contracts from the City of Hamburg written in German. The methods, however, are language agnostic and can be applied to other contracts as well.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Most contracts between insurance and reinsurance
companies are updated annually. This results in many
versions of a contract which are structurally and
contentwise similar, but which must be completely checked
again for a new contract approval. A main obstacle
to efficient comparison of old and new versions of the
contracts is the fact that the entire approval process is
paper based. Insurance companies might send paper
versions of the contracts to several reinsurance
companies, each of which puts stamps and signatures on the
contract.</p>
      <p>Copyright © CIKM 2018 for the individual papers by the papers'
authors. Copyright © CIKM 2018 for the volume as a collection
by its editors.</p>
      <p>Of course all contracts are scanned and stored
electronically, but the paper version is in the lead. As
intelligent support for the legal domain, we present an
approach in which we convert contracts, based on PDF
documents, into a structured XML format in order to
efficiently find the changed, added or deleted clauses
in the new contract version.</p>
      <p>For all changed clauses we will predict the impact of
the change, or at least determine whether the change is
only a stylistic or linguistic improvement or correction,
or whether the interpretation of the clause is affected.
Furthermore, for all changed and new clauses we will
check whether the clause is part of a collection of
standard clauses or was used in another contract before. In
the present paper, we demonstrate a first version of the
detection of changes in the contracts. Our procedure
was developed and evaluated with German contract
texts, but the method is language agnostic and can be
applied to contracts in other languages as well.</p>
      <p>For the development of the methods we got access
to a collection of 100,000 contracts of an insurance
company. Since the contracts cannot be made available
publicly, we used a small set of freely available contracts
for the present study.</p>
      <p>Our approach basically consists of four steps: first
we extract rectangular text areas from the PDF
document. In the second step we classify all text areas
into structural classes like header, footer, heading, etc.
and merge some adjacent areas of the same type. On
the basis of this structure, two documents are aligned.
Finally, the aligned text areas are compared in more
detail. An overview of the process flow of our structure
analysis of versions of legal texts is shown in Figure
1.</p>
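      <p>The four steps above can be sketched as a minimal pipeline skeleton. This is an illustrative sketch only: the function bodies are placeholders, not the implementation described in the paper.</p>

```python
# Sketch of the four-step pipeline; the helper internals are simplified
# placeholders standing in for the real extraction, classification,
# alignment and comparison steps.

def extract_text_areas(pdf_pages):
    """Step 1: collect the rectangular text areas of all pages."""
    return [box for page in pdf_pages for box in page]

def classify_and_merge(boxes):
    """Step 2: assign structural classes (header, footer, heading, body, ...)
    and merge adjacent boxes of the same class. Placeholder: everything
    becomes 'body'."""
    return [(box, "body") for box in boxes]

def align_documents(doc_a, doc_b):
    """Step 3: align the two block sequences (insert/delete/substitute).
    Placeholder: pair blocks positionally."""
    return list(zip(doc_a, doc_b))

def compare_blocks(aligned):
    """Step 4: compare aligned blocks in more detail."""
    return ["identical" if a == b else "different" for a, b in aligned]

def compare_versions(pdf_a, pdf_b):
    blocks_a = classify_and_merge(extract_text_areas(pdf_a))
    blocks_b = classify_and_merge(extract_text_areas(pdf_b))
    return compare_blocks(align_documents(blocks_a, blocks_b))
```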
      <p>In the following we describe related work, give a detailed
description of the approach, and present an evaluation of the
classifier trained for the classification of the text areas.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Gao et al.
        <xref ref-type="bibr" rid="ref4">(Gao et al., 2011)</xref>
        use a method similar to
ours for PDF files to analyze the structure of books.
After converting the PDF, the content is extracted into
a physical and logical structure, and the text modules are
parsed and displayed. However, since these are books,
Gao et al. could assume that all pages have the
same layout. This enabled the definition of global
typographies. The authors divided the logical structure
into a page level and a document level. The page level
contains the hierarchical order of the text elements, the
header, figures, tables and footnotes. The document
level includes the writers' chapter structure and
metadata. For the extraction of the logical structure at page
level, the texts and individual letters were extracted
from these text blocks to obtain additional
characteristics such as boldface for a heading. The
extraction of the logical structure at document level
contained, for example, the title of the book. For header and footer
recognition we use a layout-based approach similar to
that of
        <xref ref-type="bibr" rid="ref2">Dejean and Meunier (2006)</xref>
        .
      </p>
      <p>
        This approach is based on the use of geometric
coordinates. In addition, they use the occurrence of digits
as an indicator for a text element in the header or footer,
and the length of the text. With the coordinates of the
text blocks in the PDF files a structural sorting per
page is possible. The recognition and merging of
contiguous text blocks from extracted PDF files is used, e.g.,
by
        <xref ref-type="bibr" rid="ref8">Ramakrishnan et al. (2012)</xref>
        . There is some work
dealing with extracting named entities (such as
companies, persons, places, etc.) from legal texts or finding
references to laws
        <xref ref-type="bibr" rid="ref3 ref6 ref9">(Dozier et al., 2010; Schweighofer,
2010; Nanda et al., 2017)</xref>
        . In
        <xref ref-type="bibr" rid="ref6">(Nanda et al., 2017)</xref>
        the
vocabulary IATE (Inter-Active Terminology for
Europe) is used to create an annotated corpus of named
entities and to use it for the NER for European and
British legal documents.
        <xref ref-type="bibr" rid="ref1">Chalkidis et al. (2017)</xref>
        use a
combination of state-of-the-art methods (such as word
embeddings and part-of-speech tag embeddings) to
extract typical contract elements from contract texts.
The conversion of content from the layout format of
a PDF file to the structured format of an XML file
with a small amount of human interaction is done as
described by
        <xref ref-type="bibr" rid="ref7">Paick and Zhang (2004)</xref>
        . The similarity of
the contract versions is compared via the text blocks
of the XML output. The word overlap is used as a
measure of the agreement between two text blocks of
the contract versions. This approach is described by
        <xref ref-type="bibr" rid="ref5">Klampfl et al. (2014)</xref>
        .
      </p>
    </sec>
    <sec id="sec-3">
      <title>Legal text structure analysis</title>
      <p>This section describes our approach to analyzing the PDF
structure and finding the differences between contract
versions.</p>
      <p>A simple line by line comparison of documents makes
no sense, since the addition of a single word can already
change the position of line or page breaks. Furthermore,
contracts are usually highly structured texts with lists
of definitions, figures, headers and footers on each page.
Figure 2 gives an example page of one of the contracts
we used. A simple extraction of all text would disturb
the natural text flow and insert header and footer text
at arbitrary points in the contract text. Thus, we
prefer to extract blocks of text, align the blocks of two
documents and compare the documents block by block.</p>
      <sec id="sec-3-1">
        <title>Document collection</title>
        <p>For training a classifier we use 4 non-public insurance
documents and 3 publicly available contracts. These
contracts are part of the open data strategy of the City
Administration Hamburg (Transparenzportal Hamburg:
http://transparenz.hamburg.de/). These 7 PDF documents
consist in total of 198 pages.</p>
        <p>From these pages we extracted 4046 text boxes using
PDFMiner (https://pypi.org/project/pdfminer/) and
classified them by hand. Figure 3 shows an example.</p>
        <p>The insurance contracts are written in English, the
contracts from Hamburg in German. Since our
approach is completely language agnostic, the documents
can be mixed for training without any problem.</p>
        <p>For the evaluation of the alignment and comparison
of contract versions we used 5 documents from the
City Administration Hamburg for which at least two
versions are available. Care was taken to ensure that
the pairs exhibit different degrees of change.
The selected contract versions were:</p>
        <list list-type="bullet">
          <list-item><p>HH1a/HH1b: version with additions</p></list-item>
          <list-item><p>HH2a/HH2b: very different (many handwritten notes)</p></list-item>
          <list-item><p>HH3a/HH3b: very similar contracts with different contractual partners</p></list-item>
          <list-item><p>HH4a/HH4b: the same contract, scanned at different angles</p></list-item>
          <list-item><p>HH5a/HH5b: year variants</p></list-item>
        </list>
        <p>The exact names and URLs of all test documents
used are given in the Appendix.</p>
        <sec id="sec-3-1-1">
          <title>Features and classification</title>
          <p>We obtain the coordinates of the text boxes, the font
information (bold and upper case) and the text of each box from the
parse of PDFMiner. The other features were calculated
based on this information. The feature "enumeration"
indicates whether the text of the box matches the
following regular expression (in Perl syntax):
\(?([0-9]+|[A-Za-z])(\.([0-9]+|[A-Za-z]))*\)?$</p>
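          <p>The enumeration test can be sketched as follows. Note that the pattern is reconstructed from a garbled print rendering, so the exact original expression may differ in detail.</p>

```python
import re

# Matches enumeration labels such as "1", "(2)", "a)" or "1.2" in a text box.
# Reconstructed pattern; the original Perl expression may differ slightly.
ENUM_RE = re.compile(r'\(?([0-9]+|[A-Za-z])(\.([0-9]+|[A-Za-z]))*\)?$')

def is_enumeration(text):
    """True if the (stripped) box text looks like an enumeration label."""
    return ENUM_RE.match(text.strip()) is not None
```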
          <p>The distances to the adjacent elements were
calculated from the horizontal and vertical overlap of
their coordinates and their distances to the right and
left elements. The distance to the margins and the size
of the text field were also computed. The features for
bold and for upper case indicate whether all characters
in a text box are typeset in the respective way. Since
headings are often written in this way, we expect these
to be useful features. From the text we also calculate
the fraction of special (non-alphanumeric) characters.
Finally, we calculated the size, width and height
of the individual text fields.</p>
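          <p>The geometric and textual features described above might be computed along the following lines. The box representation and feature names here are illustrative assumptions, not the authors' exact feature set.</p>

```python
def box_features(box, page_width, page_height):
    """Compute layout features for one text box.

    `box` is assumed to be a dict carrying the PDFMiner bounding box
    (x0, y0, x1, y1), the box text, and an optional 'bold' flag; the
    feature names are illustrative."""
    x0, y0, x1, y1 = box["x0"], box["y0"], box["x1"], box["y1"]
    stripped = box["text"].replace(" ", "")
    n = max(len(stripped), 1)
    special = sum(1 for c in stripped if not c.isalnum())
    return {
        "width": x1 - x0,
        "height": y1 - y0,
        "size": (x1 - x0) * (y1 - y0),
        "left_margin": x0,                    # distance to the left page edge
        "right_margin": page_width - x1,      # distance to the right page edge
        "bottom_margin": y0,                  # PDF y-coordinates grow upward
        "top_margin": page_height - y1,
        "bold": box.get("bold", False),
        "upper": stripped.isupper(),          # all cased characters upper case
        "spec": special / n,                  # fraction of special characters
    }
```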
          <p>An SVM (Support Vector Machine) classifier with
RBF kernel was trained on this data set. The
parameters used are γ = 0.1 · 10^-5 and penalty
parameter C = 10. In addition we have calculated a
logistic regression model. The performance results of
SVM and logistic regression were almost identical. The
forecast values of the logistic regression are shown in
the "Evaluation and Results" section.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Alignment</title>
        <p>For the layout-based structure analysis, we have sorted
the text elements on each page from top to bottom,
and from left to right if the elements are placed next
to each other. Adjacent elements that have the same
class and whose areas are separated by a margin
smaller than the height of a text line are merged. Thus,
we correct a number of anomalies introduced by the
detection of text areas. E.g., in many cases the last
line of a paragraph is detected as a separate area if it
has only one or two words.</p>
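        <p>The merging rule can be sketched as follows. The box representation (vertical extent in PDF coordinates, class label, text, sorted top to bottom) is an assumption for illustration.</p>

```python
def merge_boxes(boxes, line_height):
    """Merge vertically adjacent boxes of the same class when the gap
    between them is smaller than one text line.

    Boxes are dicts with 'y_top', 'y_bottom', 'cls' and 'text', already
    sorted top-to-bottom; PDF y-coordinates grow upward, so the box
    higher on the page has the larger y values."""
    merged = []
    for box in boxes:
        if merged:
            prev = merged[-1]
            gap = prev["y_bottom"] - box["y_top"]
            if prev["cls"] == box["cls"] and 0 <= gap < line_height:
                # Same class and close enough: extend the previous box.
                prev["text"] += " " + box["text"]
                prev["y_bottom"] = box["y_bottom"]
                continue
        merged.append(dict(box))
    return merged
```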
        <p>For the alignment of the text boxes we consider
insertions, deletions and substitutions. For insertions
and deletions we assign a penalty of 1. The penalty for
a substitution of text t1 with t2 is defined as
D(t1, t2) = 1 − |v(t1) ∩ v(t2)| / |v(t1) ∪ v(t2)|,
where v(t) denotes the set of words of t, excluding stop
words. Using dynamic programming we find the
alignment with the minimum sum of penalties.</p>
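        <p>A minimal sketch of this alignment, with the word-overlap (Jaccard) substitution penalty and unit insert/delete costs. The dynamic program below only returns the minimum total penalty; recovering the actual alignment path would require the usual backtracking step.</p>

```python
def jaccard_penalty(t1, t2, stopwords=frozenset()):
    """Substitution penalty D(t1, t2) = 1 - |v1 ∩ v2| / |v1 ∪ v2|,
    where v(t) is the stop-word-free word set of t."""
    v1 = set(t1.lower().split()) - stopwords
    v2 = set(t2.lower().split()) - stopwords
    if not v1 and not v2:
        return 0.0
    return 1.0 - len(v1 & v2) / len(v1 | v2)

def align(blocks_a, blocks_b):
    """Minimum total penalty for aligning two block sequences
    (Needleman-Wunsch-style dynamic program): insertions and
    deletions cost 1, substitutions cost jaccard_penalty."""
    n, m = len(blocks_a), len(blocks_b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete all blocks of A
    for j in range(1, m + 1):
        cost[0][j] = j  # insert all blocks of B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1,  # delete blocks_a[i-1]
                cost[i][j - 1] + 1,  # insert blocks_b[j-1]
                cost[i - 1][j - 1]
                + jaccard_penalty(blocks_a[i - 1], blocks_b[j - 1]),
            )
    return cost[n][m]
```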
        <p>For the 10 test documents we find on average 24
text blocks per page after merging adjacent blocks.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Version Comparison</title>
        <p>Once two texts are aligned, we can start comparing
the documents. At the moment we do not analyze
insertions and deletions. With a simple heuristic we
try to classify pairs of aligned text fragments. We
distinguish between:</p>
        <list list-type="bullet">
          <list-item><p>Identical: texts are identical up to white space</p></list-item>
          <list-item><p>OCR errors: texts are identical, but there are differences due to OCR errors</p></list-item>
          <list-item><p>Small differences: at most 5 words inserted, deleted or substituted</p></list-item>
          <list-item><p>Different: more than 5 words are changed</p></list-item>
        </list>
        <p>To decide whether there are real differences or OCR
differences, we align the texts twice. First we
tokenize the text and compute the edit distance based
on words (i.e. the minimum number of words that
have to be inserted, deleted or changed to obtain the
new version from the old one). Then we compute the
character based edit distance. If the character based
edit distance is at most 2.5 times larger than the word
based edit distance, all changes in the words are just
small changes, replacing 2 or 3 characters. In this case
we assume that all changes are due to OCR errors.
However, we did not (yet) determine an optimal value
for this threshold.</p>
        <sec id="sec-3-3-1">
          <title>Evaluation and Results</title>
          <p>Using 10-fold cross validation, the accuracy of the
classifier (logistic regression) is 87%. The accuracy of the
majority classifier, which assigns each element to the
class body text, is 52%.</p>
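          <p>One possible reading of the similarity-class heuristic is sketched below: Levenshtein distance on word and character level, with the (not yet optimized) ratio 2.5 deciding between OCR noise and substantive change. The function and threshold names are illustrative, not the paper's exact implementation.</p>

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (word lists or strings)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def classify_pair(old, new, ratio=2.5, max_words=5):
    """Heuristic similarity class for an aligned pair of text blocks.
    `ratio` is the word-vs-character edit-distance threshold; its optimal
    value was not determined in the paper."""
    if old.split() == new.split():
        return "identical"          # identical up to white space
    word_dist = edit_distance(old.split(), new.split())
    char_dist = edit_distance(old, new)
    if char_dist <= ratio * word_dist:
        return "OCR errors"         # changed words differ in few characters
    if word_dist <= max_words:
        return "small differences"
    return "different"
```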
          <p>As we can see from the confusion matrix (Table
1) and per class results (Table 2) the best results are
achieved for the most important classes: the header
and footer. These classes contain text that is not part
of the contract text and has to be separated clearly.
Most problems arise from confusion between headings
and body text.</p>
          <p>The contribution of each feature for the logistic
regression model is given in Figure 4. The boolean value
for an enumeration, the features indicating whether
there is a text element above and below (nb1+nb2) and
the fraction of special characters in a text element (spec)
are used most strongly. Interestingly, the position on
the page and the margins around a text box are hardly
used.</p>
          <p>We use the logical structure of the contracts (heading,
enumeration, body text) converted into an XML format
for the comparison of contract renewals. The results of
the comparison for the test data can be seen in Table
3. The extracted text boxes are compared as described
in section 3.4. As we can see here, for the text pair
HH3a/HH3b, e.g., our method found 186 identical text
boxes with a text length (measured in characters) of
30% of the contract. These two contracts have a
very similar structure but different contractual
partners. This means that the underwriters no longer
have to check these passages of the contract text
for consistency, thus making their work more efficient.</p>
          <p>As we can see in Table 3, there are many text
boxes that received the comparison degree "Different".
Again, these are often OCR errors, but they are
too numerous to be classified as "OCR errors" (see the
first example in Table 4, class "Different"). The second
example in the class "Different" shows that errors in
segmentation and hierarchical sorting also lead to the
classification "Different". Another problem is that the
text boxes recognized by PDFMiner are not always the
same in the two versions, and merging does not entirely
compensate for this, e.g. because one of the elements
was classified incorrectly.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Future Work</title>
      <p>In this paper we have shown that modifications in
contract renewals can be identified and analyzed using
supervised learning and text alignment.</p>
      <p>
        We want to continue this approach in further work
and improve the classification of the classes heading,
body text and enumeration. In addition, we want to
implement the recognition of named entities, as
described e.g. in
        <xref ref-type="bibr" rid="ref6">(Nanda et al., 2017)</xref>
        . Furthermore, the
text structure can be subdivided in more detail, and
further structural elements, such as text boxes containing
handwritten notes, can be included. We will improve
our approach by carrying out further tests with a larger
training corpus, tuning further parameter settings and
adding additional features such as font size. During
the course of the project, the existing XML structure
will also be transformed into a standardized legal XML
structure, as proposed by the "OASIS LegalXML
Electronic Court Filing TC". On this basis we plan the
clause analysis in the contract texts. The recognized
clauses will be checked against a collection of model
clauses, and the occurrence of the same or an almost identical
clause in other contracts will be checked. We plan to
visualize the status of each clause, like unchanged, found
in another contract, etc.
      </p>
      <p>With the visualization of the changes in the contract
renewals, a tool can then be implemented that provides
valuable support for underwriters and other legal
professionals and simplifies and improves
their daily work in the long term.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>The authors would like to thank Fabian Schmieder
for many helpful discussions and pointing us to the
publicly available contracts of the City of Hamburg.</p>
        <p>OASIS LegalXML: https://www.oasis-open.org/
committees/tc_home.php?wg_abbrev=legalxml-courtfiling</p>
      </sec>
      <sec id="sec-4-2">
        <title>Appendix: Test Documents</title>
        <p>Training documents: HHTrain1, HHTrain2 and HHTrain3
from the Transparenzportal Hamburg, plus 4 non-public
reinsurance contracts.</p>
        <p>Test document file names (references HH1a-HH5b):
Aenderungsbescheid.pdf,
Befristete Genehmigung nach HBauO.pdf,
Akte 000.00-04.pdf,
Akte 000.00-04(1).pdf,
Akte FB63.51-06(1).pdf,
Akte FB63.51-06(3).pdf,
Akte 611.10-13(1).pdf,
Akte FB2a.809.13-254(1).pdf,
Akte FB2a.800.01-23(1).pdf</p>
        <p>Training document URLs:
http://suche.transparenz.hamburg.de/dataset/oeffentlichrechtlicher-vertrag-gehrecht-bebauungsplan-harburg-59-theodoryork-strasse?forceWeb=true
http://suche.transparenz.hamburg.de/dataset/aenderungsverfahrenfuer-vertrag-6328-zuvex-weitere-schritte-zur-anbindung-externernutzer?forceWeb=true
http://suche.transparenz.hamburg.de/dataset/v6921-unterstuetzungsleistung-mobility-vertrag?forceWeb=true</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Chalkidis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Michos</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Extracting contract elements</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Dejean</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Meunier</surname>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>A system for converting PDF documents into structured XML format</article-title>
          .
          <source>In Document Analysis Systems VII, Lecture Notes in Computer Science</source>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>140</lpage>
          . Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Dozier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kondadadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Light</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vachher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wudali</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Named entity recognition and resolution in legal text</article-title>
          .
          <source>In Semantic Processing of Legal Texts, Lecture Notes in Computer Science</source>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>43</lpage>
          . Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Structure extraction from PDF-based book documents</article-title>
          .
          <source>In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11</source>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jack</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kern</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Unsupervised document structure analysis of digital scientific articles</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Nanda</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Siragusa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Di</given-names>
            <surname>Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Robaldo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Costamagna</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Concept recognition in European and national law</article-title>
          . In A. Z. Wyner and G. Casini (Eds.),
          <source>Legal Knowledge and Information Systems - JURIX</source>
          <year>2017</year>
          :
          <article-title>The Thirtieth Annual Conference</article-title>
          , Luxembourg,
          <fpage>13</fpage>
          -
          <lpage>15</lpage>
          December 2017, Frontiers in
          <source>Artificial Intelligence and Applications</source>
          , pp.
          <fpage>193</fpage>
          . IOS Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Paick</surname>
            ,
            <given-names>Y. Y. K.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Y. P. Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>PDF2xml: Converting PDF to XML.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patnia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Burns</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Layout-aware text extraction from full-text PDF of scientific articles</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Schweighofer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Semantic indexing of legal documents</article-title>
          .
          <source>In Semantic Processing of Legal Texts, Lecture Notes in Computer Science</source>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>169</lpage>
          . Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>