<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Structural Analysis of Contract Renewals</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Applied Sciences and Arts Hanover</institution>
          ,
          <addr-line>Expo Plaza 12, 30539 Hanover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>06</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>In the present paper we sketch an automated procedure to compare different versions of a contract. The contract texts used for this purpose are structurally differently composed PDF files that are converted into structured XML files by identifying and classifying text boxes. A classifier trained on manually annotated contracts achieves an accuracy of 87% on this task. We align contract versions and classify aligned text fragments into different similarity classes that enhance the manual comparison of changes in document versions. The main challenges are to deal with OCR errors and different layouts of identical or similar texts. We demonstrate the procedure using some freely available contracts from the City of Hamburg written in German. The methods, however, are language agnostic and can be applied to other contracts as well.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Most contracts between insurance and reinsurance
companies are updated annually. This results in many
versions of a contract which are structurally and
contentwise similar, but which must be completely checked
again for a new contract approval. A main obstacle
to efficient comparison of old and new versions of the
contracts is the fact that the entire approval process is
paper based. Insurance companies might send paper
versions of the contracts to several reinsurance
companies, each of which puts stamps and signatures on the
contract.</p>
      <p>Copyright © CIKM 2018 for the individual papers by the papers'
authors. Copyright © CIKM 2018 for the volume as a collection
by its editors.</p>
      <p>Of course all contracts are scanned and stored
electronically, but the paper version is in the lead. As
intelligent support for the legal domain, we present an
approach in which we convert contracts, based on PDF
documents, into a structured XML format in order to
efficiently find the changed, added or deleted clauses
in the new contract version.</p>
      <p>For all changed clauses we will predict the impact of
the change, or at least determine whether the change is
only a stylistic or linguistic improvement or correction,
or whether the interpretation of the clause is affected.
Furthermore, for all changed and new clauses we will
check whether the clause is part of a collection of
standard clauses or was used in another contract before. In
the present paper, we demonstrate a first version of the
detection of changes in the contracts. Our procedure
was developed and evaluated with German contract
texts, but the method is language agnostic and can be
applied to contracts in other languages as well.</p>
      <p>For the development of the methods we got access
to a collection of 100,000 contracts of an insurance
company. Since the contracts cannot be made available
publicly, we used a small set of freely available contracts
for the present study.</p>
      <p>Our approach basically consists of four steps: first
we extract rectangular text areas from the PDF
document. In the second step we classify all text areas
into structural classes like header, footer, heading, etc.
and merge some adjacent areas of the same type. On
the basis of this structure, two documents are aligned.
Finally, the aligned text areas are compared in more
detail. An overview of the process flow of our structure
analysis of versions of legal texts is shown in Figure
1.</p>
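      <p>The four steps above can be sketched as a minimal pipeline skeleton. This is an illustrative sketch only: the function bodies are placeholders, not the implementation described in the paper.</p>

```python
# Sketch of the four-step pipeline; the helper internals are simplified
# placeholders standing in for the real extraction, classification,
# alignment and comparison steps.

def extract_text_areas(pdf_pages):
    """Step 1: collect the rectangular text areas of all pages."""
    return [box for page in pdf_pages for box in page]

def classify_and_merge(boxes):
    """Step 2: assign structural classes (header, footer, heading, body, ...)
    and merge adjacent boxes of the same class. Placeholder: everything
    becomes 'body'."""
    return [(box, "body") for box in boxes]

def align_documents(doc_a, doc_b):
    """Step 3: align the two block sequences (insert/delete/substitute).
    Placeholder: pair blocks positionally."""
    return list(zip(doc_a, doc_b))

def compare_blocks(aligned):
    """Step 4: compare aligned blocks in more detail."""
    return ["identical" if a == b else "different" for a, b in aligned]

def compare_versions(pdf_a, pdf_b):
    blocks_a = classify_and_merge(extract_text_areas(pdf_a))
    blocks_b = classify_and_merge(extract_text_areas(pdf_b))
    return compare_blocks(align_documents(blocks_a, blocks_b))
```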
      <p>In the following we describe related work, give a detailed
description of the approach, and present an evaluation of the
classifier trained for the classification of the text areas.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Gao et al.
        <xref ref-type="bibr" rid="ref4">(Gao et al., 2011)</xref>
        use a method similar to
ours for PDF files to analyze the structure of books.
After converting the PDF, the content is extracted into
a physical and logical structure, and the text modules are
parsed and displayed. However, since these are books,
Gao et al. could assume that all pages have the
same layout. This enabled the definition of global
typographies. The authors divided the logical structure
into a page level and a document level. The page level
contains the hierarchical order of the text elements, the
header, figures, tables and footnotes. The document
level includes the writers' chapter structure and
metadata. For the extraction of the logical structure at page
level, the texts and individual letters were extracted
from these text blocks to obtain additional
characteristics such as boldface for a heading. The
extraction of the logical structure at document level
contained, for example, the title of the book. For header and footer
recognition we use a layout-based approach similar to
that of
        <xref ref-type="bibr" rid="ref2">Dejean and Meunier (2006)</xref>
        .
      </p>
      <p>
        This approach is based on the use of geometric
coordinates. In addition, they use the occurrence of digits
as an indicator for a text element in the header or footer,
and the length of the text. With the coordinates of the
text blocks in the PDF files a structural sorting per
page is possible. The recognition and merging of
contiguous text blocks from extracted PDF files is used, e.g.,
by
        <xref ref-type="bibr" rid="ref8">Ramakrishnan et al. (2012)</xref>
        . There is some work
dealing with extracting named entities (such as
companies, persons, places, etc.) from legal texts or finding
references to laws
        <xref ref-type="bibr" rid="ref3 ref6 ref9">(Dozier et al., 2010; Schweighofer,
2010; Nanda et al., 2017)</xref>
        . In
        <xref ref-type="bibr" rid="ref6">(Nanda et al., 2017)</xref>
        the
vocabulary IATE (Inter-Active Terminology for
Europe) is used to create an annotated corpus of named
entities and to use it for the NER for European and
British legal documents.
        <xref ref-type="bibr" rid="ref1">Chalkidis et al. (2017)</xref>
        use a
combination of state-of-the-art methods (such as word
embeddings and part-of-speech tag embeddings) to
extract typical contract elements from contract texts.
The conversion of content from the layout format of
a PDF file to the structured format of an XML file
with a small amount of human interaction is done as
described by
        <xref ref-type="bibr" rid="ref7">Paick and Zhang (2004)</xref>
        . The similarity of
the contract versions is compared via the text blocks
of the XML output. The word overlap is used as a
measure of the agreement between two text blocks of
the contract versions. This approach is described by
        <xref ref-type="bibr" rid="ref5">Klampfl et al. (2014)</xref>
        .
      </p>
    </sec>
    <sec id="sec-3">
      <title>Legal text structure analysis</title>
      <p>This section describes our approach to analyzing the PDF
structure and finding the differences between contract
versions.</p>
      <p>A simple line by line comparison of documents makes
no sense, since the addition of a single word can already
change the position of line or page breaks. Furthermore,
contracts are usually highly structured texts with lists
of definitions, figures, headers and footers on each page.
Figure 2 gives an example page of one of the contracts
we used. A simple extraction of all text would disturb
the natural text flow and insert header and footer text
at arbitrary points in the contract text. Thus, we
prefer to extract blocks of text, align the blocks of two
documents and compare the documents block by block.</p>
      <sec id="sec-3-1">
        <title>Document collection</title>
        <p>For training a classifier we use 4 non-public insurance
documents and 3 publicly available contracts. These
contracts are part of the open data strategy of the City
Administration Hamburg (Transparenzportal Hamburg:
http://transparenz.hamburg.de/). These 7 PDF documents
consist in total of 198 pages.</p>
        <p>From these pages we extracted 4046 text boxes using
PDFMiner (https://pypi.org/project/pdfminer/) and
classified them by hand. Figure 3 shows an example.</p>
        <p>The insurance contracts are written in English, the
contracts from Hamburg in German. Since our
approach is completely language agnostic, the documents
can be mixed for training without any problem.</p>
        <p>For the evaluation of the alignment and comparison
of contract versions we used 5 documents from the
City Administration Hamburg for which at least two
versions are available. Care was taken to ensure that
the pairs exhibit different degrees of change.
The selected contract versions were:</p>
        <list list-type="bullet">
          <list-item><p>HH1a/HH1b: version with additions</p></list-item>
          <list-item><p>HH2a/HH2b: very different (many handwritten notes)</p></list-item>
          <list-item><p>HH3a/HH3b: very similar contracts with different contractual partners</p></list-item>
          <list-item><p>HH4a/HH4b: the same contract, scanned at different angles</p></list-item>
          <list-item><p>HH5a/HH5b: year variants</p></list-item>
        </list>
        <p>The exact names and URLs of all test documents
used are given in the Appendix.</p>
        <sec id="sec-3-1-1">
          <title>Features and classification</title>
          <p>We obtain the coordinates of the text boxes, the font
information (bold and upper case) and the text of each box from the
parse of PDFMiner. The other features were calculated
based on this information. The feature "enumeration"
indicates whether the text of the box matches the
following regular expression (in Perl syntax):
\(?([0-9]+|[A-Za-z])(\.([0-9]+|[A-Za-z]))*\)?$</p>
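          <p>The enumeration test can be sketched as follows. Note that the pattern is reconstructed from a garbled print rendering, so the exact original expression may differ in detail.</p>

```python
import re

# Matches enumeration labels such as "1", "(2)", "a)" or "1.2" in a text box.
# Reconstructed pattern; the original Perl expression may differ slightly.
ENUM_RE = re.compile(r'\(?([0-9]+|[A-Za-z])(\.([0-9]+|[A-Za-z]))*\)?$')

def is_enumeration(text):
    """True if the (stripped) box text looks like an enumeration label."""
    return ENUM_RE.match(text.strip()) is not None
```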
          <p>The distances to the adjacent elements were
calculated from the horizontal and vertical overlap of
their coordinates and their distances to the right and
left elements. The distance to the margins and the size
of the text field were also computed. The features for
bold and for upper case indicate whether all characters
in a text box are typeset in the respective way. Since
headings are often written in this way, we expect these
to be useful features. From the text we also calculate
the fraction of special (non-alphanumeric) characters.
Finally, we calculated the size, width and height
of the individual text fields.</p>
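          <p>The geometric and textual features described above might be computed along the following lines. The box representation and feature names here are illustrative assumptions, not the authors' exact feature set.</p>

```python
def box_features(box, page_width, page_height):
    """Compute layout features for one text box.

    `box` is assumed to be a dict carrying the PDFMiner bounding box
    (x0, y0, x1, y1), the box text, and an optional 'bold' flag; the
    feature names are illustrative."""
    x0, y0, x1, y1 = box["x0"], box["y0"], box["x1"], box["y1"]
    stripped = box["text"].replace(" ", "")
    n = max(len(stripped), 1)
    special = sum(1 for c in stripped if not c.isalnum())
    return {
        "width": x1 - x0,
        "height": y1 - y0,
        "size": (x1 - x0) * (y1 - y0),
        "left_margin": x0,                    # distance to the left page edge
        "right_margin": page_width - x1,      # distance to the right page edge
        "bottom_margin": y0,                  # PDF y-coordinates grow upward
        "top_margin": page_height - y1,
        "bold": box.get("bold", False),
        "upper": stripped.isupper(),          # all cased characters upper case
        "spec": special / n,                  # fraction of special characters
    }
```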
          <p>An SVM (Support Vector Machine) classifier with
RBF kernel was trained on this data set. The
parameters used are γ = 0.1 · 10^-5 and penalty
parameter C = 10. In addition we have calculated a
logistic regression model. The performance results of
SVM and logistic regression were almost identical. The
forecast values of the logistic regression are shown in
the "Evaluation and Results" section.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Alignment</title>
        <p>For the layout-based structure analysis, we have sorted
the text elements on each page from top to bottom,
and from left to right if the elements are placed next
to each other. Adjacent elements that have the same
class and whose areas are separated by a margin
smaller than the height of a text line are merged. Thus,
we correct a number of anomalies introduced by the
detection of text areas. E.g., in many cases the last
line of a paragraph is detected as a separate area if it
has only one or two words.</p>
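        <p>The merging rule can be sketched as follows. The box representation (vertical extent in PDF coordinates, class label, text, sorted top to bottom) is an assumption for illustration.</p>

```python
def merge_boxes(boxes, line_height):
    """Merge vertically adjacent boxes of the same class when the gap
    between them is smaller than one text line.

    Boxes are dicts with 'y_top', 'y_bottom', 'cls' and 'text', already
    sorted top-to-bottom; PDF y-coordinates grow upward, so the box
    higher on the page has the larger y values."""
    merged = []
    for box in boxes:
        if merged:
            prev = merged[-1]
            gap = prev["y_bottom"] - box["y_top"]
            if prev["cls"] == box["cls"] and 0 <= gap < line_height:
                # Same class and close enough: extend the previous box.
                prev["text"] += " " + box["text"]
                prev["y_bottom"] = box["y_bottom"]
                continue
        merged.append(dict(box))
    return merged
```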
        <p>For the alignment of the text boxes we consider
insertions, deletions and substitutions. For insertions
and deletions we assign a penalty of 1. The penalty for
a substitution of text t1 with t2 is defined as
D(t1, t2) = 1 − |v(t1) ∩ v(t2)| / |v(t1) ∪ v(t2)|,
where v(t) denotes the set of words of t, excluding stop
words. Using dynamic programming we find the
alignment with the minimum sum of penalties.</p>
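        <p>A minimal sketch of this alignment, with the word-overlap (Jaccard) substitution penalty and unit insert/delete costs. The dynamic program below only returns the minimum total penalty; recovering the actual alignment path would require the usual backtracking step.</p>

```python
def jaccard_penalty(t1, t2, stopwords=frozenset()):
    """Substitution penalty D(t1, t2) = 1 - |v1 ∩ v2| / |v1 ∪ v2|,
    where v(t) is the stop-word-free word set of t."""
    v1 = set(t1.lower().split()) - stopwords
    v2 = set(t2.lower().split()) - stopwords
    if not v1 and not v2:
        return 0.0
    return 1.0 - len(v1 & v2) / len(v1 | v2)

def align(blocks_a, blocks_b):
    """Minimum total penalty for aligning two block sequences
    (Needleman-Wunsch-style dynamic program): insertions and
    deletions cost 1, substitutions cost jaccard_penalty."""
    n, m = len(blocks_a), len(blocks_b)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # delete all blocks of A
    for j in range(1, m + 1):
        cost[0][j] = j  # insert all blocks of B
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1,  # delete blocks_a[i-1]
                cost[i][j - 1] + 1,  # insert blocks_b[j-1]
                cost[i - 1][j - 1]
                + jaccard_penalty(blocks_a[i - 1], blocks_b[j - 1]),
            )
    return cost[n][m]
```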
        <p>For the 10 test documents we find on average 24
text blocks per page after merging adjacent blocks.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Version Comparison</title>
        <p>Once two texts are aligned, we can start comparing
the documents. At the moment we do not analyze
insertions and deletions. With a simple heuristic we
try to classify pairs of aligned text fragments. We
distinguish between:</p>
        <list list-type="bullet">
          <list-item><p>Identical: texts are identical up to white space</p></list-item>
          <list-item><p>OCR errors: texts are identical, but there are differences due to OCR errors</p></list-item>
          <list-item><p>Small differences: at most 5 words inserted, deleted or substituted</p></list-item>
          <list-item><p>Different: more than 5 words are changed</p></list-item>
        </list>
        <p>To decide whether there are real differences or OCR
differences, we align the texts twice. First we
tokenize the text and compute the edit distance based
on words (i.e. the minimum number of words that
have to be inserted, deleted or changed to obtain the
new version from the old one). Then we compute the
character based edit distance. If the character based
edit distance is at most 2.5 times larger than the word
based edit distance, all changes in the words are just
small changes, replacing 2 or 3 characters. In this case
we assume that all changes are due to OCR errors.
However, we did not (yet) determine an optimal value
for this threshold.</p>
        <sec id="sec-3-3-1">
          <title>Evaluation and Results</title>
          <p>Using 10-fold cross validation, the accuracy of the
classifier (logistic regression) is 87%. The accuracy of the
majority classifier, which assigns each element to the
class body text, is 52%.</p>
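          <p>One possible reading of the similarity-class heuristic is sketched below: Levenshtein distance on word and character level, with the (not yet optimized) ratio 2.5 deciding between OCR noise and substantive change. The function and threshold names are illustrative, not the paper's exact implementation.</p>

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (word lists or strings)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,       # deletion
                         cur[j - 1] + 1,    # insertion
                         prev[j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        prev = cur
    return prev[n]

def classify_pair(old, new, ratio=2.5, max_words=5):
    """Heuristic similarity class for an aligned pair of text blocks.
    `ratio` is the word-vs-character edit-distance threshold; its optimal
    value was not determined in the paper."""
    if old.split() == new.split():
        return "identical"          # identical up to white space
    word_dist = edit_distance(old.split(), new.split())
    char_dist = edit_distance(old, new)
    if char_dist <= ratio * word_dist:
        return "OCR errors"         # changed words differ in few characters
    if word_dist <= max_words:
        return "small differences"
    return "different"
```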
          <p>As we can see from the confusion matrix (Table
1) and per class results (Table 2) the best results are
achieved for the most important classes: the header
and footer. These classes contain text that is not part
of the contract text and has to be separated clearly.
Most problems arise from confusion between headings
and body text.</p>
          <p>The contribution of each feature for the logistic
regression model is given in Figure 4. The boolean value
for an enumeration, the features indicating whether
there is a text element above and below (nb1+nb2) and
the fraction of special characters in a text element (spec)
are used most strongly. Interestingly, the position on
the page and the margins around a text box are hardly
used.</p>
          <p>We use the logical structure of the contracts (heading,
enumeration, body text) converted into an XML format
for the comparison of contract renewals. The results of
the comparison for the test data can be seen in Table
3. The extracted text boxes are compared as described
in section 3.4. As we can see here, for the text pair
HH3a/HH3b, e.g., our method found 186 identical text
boxes with a text length (measured in characters) of
30% of the contract. These two contracts have a
very similar structure but different contractual
partners. This means that the underwriters no longer
have to check these passages of the contract text
for consistency, thus making their work more efficient.</p>
          <p>As we can see in Table 3, there are many text
boxes that received the comparison degree "Different".
Again, these are often OCR errors, but they are
too numerous to be classified as "OCR errors" (see the
first example in Table 4, class "Different"). The second
example in the class "Different" shows that errors in
segmentation and hierarchical sorting also lead to the
classification "Different". Another problem is that the
text boxes recognized by PDFMiner are not always the
same in the two versions, and merging does not entirely
compensate for this, e.g. because one of the elements
was classified incorrectly.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Future Work</title>
      <p>In this paper we have shown that modifications in
contract renewals can be identified and analyzed using
supervised learning and text alignment.</p>
      <p>
        We want to continue this approach in further work
and improve the classification of the classes heading,
body text and enumeration. In addition, we want to
implement the recognition of named entities, as
described e.g. in
        <xref ref-type="bibr" rid="ref6">(Nanda et al., 2017)</xref>
        . Furthermore, the
text structure can be subdivided in more detail, and
further structural elements, such as text boxes containing
handwritten notes, can be included. We will improve
our approach by carrying out further tests with a larger
training corpus, tuning further parameter settings and
adding additional features such as font size. During
the course of the project, the existing XML structure
will also be transformed into a standardized legal XML
structure, as proposed by the "OASIS LegalXML
Electronic Court Filing TC". On this basis we plan the
clause analysis in the contract texts. The recognized
clauses will be checked against a collection of model
clauses, and the occurrence of the same or an almost identical
clause in other contracts will be checked. We plan to
visualize the status of each clause, like unchanged, found
in another contract, etc.
      </p>
      <p>With the visualization of the changes in the contract
renewals, a tool can then be implemented that provides
valuable support for underwriters and other legal
professionals and simplifies and improves
their daily work in the long term.</p>
      <sec id="sec-4-1">
        <title>Acknowledgements</title>
        <p>The authors would like to thank Fabian Schmieder
for many helpful discussions and pointing us to the
publicly available contracts of the City of Hamburg.</p>
        <p>OASIS LegalXML: https://www.oasis-open.org/
committees/tc_home.php?wg_abbrev=legalxml-courtfiling</p>
      </sec>
      <sec id="sec-4-2">
        <title>Appendix: Test Documents</title>
        <p>Training documents: HHTrain1, HHTrain2 and HHTrain3
from the Transparenzportal Hamburg, plus 4 non-public
reinsurance contracts.</p>
        <p>Test document file names (references HH1a-HH5b):
Aenderungsbescheid.pdf,
Befristete Genehmigung nach HBauO.pdf,
Akte 000.00-04.pdf,
Akte 000.00-04(1).pdf,
Akte FB63.51-06(1).pdf,
Akte FB63.51-06(3).pdf,
Akte 611.10-13(1).pdf,
Akte FB2a.809.13-254(1).pdf,
Akte FB2a.800.01-23(1).pdf</p>
        <p>Training document URLs:
http://suche.transparenz.hamburg.de/dataset/oeffentlichrechtlicher-vertrag-gehrecht-bebauungsplan-harburg-59-theodoryork-strasse?forceWeb=true
http://suche.transparenz.hamburg.de/dataset/aenderungsverfahrenfuer-vertrag-6328-zuvex-weitere-schritte-zur-anbindung-externernutzer?forceWeb=true
http://suche.transparenz.hamburg.de/dataset/v6921-unterstuetzungsleistung-mobility-vertrag?forceWeb=true</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Chalkidis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Michos</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Extracting contract elements</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Dejean</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Meunier</surname>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>A system for converting PDF documents into structured XML format</article-title>
          .
          <source>In Document Analysis Systems VII, Lecture Notes in Computer Science</source>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>140</lpage>
          . Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Dozier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kondadadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Light</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vachher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Veeramachaneni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Wudali</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Named entity recognition and resolution in legal text</article-title>
          .
          <source>In Semantic Processing of Legal Texts, Lecture Notes in Computer Science</source>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>43</lpage>
          . Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Structure extraction from PDF-based book documents</article-title>
          .
          <source>In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11</source>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Klampfl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Jack</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kern</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Unsupervised document structure analysis of digital scientific articles</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Nanda</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Siragusa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Di</given-names>
            <surname>Caro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Boella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Robaldo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Costamagna</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Concept recognition in European and national law</article-title>
          . In A. Z. Wyner and G. Casini (Eds.),
          <source>Legal Knowledge and Information Systems - JURIX</source>
          <year>2017</year>
          :
          <article-title>The Thirtieth Annual Conference</article-title>
          , Luxembourg,
          <fpage>13</fpage>
          -
          <lpage>15</lpage>
          December 2017, Frontiers in
          <source>Artificial Intelligence and Applications</source>
          , pp.
          <fpage>193</fpage>
          . IOS Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Paick</surname>
            ,
            <given-names>Y. Y. K.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Y. P. Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (
          <year>2004</year>
          ).
          <article-title>PDF2xml: Converting PDF to XML.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Patnia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hovy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Burns</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Layout-aware text extraction from full-text PDF of scientific articles</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Schweighofer</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Semantic indexing of legal documents</article-title>
          .
          <source>In Semantic Processing of Legal Texts, Lecture Notes in Computer Science</source>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>169</lpage>
          . Springer, Berlin, Heidelberg.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>