<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Classification of Contract-Amendment Relationships</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fuqi Song</string-name>
          <email>fsong@hyperlex.ai</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science</institution>
          ,
          <addr-line>Hyperlex, 13 Rue de la Grange Batelière, 75009 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In Contract Life-cycle Management (CLM), managing and tracking master agreements and their associated amendments is essential in order to stay informed of the different due dates and obligations. An automatic solution can facilitate the daily work of legal practitioners and improve their efficiency. This paper proposes an approach based on machine learning (ML) and Natural Language Processing (NLP) to detect the amendment relationship between two documents. The algorithm takes two PDF documents preprocessed by OCR (Optical Character Recognition) and NER (Named Entity Recognition) as input, then builds the features of each document pair and classifies the relationship. Different configurations are experimented on a dataset consisting of 1124 pairs of contract-amendment documents in English and French. The best result obtained an F1-score of 91%, outperforming a heuristic-based baseline by 23%. Keywords: amendment detection, document linking, NLP, relationship classification, contract life-cycle management.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Problem Statement</title>
      <p>In Contract Life-cycle Management (CLM), contracts and other documents are not isolated
elements. There exist links among them, the most common and important one being the
contract-amendment relationship between a master agreement (MA) and an amendment.
Tracking and handling such links is essential in different CLM tasks, so as to lower potential
legal risks and stay up to date with the evolution of a contract through its amendments.
Conventionally, the task is performed manually or semi-automatically within a digital solution,
which is time-consuming and error-prone. A fully automatic solution is expected to overcome
these drawbacks and to facilitate the CLM process.</p>
      <p>This article therefore proposes a method for automatically detecting linked documents based
on machine learning algorithms and NLP techniques. The problem can be formulated as a
binary classification problem that takes two documents as input and classifies the relationship
between them. A key challenge is to identify a good feature set for the classification algorithms.
I apply an ML-driven preprocessing pipeline, consisting principally of OCR and NER. The pipeline
outputs the recognized document content and named entities, such as corporate names and
contract numbers. Depending on the quality and content of documents, the extracted text
and named entities might contain errors and be inaccurate. Taking these factors into account
and trying to be robust, this article proposes a similarity and cross-reference-based approach
for extracting features from a pair of documents. The approach is robust to different errors
introduced during preprocessing and allows taking multiple uncertain factors into account to
classify the relationship.</p>
      <p>RELATED - Relations in the Legal Domain Workshop, in conjunction with ICAIL 2021, June 25, 2021, São Paulo, Brazil. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        The general schema of the entire process is illustrated in Figure 1: the pipeline takes a pair
of PDF documents as input and performs three main steps to detect whether or not the two
documents are related, namely, preprocessing, feature extraction, and classification. The paper
focuses on the definition of the features and the classification algorithms. The preprocessing
step that applies OCR (Optical Character Recognition), NER (Named Entity Recognition [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ])
and entity aggregation [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] will not be elaborated.
      </p>
      <p>The rest of the paper is organized as follows: Section 2 analyzes the key features that can
distinguish related contract-amendment documents from nonrelated ones and explains
how the features are represented. Section 3 presents the dataset and the baseline algorithm
used to evaluate the approach. Section 4 experiments with different configurations to classify the
relationships and analyzes the benchmarking results. Section 5 discusses two typical application
scenarios using the contract-amendment classification as a key component. Section 6 draws
some conclusions and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Feature Building</title>
      <sec id="sec-2-1">
        <title>2.1. Analysis</title>
        <p>
          In quantifying the contract-amendment relationship of two documents, the following
key pieces of information that distinguish related documents from nonrelated ones
are identified:
• Document name: In general, the document name provides many indications that help to
deduce the relationship between a pair of documents. Indeed, the document name often
follows certain patterns (which vary across persons and organizations),
for instance, Contract No. X12345.pdf and Contract No. X12345 Amendment 1.pdf;
• Legal parties: Very frequently, the legal parties engaged in the contract are the same for
the master agreement and its amendments, with the roles of legal parties in a contract
explained in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ];
• Document body: The document bodies of the master contract and its amendments tend to be
similar in general and are semantically related. An amendment recalls the key information
in the master contract and specifies the modifications in relation to the master contract;
• References: The indices that are referred to explicitly in two documents to establish the
relationship. The typical ones are dates and contract numbers; for instance, “... Contract
N°X12345 signed on 14 May 2003 ...” is a typical way for an amendment to express
the relationship with the master agreement.
        </p>
        <p>Once the features are identified, the next question is how to represent these features with
numerical values. One of the key issues is that the extracted information is not 100% accurate,
for instance, the extracted dates or legal parties might be inaccurate or missing. Therefore,
this paper proposes to build the features based on the distance between two pieces of
information in two documents. Section 2.2 explains the representation of a single document and
Section 2.3 illustrates the feature representation of a document pair that will be used to classify
the relationship.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Document Representation</title>
        <p>A preprocessed document is formally denoted by
𝒟 = (name, text, legal_parties, keywords, nature),
wherein:</p>
        <p>
• name denotes the file name given by users;
• text denotes the plain text extracted by OCR;
• legal_parties lists all distinct corporate names extracted by NER from the clause of
declaration of parties;
• keywords includes named entities extracted by NER that could be used as cross references,
more specifically dates and contract identifiers in this paper. It is however important
to note that the same entity may play distinct roles in the master agreement and in the
amendment: for instance, a named entity typed signature date in the master agreement may
appear as a plain date when it is used as a reference in an amendment;
• nature represents the type of a document in three categories: contract, amendment or other.
nature is determined during preprocessing by a text classification algorithm
(with an F1-score of about 90%). This information is used to filter the document pairs to classify.
More precisely, the classification of relationships is only performed on pairs of
documents where one is a contract and the other is an amendment.</p>
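        <p>As an illustrative sketch (not the paper's actual code), the document tuple above can be modeled as a small Python class; the class and helper names simply mirror the notation of this section, and the filtering rule reflects the nature-based pairing described above.</p>
        <preformat>
```python
from dataclasses import dataclass
from typing import List

@dataclass
class Document:
    """Sketch of the preprocessed-document tuple defined in Section 2.2."""
    name: str                 # file name given by the user
    text: str                 # plain text extracted by OCR
    legal_parties: List[str]  # distinct corporate names extracted by NER
    keywords: List[str]       # cross-reference entities (dates, contract numbers)
    nature: str               # "contract", "amendment" or "other"

def is_candidate_pair(d1: Document, d2: Document) -> bool:
    """Relationship classification is only run on contract/amendment pairs."""
    return {d1.nature, d2.nature} == {"contract", "amendment"}

master = Document("Contract No. X12345.pdf", "...", ["ACME Corp"], ["X12345"], "contract")
amendment = Document("Contract No. X12345 Amendment 1.pdf", "...", ["ACME Corp"], ["X12345"], "amendment")
```
        </preformat>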
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Feature Representation</title>
        <p>
          The feature vector associated with a pair of documents (𝒟1, 𝒟2) is denoted as ℱ = (f1, f2, f3, f4),
wherein:
• f1 (Document name): f1 represents the similarity between the document names; the string
metric is a string and token-based compound metric described in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ];
• f2 (Document text): To compute the similarity of two texts, the first step is to embed the
texts as numerical vectors and then compute the cosine similarity [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In this article, two
embedding methods are tested, TF-IDF and FastText, as elaborated in Section 4;
• f3 (Legal parties): The absolute number of shared legal parties. For each corporate
name in 𝒟1.legal_parties, we compute the string similarity with each corporate name
in 𝒟2.legal_parties. When the similarity is greater than the defined threshold (0.85), we
increment the number of shared legal parties;
• f4 (References): The absolute number of shared keywords, computed following the same
principle as f3.
        </p>
        <p>
          Features f1 and f2 are real numbers in the range [0, 1] whereas features f3 and f4 are discrete
numbers (0, 1, 2, ...).
        </p>
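        <p>A minimal sketch of building the feature vector follows; the helper names are hypothetical, and difflib's ratio is used only as a stand-in for the compound string metric of [5], which is not reproduced here. The text similarity is assumed to be computed separately from the chosen embedding.</p>
        <preformat>
```python
from difflib import SequenceMatcher

def string_sim(a, b):
    # Simple proxy for the string/token compound metric of [5].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def shared_count(items1, items2, threshold=0.85):
    # Used for f3 and f4: count elements of items1 whose best string
    # similarity against items2 reaches the 0.85 threshold.
    return sum(1 for a in items1
               if any(string_sim(a, b) >= threshold for b in items2))

def build_features(d1, d2, text_sim):
    # d1, d2: (name, text, legal_parties, keywords) tuples; text_sim is the
    # cosine similarity of the embedded texts (TF-IDF or FastText).
    f1 = string_sim(d1[0], d2[0])    # document-name similarity, in [0, 1]
    f2 = text_sim                    # document-text similarity, in [0, 1]
    f3 = shared_count(d1[2], d2[2])  # number of shared legal parties
    f4 = shared_count(d1[3], d2[3])  # number of shared keywords
    return (f1, f2, f3, f4)
```
        </preformat>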
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset and Baseline</title>
      <p>
        The dataset1 consists of 1124 pairs of documents in a contract-amendment relationship, drawn from
real contracts of different companies. The dataset has been annotated manually by legal experts,
who were presented with the pairs of potentially linked documents. The dataset includes different
types of contracts with different levels of quality. 617 pairs of documents are in French and
507 pairs in English, in order to test a robust multilingual approach. As for the negative samples,
1124 document pairs have been sampled randomly from the contract base and verified manually
by legal experts. The measurements are precision, recall and F1-score (macro) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The dataset is preprocessed using the pipeline illustrated in Figure 1, which outputs the
processed documents in the format defined in Section 2.2. The correlation between the selected
features and the relationships to classify is analyzed in Figure 2. The figure shows the histogram
and density of each feature in relation to the link types. TF-IDF is used as the embedding for
computing the text similarity. A clear pattern can be observed between the related and nonrelated
datasets on the four features, the values being generally higher for related pairs.
However, no single feature is discriminating enough to separate related pairs from nonrelated ones.
(1: Due to confidentiality reasons, the dataset is not publicly accessible.)</p>
      <p>To the best of the author’s knowledge, due to the specificity of the research problem, few
works have been published on the topic of classification of contract-amendment relationships.
To evaluate the proposed approach, a heuristic-based baseline is used without the application
of ML techniques, namely, using only the document name and the extracted text. The rules are
as follows: if the similarities of document name and text (with TF-IDF) between two documents
are both greater than 0.5 (a threshold read from Figure 2), the two documents are considered related;
otherwise they are considered nonrelated.</p>
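      <p>The baseline rule can be stated in a few lines; the function below is a sketch of that rule, not the exact implementation used in the experiments.</p>
      <preformat>
```python
def baseline_related(name_sim, text_sim, threshold=0.5):
    """Heuristic baseline: related iff both the document-name similarity and
    the TF-IDF text similarity exceed the 0.5 threshold from Figure 2."""
    return name_sim > threshold and text_sim > threshold
```
      </preformat>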
    </sec>
    <sec id="sec-4">
      <title>4. Classification</title>
      <p>Different configurations are experimented with, to evaluate the impact of each variable on the
classification and to find the best configuration for the final model.</p>
      <p>
        • Text embedding: TF-IDF [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and FastText [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for evaluating the impact on the document
content feature. FastText is a non-contextual word embedding taking subwords into
account, which is potentially more robust to OCR errors. The pretrained English and
French word vectors2 are used in the experiments;
• Transformations: The strategies to transform the feature values: 1) None: no
transformation; 2) Binary: real values are mapped to 0 or 1 relative to a threshold; and
3) Decimal: real values are mapped proportionally to an integer between 0 and 10;
• Classification algorithms: 1) Random Forest (RF), 2) Linear SGD classifier (Linear),
and 3) Multi-layer perceptron (MLP).
      </p>
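      <p>The three transformation strategies can be sketched as follows; the function name and the 0.5 binarization threshold are illustrative assumptions, not values stated in the paper.</p>
      <preformat>
```python
def transform(value, strategy="none", threshold=0.5):
    """Feature-value transformation strategies compared in the experiments:
    'none' keeps the raw value, 'binary' maps it to 0/1 relative to a
    threshold, 'decimal' maps [0, 1] proportionally onto an integer 0..10."""
    if strategy == "none":
        return value
    if strategy == "binary":
        return 1 if value >= threshold else 0
    if strategy == "decimal":
        return round(value * 10)
    raise ValueError("unknown strategy: " + strategy)
```
      </preformat>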
      <p>All configurations use the same split of the dataset with 60% for the training set, 20% for the
validation set, and 20% for the test set. Table 1 lists the results of all combinations of diferent
variables. The first row lists the scores of the non-ML baseline. The best ML configuration is
the combination of RF, TF-IDF, and None transformation, which obtained an F1-score of 90.9%,
with a strong gain of 23% over the baseline.</p>
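      <p>The evaluation protocol (60/20/20 split, RF classifier, macro F1) can be reproduced in outline with scikit-learn; since the dataset is not public, the snippet below runs on synthetic four-dimensional feature vectors and is only a sketch of the setup, not the reported experiment.</p>
      <preformat>
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, n)
# Related pairs tend to have higher feature values (cf. Figure 2),
# simulated here by shifting the positive class.
X = rng.random((n, 4)) + y[:, None] * 0.5

# 60% train, 20% validation, 20% test, as in the experiments.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
```
      </preformat>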
      <p>In Figure 3, to better understand the impact of each variable, the average F1-score
for each variable is computed, aggregating over the other variables. On the one hand, it is observed
that the RF classifier performs better than MLP and Linear, while no transformation of the feature
values performs better than the other two strategies. On the other hand, it appears that there is no
significant difference between the two text embedding representations, FastText and TF-IDF.
Additionally, there is no significant difference observed between French and English documents,
which is not surprising since the selected features are generic and language-independent.</p>
      <p>In terms of errors, about 35% arise from some required piece of information not being
correctly extracted (partial, missing or incorrect); for instance, a missed contract number
will lead to an incomplete feature representation. About 25% of errors are due to confusion
between amendments and other document types (such as appendixes) that exhibit naming
patterns similar to amendments. 20% of errors result from the trained model not being able
to capture the specific conventions followed by each organization, for instance, the contract
referencing system. For the remaining 20%, no reason could be clearly identified.
(2: https://fasttext.cc/docs/en/crawl-vectors.html)</p>
    </sec>
    <sec id="sec-5">
      <title>5. Applications</title>
      <p>Identifying the amendment relationship between a pair of documents is a key step for real-life
CLM automation. For instance, the following two concrete scenarios have been identified and
implemented:
• Linked documents suggestion: When the user uploads a new document, the software
can suggest a list of potentially linked documents to the user. The user then simply needs to
verify and validate the suggested documents instead of searching for or selecting
the linked documents manually;
• Automatic sorting: When new users upload their contracts (generally a large
volume) for the first time, this function can help them structure their contract database
by making explicit the links between master agreements and their amendments.</p>
      <p>In practice, some settings and thresholds on the prediction probability may differ according
to the above-mentioned scenarios. In the first application, we wish to favor recall so as to
suggest all possibly linked documents: a relatively low probability threshold is sufficient.
Additionally, a parameter top_x can be set to limit the number of suggestions, namely, to keep
the x best predictions. In the second case, we prefer to set a higher threshold to
guarantee high precision, in order to ensure that the automatically sorted documents are
correctly linked.</p>
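      <p>A sketch of how the two scenarios might tune the decision step; the function, threshold values, and top_x default are hypothetical illustrations of the recall-versus-precision trade-off described above.</p>
      <preformat>
```python
def select_links(probs, mode="suggestion", top_x=5):
    """probs: list of (document_id, predicted link probability).
    'suggestion' favours recall: low threshold, capped at top_x results.
    'sorting' favours precision: high threshold, no cap."""
    threshold = 0.3 if mode == "suggestion" else 0.9
    ranked = sorted((p for p in probs if p[1] >= threshold),
                    key=lambda p: p[1], reverse=True)
    return ranked[:top_x] if mode == "suggestion" else ranked
```
      </preformat>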
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Works</title>
      <p>This paper addresses the problem of contract-amendment classification in CLM and shows
promising results. A distance and cross-reference-based approach is proposed to build
the features of a pair of documents, and several configurations are evaluated to classify the
relationships. The best configuration outperforms the baseline, a heuristic-based method without
machine learning, by 23% in terms of F1-score. The obtained results can be applied to different
application scenarios in CLM automation and bring real benefits to final users.</p>
      <p>
        Based on the error analysis performed in Section 4, the following aspects will be studied in
the future to improve the approach:
• Reinforce preprocessing: As about 35% of errors are related to required
information not being well extracted, strengthening OCR and NER could mitigate these
issues. Furthermore, these improvements will contribute to other tasks in the whole CLM
pipeline;
• Train a model per user: To capture user preferences, training on the dataset of each
user would help make the model more customized and accurate;
• Improve cross-reference detection: The current method uses the number of
shared keywords as a feature to detect cross-document references. However, this can be
improved with Named Entity Linking (NEL) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], particularly a graph-based approach [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ];
• Explore textual features: The current distance-based features could be enriched by
adding textual features using methods such as Doc2Vec and BERT;
• Fine-tune the settings: The thresholds for computing features f3 and f4 are chosen
empirically and can be fine-tuned in order to find the optimal configuration.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>I thank Dr. Éric de la Clergerie, researcher at INRIA (team Alpage3) and the members of Data
Science team at Hyperlex4 for discussions and comments that improved this manuscript.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nadeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          ,
          <article-title>A survey of named entity recognition and classification</article-title>
          ,
          <source>Lingvisticae Investigationes: International Journal of Linguistics and Language Resources</source>
          <volume>30</volume>
          (
          <year>2007</year>
          ).
          <source>doi:10.1075/li.30.1.03nad.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <article-title>A survey on recent advances in named entity recognition from deep learning models</article-title>
          , arXiv preprint arXiv:1910.11470 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Latapy</surname>
          </string-name>
          ,
          <article-title>Computing communities in large networks using random walks</article-title>
          ,
          <source>Journal of Graph Algorithms and Applications</source>
          <volume>10</volume>
          (
          <year>2006</year>
          ).
          <source>doi:10.7155/jgaa.00124.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fortunato</surname>
          </string-name>
          , Community detection in graphs,
          <source>Physics Reports</source>
          <volume>486</volume>
          (
          <year>2010</year>
          ).
          <source>doi:10.1016/j.physrep.2009.11.002.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Song</surname>
          </string-name>
          , É. de la Clergerie,
          <article-title>Clustering-based automatic construction of legal entity knowledge base from contracts</article-title>
          ,
          <source>in: 2020 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>2149</fpage>
          -
          <lpage>2152</lpage>
          .
          <source>doi:10.1109/BigData50022.2020.9378166.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          , Introduction to information retrieval, volume
          <volume>39</volume>
          , Cambridge University Press, Cambridge,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Powers</surname>
          </string-name>
          ,
          <article-title>Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation</article-title>
          , arXiv preprint arXiv:2010.16061 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          , Data Mining, Cambridge University Press,
          <year>2011</year>
          , p.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
          <source>doi:10.1017/CBO9781139058452.002.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Enriching word vectors with subword information</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>5</volume>
          (
          <year>2017</year>
          )
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hachey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nothman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Honnibal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Curran</surname>
          </string-name>
          ,
          <article-title>Evaluating entity linking with wikipedia</article-title>
          , volume
          <volume>194</volume>
          ,
          <year>2013</year>
          .
          <source>doi:10.1016/j.artint.2012.04.005.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hachey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Curran</surname>
          </string-name>
          ,
          <article-title>Graph-based named entity linking with wikipedia</article-title>
          ,
          <source>in: International conference on web information systems engineering</source>
          , Springer,
          <year>2011</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>