    A First Step Towards Automatic Consolidation of Legal Acts: Reliable
                   Classification of Textual Modifications

               Samuel Fabrizi, Maria Iacono, Andrea Tesei and Lorenzo De Mattei
                                      Aptus.AI / Pisa, Italy
                     {samuel,maria,andrea,lorenzo}@aptus.ai



                        Abstract

The automatic consolidation of legal texts, that is the integration into a text of its successive amendments and corrigenda, might have an important practical impact on public institutions, citizens and organizations. This process involves two steps: a) the classification of the textual modifications in amendment acts and b) the integration of such modifications within a single document. In this work we propose a methodology to solve step a) by exploiting Machine Learning and Natural Language Processing techniques on the Italian versions of European Regulations: our results suggest that the proposed methodology is a reliable first milestone towards the automatic consolidation of legal texts.

1   Introduction

Consolidation consists of the integration in a legal act of its successive amendments and corrigenda.1 Consolidated texts are very important for legal practitioners, but their maintenance is a tedious task. Some regulatory publishers, such as Normattiva2, provide continuously updated consolidated texts; others, such as Eur-Lex3, do so only from time to time; some others do not at all. The automation of this process could let institutions save resources and give practitioners access to continuously updated consolidated documents, and it would let organizations stay compliant with regulations more easily. The consolidation process involves two main steps: a) the identification and classification of the textual modifications in amendment acts; b) the integration within a single document of the textual modifications identified in the previous step. The first step can be expressed as the automatic classification of textual modifications inside a legal document. In this work, we focus on step a).

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Eur-Lex, About consolidation, https://bit.ly/2VFyGhv
2 Normattiva, https://www.normattiva.it/
3 Eur-Lex, https://eur-lex.europa.eu/
Several authors have tried to solve this task using standard Natural Language Processing (NLP) techniques. Ogawa et al. (2008) showed that the amendment clauses found in Japanese statutes can be formalized in terms of sixteen regular expressions. Lesmo et al. (2009) tried to identify and classify integrations, substitutions and deletions using a three-step approach: 1) prune text fragments that do not convey relevant information, 2) perform the syntactic analysis of the retrieved sentences, 3) semantically annotate the provisions with rules operating on the syntactic trees. In this last step, they also used a knowledge base that describes the provisions taxonomy (Arnold-Moore, 1997).4 Brighi et al. (2008) and Spinosa et al. (2009) followed a similar approach: in both cases, semantic analysis is carried out on the syntactically pre-processed text using a rule-based approach, the difference being the starting point of the semantic analysis. The former's system relied on a deep semantic analysis of the textual modifications, while the latter started from the shallow syntactically parsed text. Garofalakis et al. (2016) presented a semi-automatic system, based on regular expressions, for the consolidation of Greek legislative texts. Francesconi and Passerini (2007) defined a module that automatically classifies paragraphs into provision types: each paragraph is represented as a Bag of Words, either with TF-IDF weighting (Salton and Buckley, 1988) or with binary weights, and the authors presented an experimental comparison of the different representation methods using Naive Bayes and Multiclass Support Vector Machine (MSVM) models.
This paper describes our approach to the classification of textual modifications, namely substitutions, additions, repeals and abolitions. The proposed approach is based on standard statistical NLP techniques (Manning and Schutze, 1999). Our method involves i) the use of XML-based standards for the annotation of legislative documents, ii) the construction of the dataset by assigning a label to each word according to the tagging format used, and iii) the implementation of NLP models to identify and classify textual modifications. We carried out a systematic comparison among several feature extraction techniques and models. The main contribution of this paper is the application of machine learning models to the classification of textual modifications. In contrast to rule-based or regular expression techniques, our models do not need expert knowledge about the properties of the application domain: they try to learn the formulas used to introduce a textual modification without the need for an explicit definition of all the formulas. Our approach leads to lower maintenance costs and, hopefully, to increased robustness of the system.

4 A legislative provision represents the meaning of a part of a law from a legal point of view. Obligations, definitions and modifications are specific types of provision.

2   Data

We extracted the data from Daitomic5, a product that contains all the regulations from a set of legal sources, automatically encoded in the Akoma Ntoso standard format (Palmirani and Vitali, 2011). From this product we collected all the Italian versions of the amendment documents originally extracted from Eur-Lex, and we randomly sampled 260 legal documents for manual labelling.
According to the Eur-Lex web service specifications6, we identified seven different types of textual modifications:

• replacement annotates a substitution, which may concern a part of a sentence (an expression, word, date or amount) or a whole subdivision of the document (an article, paragraph or indent). Usually, this type of textual modification also includes the following subcategories:
    – from annotates the replaced words ("novellando");
    – to annotates the words that replace the previous ones ("novella").

• replacement ref is a type of replacement; we use it to handle textual modifications that include attachments.

• addition annotates textual modifications that add to or complete a part of a legal document.

• repeal indicates the removal or reversal of a law; it is used to invalidate its provisions altogether.

• abolition indicates the removal of a part of a law; it is used to replace the law with an updated, amended or related law. This textual modification may involve single words or whole subdivisions, as in replacements.

      Category           Total
      replacement          308
        from                95
        to                  95
      replacement ref       34
      addition              96
      repeal                93
      abolition             92

Table 1: Total number of textual modifications for each category

Table 2 reports an example for each of the mentioned categories, and Table 1 shows the total number of textual modifications per category. The number of replacement examples is greater than that of the other types of modification because substitutions can be introduced by many different formulas, each determining a specific meaning. Indeed, a preliminary experiment showed a relationship of proportionality between the number of formulas used to introduce a type of textual modification and the number of examples needed to train the models; for this reason, we needed a different number of examples for each category.
Given the differences among the modification types, we preferred to split the original problem into five subtasks, namely:
1. replacement classification, which also contains the replacement ref category;
2. addition classification;
3. repeal classification;
4. abolition classification;
5. from to classification.

The manual annotation consisted of assigning, for each subtask, one label to each token of the selected documents, indicating whether or not the token represents a textual modification. We defined three different tagging formats: Inside-Outside-Beginning (IOB), Inside-Outside (IO) and Limit-Limit (LL). The first two tagging formats are standard.7 The last one, instead, uses the prefix "L-" to indicate that a token is either the beginning or the end of a textual modification. We adopted a specific tagging format for each model based on our preliminary results; the tagging format turned out to be one of the most critical choices for improving model performance.
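
To make the three formats concrete, the following sketch (our illustration, not the authors' code) tags a five-token repeal sentence under each scheme. The paper only specifies the "L-" prefix for the boundary tokens of an LL span, so the "I-" tag on the interior tokens is our assumption.

    # Illustrative sketch of the three tagging formats. Assumption: the
    # interior tokens of an LL span keep an "I-" tag (the paper specifies
    # only the "L-" prefix for the two boundary tokens).

    def tag_span(n_tokens, fmt, label="repeal"):
        """Tag sequence for a modification span covering n_tokens tokens."""
        if fmt == "IO":
            return [f"I-{label}"] * n_tokens
        if fmt == "IOB":
            return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 1)
        if fmt == "LL":
            if n_tokens == 1:
                return [f"L-{label}"]
            return ([f"L-{label}"]
                    + [f"I-{label}"] * (n_tokens - 2)
                    + [f"L-{label}"])
        raise ValueError(f"unknown format: {fmt}")

    tokens = ["Il", "regolamento", "è", "abrogato", "."]
    print(tag_span(len(tokens), "IOB"))
    # ['B-repeal', 'I-repeal', 'I-repeal', 'I-repeal', 'I-repeal']
    print(tag_span(len(tokens), "LL"))
    # ['L-repeal', 'I-repeal', 'I-repeal', 'I-repeal', 'L-repeal']
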
The dataset used for the last subtask is different. Indeed, the from and to tags are always enclosed within the replacement tags, and none of our tagging formats permits nesting (Dai, 2018). Therefore, we changed the dataset itself to train the models: we considered only the tokens inside the sentences representing a replacement and tagged them using the aforementioned tagging formats. In this way, we avoided the nesting issue.

5 Daitomic, https://www.daitomic.com/
6 Eur-Lex, How to use the webservice?, https://bit.ly/393qt9Z
7 Breckbaldwin, Coding Chunkers as Taggers: IO, BIO, BMEWO, and BMEWO+, https://bit.ly/3DzuqBc

2.1   Preprocessing

Each model needs a different preprocessing method for the raw text of the legal documents, depending on the feature extractor used. Only a few preprocessing operations are common to all models:

1. substitution of the special characters ≪ and ≫ with quote marks;

2. substitution of the words between quote marks with the special token QUOTES TEXT. This step allowed us to limit the number of tokens in each paragraph: the words between quote marks often represent a whole article (for example, the article to substitute or to add), so they are redundant for our task, and replacing them with a single special token improved the performance of all models. In the from to subtask, however, we avoided substituting the text between quotes, since keeping it led to a performance improvement.
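
A minimal sketch of these two shared operations follows (the function name and the regular expression are ours; the placeholder token name follows the paper):

    import re

    def preprocess(text: str, keep_quoted: bool = False) -> str:
        # Step 1: replace the special characters with plain quote marks.
        text = text.replace("≪", '"').replace("≫", '"')
        # Step 2: collapse each quoted span into the special token; skipped
        # for the from-to subtask (keep_quoted=True), where the quoted text
        # itself has to be tagged.
        if not keep_quoted:
            text = re.sub(r'"[^"]*"', "QUOTES_TEXT", text)
        return text

    print(preprocess('È aggiunto il seguente allegato: ≪ALLEGATO III [...]≫'))
    # È aggiunto il seguente allegato: QUOTES_TEXT
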
3   Experiments

For each subtask, we gathered the documents that contain one or more occurrences of the corresponding modification type. Then, we split the dataset into a training and a test set; more precisely, we used an 80/20 ratio with a stratified technique (Trost, 1986). We used the training set to validate the hyperparameters of each model and, once the final models were computed, we used the test set to measure their generalization ability. It is important to emphasise that we never used the internal test set before the definition of the final models.
The general pipeline is composed of the following steps:

1. The annotated documents are tokenized.

2. Each token is associated with one label for each category, following the tagging formats previously defined.

3. From each token, we extract a representation using hand-crafted features, character-level N-grams or word embeddings. Depending on the model used, both the tagging format and the feature extraction change.

4. We execute the model selection phase exploiting K-fold cross-validation, with K set to 3 so that the validation sets have a reasonable size. The purpose of this step is to find the best hyperparameters of each model.

5. For each subtask, we choose the model with the best performance in the previous step.

6. After choosing the best configuration of each model, we compute and compare their performances over the test set.
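
The following scikit-learn sketch illustrates this protocol under our assumptions (toy data and a linear SVM as a stand-in estimator; the actual features and models are described in the next sections):

    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import LinearSVC

    # Toy stand-ins: in the real pipeline each row is a windowed token
    # representation and each label a tag such as "O" or "L-repeal".
    X = [[i, i % 3] for i in range(100)]
    y = ["O" if i % 4 else "L-repeal" for i in range(100)]

    # Stratified 80/20 train/test split, as in the paper.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.20, stratify=y, random_state=0)

    # Model selection: 3-fold cross-validation with the macro F1 score,
    # run on the training set only.
    search = GridSearchCV(LinearSVC(), {"C": [0.1, 1.0, 10.0]},
                          scoring="f1_macro", cv=3)
    search.fit(X_tr, y_tr)  # refit=True re-trains the winner on all of X_tr

    # The test set is touched exactly once, at the very end.
    print(search.best_params_, search.score(X_te, y_te))
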
3.1   Feature Extraction

We applied several feature extraction techniques to figure out which one was the most effective, and in this section we describe them in depth. Given the nature of the task, all the features are extracted at the word level, and we define different sets of features according to the models' needs. We logically divided our features into hand-crafted features, n-gram features and word embeddings.

 replacement        All’articolo 7 della decisione 2005/692/CE, la data del ≪ < from > 31 dicembre 2010 </ from > ≫
                    è sostituita da ≪ < to > 30 giugno 2012 </ to > ≫.
 replacement ref    L’allegato II al regolamento (CE) n. 998/2003 è sostituito dal testo dell’
                    < replacement ref > allegato </ replacement ref > al presente regolamento.
 addition           È aggiunto il seguente allegato:
                    < addition > “ALLEGATO III [...]” </ addition >
 repeal             < repeal > Il regolamento (CEE) n. 160/88 è abrogato. </ repeal >
 abolition          nel titolo i termini “raccolti nel 1980” sono soppressi

                                      Table 2: Annotation examples


In the following we list the hand-crafted features extracted and their meaning:

• is upper: boolean value indicating whether the token is in uppercase
• is lower: boolean value indicating whether the token is in lowercase
• is title: boolean value indicating whether the token is in titlecase
• is alpha: boolean value indicating whether the token consists of alphabetic characters
• is digit: boolean value indicating whether the token consists of digits
• is punct: boolean value indicating whether the token is a punctuation mark
• pos val cg: coarse-grained part-of-speech tag from the Universal POS tag set (Kumawat and Jain, 2015); the text has been POS tagged with the SpaCy Italian model8
• is alnum: boolean value indicating whether all characters in the token are alphanumeric (either alphabetic or numeric)
• word lower: the token in lowercase
• word[-3:]: the last three characters of the token
• word[-2:]: the last two characters of the token

8 Spacy, Models, https://spacy.io/models/it
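
As a sketch, these features can be collected into a single feature function (our code; the dictionary keys mirror the list above, and the choice of the small Italian spaCy model is our assumption):

    import spacy

    nlp = spacy.load("it_core_news_sm")  # SpaCy Italian model, footnote 8

    def word_features(token):
        """Hand-crafted features for a single spaCy token."""
        return {
            "is_upper": token.text.isupper(),
            "is_lower": token.text.islower(),
            "is_title": token.text.istitle(),
            "is_alpha": token.text.isalpha(),
            "is_digit": token.text.isdigit(),
            "is_punct": token.is_punct,
            "pos_val_cg": token.pos_,  # coarse-grained Universal POS tag
            "is_alnum": token.text.isalnum(),
            "word_lower": token.text.lower(),
            "word[-3:]": token.text[-3:],
            "word[-2:]": token.text[-2:],
        }

    doc = nlp("Il regolamento è abrogato.")
    print(word_features(doc[1]))  # features of "regolamento"
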
                                                       value. The introduction of the sliding window has
Then, we decided to use a more complex represen-       made it possible to improve the evaluation metric
tation. We used a Count Vectorizer (Sarlis and         of all models.
Maglogiannis, 2020) computed over all the Ital-
ian legal documents contained in EUR-Lex at the        3.2   Models
date we created it. It converts a collection of text
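
A comparable character-level vectorizer can be built with scikit-learn as follows (a sketch under our assumptions: the corpus, the analyzer and the n-gram range are illustrative, whereas our actual vectorizer is fitted on all the Italian EUR-Lex legal documents and yields 376,037 features):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "Il regolamento (CEE) n. 160/88 è abrogato.",
        "È aggiunto il seguente allegato.",
    ]
    # Character n-grams within word boundaries; the range is illustrative.
    vec = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vec.fit_transform(corpus)  # sparse matrix of n-gram counts
    print(X.shape)  # (2, number of distinct character n-grams in the corpus)
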
Finally, we decided to use a word embedding lexicon, since word embeddings have been shown to provide good performance in other Italian tasks (De Mattei et al., 2018; Cimino et al., 2018). We tested a few different in-domain and general-purpose embedding lexicons, trained with both fastText (Bojanowski et al., 2017) and word2vec (Mikolov et al., 2013), and we obtained the best results with the pretrained Italian fastText model (Grave et al., 2018).
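
That lexicon can be loaded, for instance, with the official fasttext package (a sketch: cc.it.300.bin is the published Italian model of Grave et al. (2018), but whether the authors loaded it through this exact path is our assumption):

    import fasttext
    import fasttext.util

    # Download (once) and load the pretrained Italian vectors.
    fasttext.util.download_model("it", if_exists="ignore")  # cc.it.300.bin
    ft = fasttext.load_model("cc.it.300.bin")

    vec = ft.get_word_vector("allegato")  # 300-dimensional embedding
    print(vec.shape)  # (300,)
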
The features extracted from a single token do not contain enough information to discriminate the true amendment class. For this reason, we decided to introduce the sliding window concept (Dietterich, 2002): a set of tokens that precede and/or follow each token, like a "window" of fixed size that moves forward through the text. For each feature extraction technique, we introduced two parameters, window size and is bilateral window. The former indicates the dimension of the window; the latter is a boolean value indicating whether the window considers only the preceding tokens (False) or both the preceding and the following tokens (True). For example, with a bilateral sliding window of size 1, the sentence "È aggiunto il seguente allegato" becomes 〈(PAD, È, aggiunto), (È, aggiunto, il), (aggiunto, il, seguente), (il, seguente, allegato), (seguente, allegato, PAD)〉, where PAD indicates the padding value. The introduction of the sliding window made it possible to improve the evaluation metric of all models.
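
A sketch of the window construction (our code; the parameter names and the PAD convention follow the paper, and the printed output reproduces the example above):

    def sliding_windows(tokens, window_size=1, is_bilateral_window=True,
                        pad="PAD"):
        """Fixed-size window around each token, padded at the edges."""
        right_pad = window_size if is_bilateral_window else 0
        padded = [pad] * window_size + tokens + [pad] * right_pad
        windows = []
        for i in range(len(tokens)):
            j = i + window_size  # position of the focus token in `padded`
            left = padded[j - window_size:j]
            right = (padded[j + 1:j + 1 + window_size]
                     if is_bilateral_window else [])
            windows.append(tuple(left + [padded[j]] + right))
        return windows

    print(sliding_windows("È aggiunto il seguente allegato".split()))
    # [('PAD', 'È', 'aggiunto'), ('È', 'aggiunto', 'il'),
    #  ('aggiunto', 'il', 'seguente'), ('il', 'seguente', 'allegato'),
    #  ('seguente', 'allegato', 'PAD')]
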
3.2   Models

We want to find a fully automatic approach based on the extraction of interesting features. For this reason, we developed a systematic comparison among three models: a Support Vector Machine (SVM) with n-gram features, a Conditional Random Field (CRF) with hand-crafted features, and a Neural Network (NN) that uses word embeddings. The latter is a rather general convolutional network architecture: the input is the sliding window of words, represented as a matrix in which each row corresponds to the word embedding of one token. We decided to use a convolutional layer given its efficiency in terms of both representation and speed, as it allows us to capture local and position-invariant features (Yin et al., 2017) useful for our purpose. Then, we added a Batch Normalization layer, since normalization significantly reduces the training time in feedforward neural networks (Ba et al., 2016); during the experimental phase, we observed that layer normalization offers a speedup over the baseline model without normalization and stabilizes the training of the model. We also tried a Bidirectional Long Short-Term Memory model with an additional CRF layer (Bi-LSTM-CRF) (Huang et al., 2015), but its application led to poor performance in terms of both scores and speed. The results obtained show that our task is best solved by simple models that are able to discover local patterns.
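
As an illustration of the best-performing model family, here is a minimal CRF sketch built with the sklearn-crfsuite package (our choice of tooling, not necessarily the authors'; the toy features and hyperparameters are illustrative, not the tuned configuration from the model selection phase):

    import sklearn_crfsuite

    def feats(sent):
        """Tiny subset of the hand-crafted features of Section 3.1."""
        return [{"word_lower": w.lower(),
                 "is_title": w.istitle(),
                 "word[-3:]": w[-3:]} for w in sent]

    # Toy training data tagged with the LL format.
    sents = [["Il", "regolamento", "è", "abrogato", "."],
             ["È", "aggiunto", "il", "seguente", "allegato", ":"]]
    tags = [["L-repeal", "I-repeal", "I-repeal", "I-repeal", "L-repeal"],
            ["L-addition", "I-addition", "I-addition", "I-addition",
             "I-addition", "L-addition"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=50)
    crf.fit([feats(s) for s in sents], tags)
    print(crf.predict([feats(["Il", "regolamento", "è", "abrogato", "."])]))
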
4   Results

The objective of the evaluation was a systematic comparison of the models' performance with respect to macro F1, precision and recall. In the model selection step, we used the macro F1 score as the evaluation metric, since the frequency distribution of the labels turned out to be strongly unbalanced in all the subtasks.
After some preliminary experiments, we fixed the sliding window size and the tagging format for each model. We found that, from a performance perspective, the CRF and NN models favour a bigger sliding window size (5) than the SVM models (1); we think this difference comes from the Curse of Dimensionality problem that the SVM models may incur (Bengio et al., 2005). Concerning the tagging format, we adopted the LL tagging for all the models: our experiments show that it increases the F1 score by about 20 percentage points.
Table 3 reports the mean results over the 3 folds obtained by the best configuration of each model.

      Subtask        SVM      CRF      NN
      Replacement    0.868    0.881    0.841
      Addition       0.825    0.852    0.796
      Repeal         0.915    0.938    0.924
      Abolition      0.823    0.878    0.939
      From To        0.748    0.873    0.800

Table 3: Average results in terms of macro F1 score obtained in the validation phase

The CRF outperforms the other models in almost all the subtasks. We think this is due to the nature of the model: CRFs naturally consider both state-to-state and feature-to-state dependencies (Lafferty et al., 2001).
Once the model selection phase was completed, we chose the best model and its configuration for each subtask, considering both the mean and the standard deviation of the F1 metric among the folds. Then, we re-trained the best model on the whole training set. Table 4 reports the precision, recall and F1 scores of the best model for each subtask over the internal test set. The precision score is higher than the recall in all but one subtask, which is desirable from an application perspective.

                     Model    Prec.    Rec.     F1
      Replacement    CRF      0.949    0.864    0.902
      Addition       CRF      0.790    0.865    0.823
      Repeal         CRF      0.937    0.912    0.924
      Abolition      NN       0.951    0.912    0.931
      From To        CRF      0.977    0.841    0.899

Table 4: Precision, recall and F1 scores of the best model for each subtask

The models' performances improved compared to the results achieved in the model selection phase, probably thanks to the larger training set.
5   Conclusion

We presented and analysed a machine learning approach to the problem of classifying textual modifications. We compared different tagging formats, feature extraction techniques and machine learning models. Our experiments show that the sliding window approach, combined with a character count vectorizer or word embeddings, allows the models to capture most of the formulas that introduce textual modifications. Following the Occam's razor principle, we defined simple models that obtained good performances in all the subtasks. Our approach does not require any expertise in the legal field, since it learns the rules that identify textual modifications instead of requiring an explicit formalization; we use different NLP techniques to extract hidden features from the words inside a window.
The results validate our approach in terms of both correctness and stability, and they represent the first step towards a fully automatic model capable of identifying and integrating textual modifications.
References

Timothy Arnold-Moore. 1997. Automatic generation of amendment legislation. In ICAIL '97, pages 56–62.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. 2005. The curse of dimensionality for local kernel machines. Techn. Rep, 1258:12.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Raffaella Brighi, Leonardo Lesmo, Alessandro Mazzei, Monica Palmirani, and Daniele Radicioni. 2008. Towards semantic interpretation of legal modifications through deep syntactic analysis. Volume 189, pages 202–206.

Andrea Cimino, Lorenzo De Mattei, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural networks at EVALITA 2018. Proceedings of the Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, pages 86–95.

Xiang Dai. 2018. Recognizing complex entity mentions: A review and future directions. In Proceedings of ACL 2018, Student Research Workshop, pages 37–44, Melbourne, Australia. Association for Computational Linguistics.

Lorenzo De Mattei, Andrea Cimino, and Felice Dell'Orletta. 2018. Multi-task learning in deep neural network for sentiment polarity and irony classification. In NL4AI@AI*IA, pages 76–82.

Thomas G. Dietterich. 2002. Machine learning for sequential data: A review. In Terry Caelli, Adnan Amin, Robert P. W. Duin, Dick de Ridder, and Mohamed Kamel, editors, Structural, Syntactic, and Statistical Pattern Recognition, pages 15–30, Berlin, Heidelberg. Springer Berlin Heidelberg.

Enrico Francesconi and A. Passerini. 2007. Automatic classification of provisions in legislative texts. Artificial Intelligence and Law, 15:1–17.

John Garofalakis, Konstantinos Plessas, and Athanasios Plessas. 2016. A semi-automatic system for the consolidation of Greek legislative texts. In Proceedings of the 20th Pan-Hellenic Conference on Informatics, PCI '16, New York, NY, USA. Association for Computing Machinery.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging.

Deepika Kumawat and Vinesh Jain. 2015. POS tagging approaches: a comparison. International Journal of Computer Applications, 118(6).

John Lafferty, Andrew McCallum, and Fernando Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML.

Leonardo Lesmo, Alessandro Mazzei, and Daniele Radicioni. 2009. Extracting semantic annotations from legal texts. In HT '09, pages 167–172.

Christopher Manning and Hinrich Schutze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc.

Yasuhiro Ogawa, Shintaro Inagaki, and Katsuhiko Toyama. 2008. Automatic consolidation of Japanese statutes based on formalization of amendment sentences. In Ken Satoh, Akihiro Inokuchi, Katashi Nagao, and Takahiro Kawamura, editors, New Frontiers in Artificial Intelligence, pages 363–376, Berlin, Heidelberg. Springer Berlin Heidelberg.

Monica Palmirani and Fabio Vitali. 2011. Akoma-Ntoso for legal documents. Pages 75–100. Springer Netherlands, Dordrecht.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

S. Sarlis and I. Maglogiannis. 2020. On the reusability of sentiment analysis datasets in applications with dissimilar contexts. In Ilias Maglogiannis, Lazaros Iliadis, and Elias Pimenidis, editors, Artificial Intelligence Applications and Innovations, pages 409–418, Cham. Springer International Publishing.

Pierluigi Spinosa, Gerardo Giardiello, Manola Cherubini, Simone Marchi, Giulia Venturi, and Simonetta Montemagni. 2009. NLP-based metadata extraction for legal text consolidation. In ICAIL, pages 40–49.

Jan E. Trost. 1986. Statistically nonrepresentative stratified sampling: A sampling technique for qualitative studies. Qualitative Sociology, 9(1):54–57.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of CNN and RNN for natural language processing. arXiv preprint arXiv:1702.01923.