<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Title is (Not) All You Need for EuroVoc Multi-Label Classification of European Laws</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Bocchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Palmero Aprosio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Machine Learning and Artificial Intelligence approaches within Public Administration (PA) have grown significantly in recent years. Specifically, new guidelines from various governments recommend employing the EuroVoc thesaurus for the classification of documents issued by the PA. In this paper, we explore some methods to perform document classification in the legal domain, in order to mitigate the input-length limitation of BERT models. We first collect data from the European Union, already tagged with the aforementioned taxonomy. Then we reorder the sentences included in each text, with the aim of bringing the most informative part of the document into the first part of the text. Results show that the title and the context are both important, while the order of the sentences appears to matter less. Finally, we release on GitHub both the dataset and the source code used for the experiments.</p>
      </abstract>
      <kwd-group>
<kwd>EuroVoc taxonomy</kwd>
        <kwd>Sentence reordering</kwd>
        <kwd>Text classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics,
Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
lorenzo.bocchi@unitn.it (L. Bocchi); a.palmeroaprosio@unitn.it (A. Palmero Aprosio)
ORCID: 0000-0002-1484-0882 (A. Palmero Aprosio)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://bit.ly/eurovoc-ds
2 https://bit.ly/eurovoc-conference</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
<p>Documents on EUR-Lex are tagged with labels from the TC level; each TC is linked to an MT, which in turn belongs to a specific DO.</p>
      <p>The version of EuroVoc used for our studies is 4.17,
released on 31st January 2023, containing 7,382 TCs, 127
MTs, and 21 DOs.</p>
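<p>The three-layer structure can be sketched as a simple lookup that propagates a document's TC labels up to its MT and DO; the identifiers below are illustrative placeholders, not taken from the actual thesaurus:</p>

```python
# Hypothetical fragment of the EuroVoc hierarchy: every Thesaurus Concept
# (TC) is linked to a Micro Thesaurus (MT), which in turn belongs to a
# Domain (DO). The names below are illustrative only.
TC_TO_MT = {"air transport": "land, sea and air transport"}
MT_TO_DO = {"land, sea and air transport": "TRANSPORT"}

def expand_labels(tc_labels):
    # Propagate document labels from the TC level up to MT and DO,
    # e.g. to compute evaluation scores at each level of the hierarchy.
    mts = {TC_TO_MT[tc] for tc in tc_labels}
    dos = {MT_TO_DO[mt] for mt in mts}
    return mts, dos
```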
      <p>There have been a number of studies that explored the
classification of European legislation with EuroVoc labels.</p>
<p>JRC EuroVoc Indexer [4] is a tool that allows the categorization of documents with EuroVoc classifiers in 22 languages. The data used is contained in an older dataset [5] with documents up to 2006. The algorithm involves generating a collection of lemma frequencies and weights. These frequencies are associated with specific descriptors, referred to as associates or topic signatures in the paper. When classifying a new document, the algorithm selects the descriptors from the topic signatures that exhibit the highest similarity to the lemma frequency list of the new document.</p>
      <p>The research described in [6] explored the usage of Recurrent Neural Networks on extreme multi-label classification datasets, including RCV1 [7], Amazon-13K [8], Wiki-30K and Wiki-500K [9], and an older EUR-Lex dataset from 2007 [10].</p>
      <p>In [11] the authors explore the usage of different deep-learning architectures. Furthermore, the authors also released a dataset of 57,000 tagged documents from EUR-Lex.</p>
      <p>There are also other monolingual studies on the topic, mainly concentrating on Italian [12], Croatian [13], and Portuguese [1].</p>
      <p>More recent works on multi-language classification on EuroVoc are described in Chalkidis et al. [14], Shaheen et al. [15], and Wang et al. [16].</p>
      <sec id="sec-2-1">
        <title>3.3. Dataset collection</title>
        <p>To collect the documents for our task, we built a set of tools written in Python that can be customized to obtain different subsets of the data (year, language, etc.). In total, after filtering out the documents not tagged with EuroVoc or not containing an easily accessible text (for instance, old documents only available as scanned PDFs), we collect around 1.1 million documents in four languages (English, Italian, Spanish, French).</p>
        <p>As a subsequent task, we also removed labels that have been deprecated by the EuroVoc developers throughout the years.4 Following previous work [11], we also remove labels having fewer than 10 examples.</p>
        <p>Finally, by looking at the data, we see that the labelling became consistent starting from 2004, while many deprecated labels are still present in documents, especially before 2010. We therefore consider only documents published in the interval 2010-2022.</p>
        <p>The final dataset consists of 471,801 documents. On average, each law is labelled with 6 EuroVoc concepts. Table 1 shows some statistics about the dataset used.</p>
      </sec>
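<p>The filtering steps described for the dataset collection (restricting to 2010-2022, dropping deprecated labels, and removing labels with fewer than 10 examples) can be sketched as follows; this is a minimal illustration with a hypothetical document schema, not the released tooling:</p>

```python
from collections import Counter

def filter_dataset(docs, deprecated):
    # Keep only documents published in 2010-2022 (hypothetical schema:
    # each doc is a dict with "year" and "labels" keys).
    docs = [d for d in docs if d["year"] in range(2010, 2023)]
    # Drop labels deprecated by the EuroVoc developers over the years.
    for d in docs:
        d["labels"] = [l for l in d["labels"] if l not in deprecated]
    # Remove labels having fewer than 10 examples, as in previous work [11].
    counts = Counter(l for d in docs for l in d["labels"])
    for d in docs:
        d["labels"] = [l for l in d["labels"] if counts[l] >= 10]
    # Finally, discard documents left without any EuroVoc label.
    return [d for d in docs if d["labels"]]
```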
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Data split</title>
<p>EuroVoc’s hierarchical structure is divided into three layers: Thesaurus Concept (TC), Micro Thesaurus (MT, previously known as the “sub-sector” level), and Domain (DO, previously known as the “main sector” level).</p>
        <p>Each layer contains descriptors for documents, covering a broad range of EU-related subjects such as law, economics, social affairs, and the environment, each at varying levels of detail. The TC level is the foundational layer where all key concepts reside, and documents on EUR-Lex are tagged with labels from this level.</p>
        <p>The primary source for European legislation is EUR-Lex3, a web portal offering comprehensive access to EU legal documents. It is available in all 24 official languages of the European Union and is updated daily by its Publications Office. Most documents on EUR-Lex are manually categorized using EuroVoc concepts.</p>
        <p>To keep our experiments consistent with previous similar approaches [17], we split the data into train, dev, and test sets with an approximate ratio of 80/10/10, respectively. In order to make the training reproducible, and to avoid that a single random extraction could be too (un)lucky, we repeat the split using three different seeds and a pseudorandom number generator.</p>
        <p>Each partition into train/dev/test is done using
Iterative Stratification [ 18, 19], in order to preserve the
concept balance.</p>
        <p>Unless diferently specified, all the results in the rest
of the paper refer to the average of the values obtained
by our experiments on the three splits.</p>
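<p>The repeated, seeded 80/10/10 split can be sketched as follows; this sketch uses a plain pseudorandom shuffle for brevity, whereas the actual partitions additionally apply Iterative Stratification [18, 19] to preserve the concept balance:</p>

```python
import random

def split_dataset(doc_ids, seed):
    # Reproducible 80/10/10 train/dev/test split driven by a seeded
    # pseudorandom number generator. The paper repeats this with three
    # different seeds and averages the results over the three splits.
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train = int(n * 0.8)
    n_dev = int(n * 0.1)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test
```

Calling `split_dataset` with the same seed always yields the same partition, which is what makes the training reproducible.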
<p>3 https://eur-lex.europa.eu/</p>
        <p>4 https://bit.ly/eurovoc-handbook</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Dataset</title>
<p>In this Section, we describe the experiments performed on the above-described data.</p>
<p>Table 1 (statistics for the Italian subset): Total documents: 195,236; Documents with text and EuroVoc labels: 118,296; Number of EuroVoc labels used before filtering: 6,098; Number of EuroVoc labels having less than 10 documents: 2,070; Final number of labels: 4,028; Removed documents: 3.</p>
<sec id="sec-4-1">
        <title>4.2. Methodology</title>
        <p>Our models are trained using BERT [20] and its derivatives.</p>
        <p>The choice of the best pre-trained model is very important for the accuracy of the classification using the model obtained after fine-tuning. In particular, [21] shows that classification tasks over the legal domain obtain better performance when the models are pre-trained on legal corpora. Nevertheless, in some preliminary experiments we tried BERT models pre-trained on various datasets (among them, legal ones of course), and the results do not always award models built from legal texts.</p>
        <p>Although the difference was not statistically significant, we decided to use these models anyway (from HuggingFace5):</p>
        <p>• legal-bert-base-uncased [22], consisting of 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources;</p>
        <p>• bert-base-italian-xxl-cased [23], the main Italian BERT model, consisting of a recent Wikipedia dump and various texts from the OPUS corpora collection6 and data from the Italian part of the OSCAR corpus;7</p>
        <p>• bert-base-spanish-wwm-cased [24], also called BETO, a BERT model trained on a big Spanish corpus8 that consists of 3 billion words;</p>
        <p>• camembert-base [25], a state-of-the-art language model for French based on the RoBERTa model [26].</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.4. Pre-processing</title>
        <p>The text of the laws is preprocessed using spaCy,9 a Natural Language Processing pipeline that can extract information from texts in 24 languages. In particular, we used it to perform sentence splitting, part-of-speech tagging, and named-entity recognition, used to extract content words from the text and to perform the selection of the sentences that are used in the task.</p>
      </sec>
      <sec id="sec-4-2-1">
        <title>4.5. Summarization</title>
        <p>Given that the input length for these BERT models is 512 tokens, while legislative texts are usually longer, summarizing the text by using its most important parts, to make sure it fits in the input, was seen as an important step to follow.</p>
        <p>As underlined in the Introduction, the text of a law is usually very redundant, and its most representative part often comes after a notable sequence of preambles. Since the limit of 512 tokens is very strict compared to the usual length of a legal document, we concentrate our summarization effort on reordering the sentences inside a single document, so that the most informative part of the text can be brought to the beginning and therefore included in the first 512 tokens.</p>
        <p>We use two different approaches to reach this goal: TF-IDF and centroid-based. In both cases, we perform training with the sole reordered text and with the concatenation of the title and the above text.</p>
      </sec>
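<p>The reordering step can be sketched as follows: given a score per sentence (from the TF-IDF or centroid approaches), sentences are sorted by descending score so that the most informative ones fall inside the input window. The whitespace token count here is a simplification of the BERT subword tokenizer; this is a sketch, not the released code.</p>

```python
def reorder_and_truncate(sentences, scores, max_tokens=512):
    # Sort sentences by descending informativeness score, so the most
    # informative part of the document comes first.
    ranked = sorted(zip(sentences, scores), key=lambda p: p[1], reverse=True)
    kept, budget = [], max_tokens
    for sent, _ in ranked:
        n = len(sent.split())  # naive token count; BERT uses subwords
        if n > budget:
            break
        kept.append(sent)
        budget -= n
    return " ".join(kept)
```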
      <sec id="sec-4-3">
        <title>4.3. Basic configurations</title>
        <p>The basic configurations consist of using the sole title,
the sole text, and the concatenation of the title and the
text. Note that, apart from some rare outliers, title length
is consistently less than 50 tokens.</p>
<sec id="sec-4-3-1">
          <title>4.5.1. TF-IDF</title>
          <p>TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used technique in information retrieval and text mining to quantify the importance of terms in a document within a larger collection of documents. It aims to highlight terms that are both frequent within a document and relatively rare in the overall collection, thus capturing their discriminative power.</p>
          <p>The TF-IDF score of a term in a document is calculated by multiplying two factors: the term frequency (TF) and the inverse document frequency (IDF). Let t be the term, d the document, and D the collection:</p>
          <p>tf(t, d) = f_{t,d} / ∑_{t′∈d} f_{t′,d}</p>
          <p>idf(t, D) = log( N / (1 + |{d ∈ D : t ∈ d}|) )</p>
          <p>where f_{t,d} is the frequency of term t in document d, and N = |D| is the number of documents in the set D.</p>
          <p>Beyond the usual TF-IDF, we also perform a label-based approach, which considers one document for each label, obtained by concatenating all the texts belonging to the laws having that label.</p>
          <p>Once all the documents have gone through this process, the TF-IDF matrix is calculated using TfidfVectorizer from the Python package scikit-learn10 over the content words (see Section 4.4) of the texts.</p>
          <p>After obtaining the TF-IDF matrix, the final step is to assign a score to each sentence. For each valid base form, its score is determined from the TF-IDF matrix by selecting the highest value within the corresponding column (which represents a word). These scores are then added to a list for each sentence. Once a sentence is processed, the maximum or average score is calculated (“max” and “mean” in the results). This calculated value becomes the sentence’s score. The process is repeated for all sentences in every document.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.5.2. Centroid</title>
          <p>After obtaining the embedding for a sentence, its score is computed from the cosine similarity between the centroid C and the sentence embedding E_s:</p>
          <p>sim(E_s, C) = 1 − (E_s · C) / (||E_s|| × ||C||)</p>
          <p>By using the previously described approach, every text is converted into a list of ranked sentences, each with its own score.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.6. Random</title>
          <p>Because of the obtained results (see Section 4.7), we also added two configurations that use a random ordering of the sentences (one concatenated with the title, the other one containing only the randomly ordered text).</p>
        </sec>
        <sec id="sec-4-3-4">
          <title>4.7. Evaluation</title>
          <p>The evaluation of our experiments is performed by using the F1 score, macro-averaged so that each label has the same weight (this metric awards models that perform better on less-represented labels). Since we are dealing with a multi-label classification task, we have to choose between always considering the same number k of results (P@k, R@k, F1@k) or keeping only the labels whose confidence is higher than a particular threshold (usually between 0 and 1). In our experiments, we chose the second approach, since the number of concepts in each document of the dataset is not constant. Given the evaluation performed on the development set, we set that threshold to 0.5.</p>
          <p>5 https://huggingface.co/ · 6 http://opus.nlpl.eu/ · 7 https://traces1.inria.fr/oscar/ · 8 https://bit.ly/big-spanish-corpora · 9 https://spacy.io/</p>
        </sec>
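<p>The TF-IDF sentence scoring can be sketched as follows; this is a simplified re-implementation of the tf and idf formulas above with “max”/“mean” aggregation, whereas the actual pipeline uses scikit-learn’s TfidfVectorizer over content words:</p>

```python
import math

def tfidf_sentence_scores(docs, mode="max"):
    # docs: list of documents, each a list of sentences (lists of words).
    N = len(docs)
    # df: number of documents containing each term.
    df = {}
    for doc in docs:
        for t in set(w for s in doc for w in s):
            df[t] = df.get(t, 0) + 1
    all_scores = []
    for doc in docs:
        words = [w for s in doc for w in s]
        total = len(words)
        tf = {}
        for w in words:
            tf[w] = tf.get(w, 0) + 1
        # tf(t, d) = f_{t,d} / sum_t' f_{t',d}; idf(t, D) = log(N / (1 + df(t)))
        tfidf = {t: (c / total) * math.log(N / (1 + df[t])) for t, c in tf.items()}
        sent_scores = []
        for s in doc:
            vals = [tfidf[w] for w in s]
            # "max" keeps the highest word score; "mean" averages them.
            score = max(vals) if mode == "max" else sum(vals) / len(vals)
            sent_scores.append(score)
        all_scores.append(sent_scores)
    return all_scores
```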
      </sec>
      <sec id="sec-4-4">
        <title>4.8. Results</title>
      </sec>
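<p>The threshold-based evaluation described in Section 4.7 can be sketched as follows: labels whose confidence exceeds 0.5 are predicted, and the F1 is macro-averaged so that each label has the same weight. This is a simplified stand-in for a standard implementation such as scikit-learn’s f1_score:</p>

```python
def macro_f1(confidences, gold, labels, threshold=0.5):
    # confidences: one dict per document mapping each label to a confidence.
    # gold: one set of gold labels per document.
    f1s = []
    for lab in labels:
        tp = fp = fn = 0
        for conf, g in zip(confidences, gold):
            pred = conf.get(lab, 0.0) > threshold
            if pred and lab in g:
                tp += 1
            elif pred:
                fp += 1
            elif lab in g:
                fn += 1
        denom = 2 * tp + fp + fn
        # Per-label F1; macro-averaging gives each label the same weight.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```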
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
<p>Results show that the best performances are reached when the title is included in the text (see the rows without “not”), with the exception of the simple use of the text without reordering. An interesting outcome is that the experiment using title+random obtains very good results when compared to the best configurations.</p>
      <p>On the contrary, using the random text without the title, or using the sole title, results in a decrease in global performance.</p>
<p>In this approach, described in [27], the centroid of the word vectors in the text is calculated; then a score is assigned to each sentence based on its cosine distance from the centroid. The closer a sentence is to the centroid, the higher the score it receives. In our approach, we use fastText [28] for word embeddings.</p>
      <p>The words used to compute the centroid are those that have been extracted as content words (see Section 4.4) and have a TF-IDF higher than a certain threshold t, which in this case was 0.3. The centroid is computed as the mean of the word embeddings of the previously selected words:</p>
      <p>C = ( ∑_{w∈W} E[idx(w)] ) / |W|</p>
      <p>where W is the set of words with tfidf(w) &gt; t.</p>
      <p>Each sentence in the document is transformed into a single embedding representation by averaging the embedding vectors of the words in the sentence:</p>
      <p>E_{s_j} = ( ∑_{w∈s_j} E[idx(w)] ) / |s_j|</p>
      <p>where s_j is the j-th sentence in the document.</p>
      <p>[Table: results for the configurations basic, basic-not, centroid, centroid-not, title-only, tfidf-max-doc, tfidf-max-lab, tfidf-mean-doc, tfidf-mean-lab, tfidf-max-doc-not, tfidf-max-lab-not, tfidf-mean-doc-not, tfidf-mean-lab-not, random, random-not.]</p>
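<p>The centroid scoring can be sketched as below; tiny hand-written vectors stand in for the fastText [28] embeddings, and sentences are ranked by cosine similarity to the centroid (higher meaning closer):</p>

```python
import math

def centroid_scores(sentences, emb, tfidf, threshold=0.3):
    # Words used for the centroid: content words with TF-IDF above the threshold.
    selected = [w for s in sentences for w in s if tfidf.get(w, 0.0) > threshold]
    dim = len(next(iter(emb.values())))
    # Centroid: mean of the word embeddings of the selected words.
    centroid = [sum(emb[w][i] for w in selected) / len(selected) for i in range(dim)]
    scores = []
    for s in sentences:
        # Sentence embedding: mean of the word vectors in the sentence.
        vec = [sum(emb[w][i] for w in s) / len(s) for i in range(dim)]
        dot = sum(a * b for a, b in zip(vec, centroid))
        norm = math.sqrt(sum(a * a for a in vec)) * math.sqrt(sum(a * a for a in centroid))
        # Cosine similarity: sentences closer to the centroid score higher.
        scores.append(dot / norm)
    return scores
```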
<p>By looking at the statistical significance,11 we find that we can split, more or less, the experiments into two big groups: the ones that in the English part of the table have a DO F1 above 0.80, and the remaining ones that stay below 0.79. The exception is the “title-only” configuration, which obtains lower accuracy in all languages and contrasts with the results obtained in a similar previous work applied to Italian laws [3], where the use of the sole title results in an increase in performance with respect to the concatenation of title and text.</p>
      <p>By listing the documents where EuroVoc labels are not extracted correctly, it seems that in European legislation it is quite common to find very generic titles. For instance, the title of the document with ID “CELEX:32011Q0624(01)” is “Rules of procedure for the appeal committee (Regulation (EU) No 182/2011)”, from which it is very hard to extract relevant information about the topic. One can find other similar documents, such as “Action brought on 2 March 2011 — Attey v Council”, the title of the law with ID “CELEX:62011TN0118”.</p>
      <p>In general, our experiments show that the classification of European laws obtains the best performance on BERT when all the possible tokens are filled, possibly using the title and some parts of the text. The high accuracy obtained in the experiments performed by randomly reordering the sentences demonstrates that the context is important per se, even when no particular strategies are used to select it.</p>
      <p>French results bring significantly lower accuracy: this is not expected and is probably due to the choice of the BERT pre-trained model.</p>
      <p>11 To calculate statistical significance, a one-tailed t-test with a significance level of .05 was applied to the scores of the five runs, with the null hypothesis that no difference is observed, and the alternative hypothesis that the score obtained with the summarized text is significantly greater than the one obtained with the normal text.</p>
      <sec id="sec-6">
        <title>6. Release</title>
        <p>The source code for all the experiments (from the retrieval of the documents to the training of the models), the data downloaded from EUR-Lex, and the models are available on the project GitHub page.12</p>
      </sec>
      <sec id="sec-7">
        <title>7. Conclusions and Future Work</title>
        <p>In this paper, we presented some approaches to perform document classification on long documents, by reordering their sentences before the fine-tuning phase. The best results are obtained when all the 512 tokens allowed in the BERT paradigm are filled, possibly including the title of the law.</p>
        <p>In the future, we want to extend this approach to other languages, trying to understand whether the same reordering algorithm leads to some improvement in the classification task. We will also investigate other summarization approaches, or new architectures that rely on Local, Sparse, and Global attention [29], so that longer texts (up to 16K tokens) can be used to train the model.</p>
      </sec>
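<p>The one-tailed paired t-test of footnote 11 can be sketched as follows; the hand-rolled function is a stand-in for scipy.stats.ttest_rel, and the critical value 2.132 assumes the one-tailed .05 threshold for four degrees of freedom (five runs):</p>

```python
import math

def paired_one_tailed_t(summarized, baseline, t_crit=2.132):
    # Paired one-tailed t-test: is the score with the summarized text
    # significantly greater than the score with the normal text?
    diffs = [a - b for a, b in zip(summarized, baseline)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    # Reject the null hypothesis (no difference) when t exceeds the
    # critical value for n-1 degrees of freedom.
    return t, t > t_crit
```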
<p>12 https://github.com/bocchilorenzo/AutoEuroVoc</p>
    </sec>
  </body>
  <back>
    <ref-list>
<ref id="ref1">
        <mixed-citation>[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[21] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, in: Proceedings of the Natural Legal Language Processing Workshop 2019, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 78-87. URL: https://aclanthology.org/W19-2209. doi:10.18653/v1/W19-2209.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[22] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898-2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[23] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[24] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[25] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, B. Sagot, CamemBERT: a tasty French language model, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[27] G. Rossiello, P. Basile, G. Semeraro, Centroid-based text summarization through compositionality of word embeddings, in: Proceedings of the MultiLing 2017 workshop on summarization and summary evaluation across source types and genres, 2017, pp. 12-21.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[28] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135-146.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[29] C. Condevaux, S. Harispe, LSG attention: Extrapolation of pretrained transformers to long sequences, in: Advances in Knowledge Discovery and Data Mining, Springer, 2023, pp. 443-454.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>