<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Title is (Not) All You Need for EuroVoc Multi-Label Classification of European Laws</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Bocchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Palmero Aprosio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Machine Learning and Artificial Intelligence approaches within Public Administration (PA) have grown significantly in recent years. Specifically, new guidelines from various governments recommend employing the EuroVoc thesaurus for the classification of documents issued by the PA. In this paper, we explore some methods to perform document classification in the legal domain, in order to mitigate the input-length limitation of BERT models. We first collect data from the European Union, already tagged with the aforementioned taxonomy. Then we reorder the sentences included in each text, with the aim of bringing the most informative part of the document into the first part of the text. Results show that the title and the context are both important, while the order of the sentences appears to matter less. Finally, we release on GitHub both the dataset and the source code used for the experiments.</p>
      </abstract>
      <kwd-group>
<kwd>EuroVoc taxonomy</kwd>
        <kwd>Sentence reordering</kwd>
        <kwd>Text classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics,
Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
lorenzo.bocchi@unitn.it (L. Bocchi); a.palmeroaprosio@unitn.it (A. Palmero Aprosio)
ORCID: 0000-0002-1484-0882 (A. Palmero Aprosio)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 https://bit.ly/eurovoc-ds
2 https://bit.ly/eurovoc-conference</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
<p>Documents on EUR-Lex are tagged with labels from the TC level; each TC is linked to an MT, which in turn belongs to a specific DO.</p>
      <p>The version of EuroVoc used for our studies is 4.17,
released on 31st January 2023, containing 7,382 TCs, 127
MTs, and 21 DOs.</p>
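<p>The three-layer structure can be sketched as a simple lookup that propagates a document's TC labels up to its MT and DO; the identifiers below are illustrative placeholders, not taken from the actual thesaurus:</p>

```python
# Hypothetical fragment of the EuroVoc hierarchy: every Thesaurus Concept
# (TC) is linked to a Micro Thesaurus (MT), which in turn belongs to a
# Domain (DO). The names below are illustrative only.
TC_TO_MT = {"air transport": "land, sea and air transport"}
MT_TO_DO = {"land, sea and air transport": "TRANSPORT"}

def expand_labels(tc_labels):
    # Propagate document labels from the TC level up to MT and DO,
    # e.g. to compute evaluation scores at each level of the hierarchy.
    mts = {TC_TO_MT[tc] for tc in tc_labels}
    dos = {MT_TO_DO[mt] for mt in mts}
    return mts, dos
```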
      <p>There have been a number of studies that explored the
classification of European legislation with EuroVoc labels.</p>
<p>JRC EuroVoc Indexer [4] is a tool that allows the categorization of documents with EuroVoc classifiers in 22 languages. The data used is contained in an older dataset [5] with documents up to 2006. The algorithm involves generating a collection of lemma frequencies and weights. These frequencies are associated with specific descriptors, referred to as associates or topic signatures in the paper. When classifying a new document, the algorithm selects the descriptors from the topic signatures that exhibit the highest similarity to the lemma frequency list of the new document.</p>
      <p>The research described in [6] explored the usage of Recurrent Neural Networks on extreme multi-label classification datasets, including RCV1 [7], Amazon-13K [8], Wiki-30K and Wiki-500K [9], and an older EUR-Lex dataset from 2007 [10].</p>
      <p>In [11] the authors explore the usage of different deep-learning architectures. Furthermore, the authors also released a dataset of 57,000 tagged documents from EUR-Lex.</p>
      <p>There are also other monolingual studies on the topic, mainly concentrating on Italian [12], Croatian [13], and Portuguese [1].</p>
      <p>More recent works on multi-language classification on EuroVoc are described in Chalkidis et al. [14], Shaheen et al. [15], and Wang et al. [16].</p>
      <sec id="sec-2-1">
        <title>3.3. Dataset collection</title>
        <p>To collect the documents for our task, we built a set of tools written in Python that can be customized to obtain different subsets of the data (year, language, etc.). In total, after filtering out the documents not tagged with EuroVoc or not containing an easily accessible text (for instance, old documents only available as scanned PDFs), we collect around 1.1 million documents in four languages (English, Italian, Spanish, French).</p>
        <p>As a subsequent task, we also removed labels that have been deprecated by the EuroVoc developers throughout the years.4 Following previous work [11], we also remove labels having fewer than 10 examples.</p>
        <p>Finally, by looking at the data, we see that the labelling became consistent starting from 2004, while many deprecated labels are still present in documents, especially before 2010. We therefore consider only documents published in the interval 2010-2022.</p>
        <p>The final dataset consists of 471,801 documents. On average, each law is labelled with 6 EuroVoc concepts. Table 1 shows some statistics about the dataset used.</p>
      </sec>
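<p>The filtering steps described for the dataset collection (restricting to 2010-2022, dropping deprecated labels, and removing labels with fewer than 10 examples) can be sketched as follows; this is a minimal illustration with a hypothetical document schema, not the released tooling:</p>

```python
from collections import Counter

def filter_dataset(docs, deprecated):
    # Keep only documents published in 2010-2022 (hypothetical schema:
    # each doc is a dict with "year" and "labels" keys).
    docs = [d for d in docs if d["year"] in range(2010, 2023)]
    # Drop labels deprecated by the EuroVoc developers over the years.
    for d in docs:
        d["labels"] = [l for l in d["labels"] if l not in deprecated]
    # Remove labels having fewer than 10 examples, as in previous work [11].
    counts = Counter(l for d in docs for l in d["labels"])
    for d in docs:
        d["labels"] = [l for l in d["labels"] if counts[l] >= 10]
    # Finally, discard documents left without any EuroVoc label.
    return [d for d in docs if d["labels"]]
```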
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>4.1. Data split</title>
<p>EuroVoc’s hierarchical structure is divided into three layers: Thesaurus Concept (TC), Micro Thesaurus (MT, previously known as the “sub-sector” level), and Domain (DO, previously known as the “main sector” level).</p>
        <p>Each layer contains descriptors for documents, covering a broad range of EU-related subjects such as law, economics, social affairs, and the environment, each at varying levels of detail. The TC level is the foundational layer where all key concepts reside, and documents on EUR-Lex are tagged with labels from this level.</p>
        <p>The primary source for European legislation is EUR-Lex3, a web portal offering comprehensive access to EU legal documents. It is available in all 24 official languages of the European Union and is updated daily by its Publications Office. Most documents on EUR-Lex are manually categorized using EuroVoc concepts.</p>
        <p>To keep our experiments consistent with previous similar approaches [17], we split the data into train, dev, and test sets with an approximate ratio of 80/10/10, respectively. In order to make the training reproducible, and to avoid that a single random extraction could be too (un)lucky, we repeat the split using three different seeds and a pseudorandom number generator.</p>
        <p>Each partition into train/dev/test is done using
Iterative Stratification [ 18, 19], in order to preserve the
concept balance.</p>
        <p>Unless diferently specified, all the results in the rest
of the paper refer to the average of the values obtained
by our experiments on the three splits.</p>
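<p>The repeated, seeded 80/10/10 split can be sketched as follows; this sketch uses a plain pseudorandom shuffle for brevity, whereas the actual partitions additionally apply Iterative Stratification [18, 19] to preserve the concept balance:</p>

```python
import random

def split_dataset(doc_ids, seed):
    # Reproducible 80/10/10 train/dev/test split driven by a seeded
    # pseudorandom number generator. The paper repeats this with three
    # different seeds and averages the results over the three splits.
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train = int(n * 0.8)
    n_dev = int(n * 0.1)
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test
```

Calling `split_dataset` with the same seed always yields the same partition, which is what makes the training reproducible.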
<p>3 https://eur-lex.europa.eu/</p>
        <p>4 https://bit.ly/eurovoc-handbook</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Dataset</title>
<p>In this Section, we describe the experiments performed on the above-described data.</p>
<p>Table 1 (statistics for the Italian subset): Total documents: 195,236; Documents with text and EuroVoc labels: 118,296; Number of EuroVoc labels used before filtering: 6,098; Number of EuroVoc labels having less than 10 documents: 2,070; Final number of labels: 4,028; Removed documents: 3.</p>
<sec id="sec-4-1">
        <title>4.2. Methodology</title>
        <p>Our models are trained using BERT [20] and its derivatives.</p>
        <p>The choice of the best pre-trained model is very important for the accuracy of the classification using the model obtained after fine-tuning. In particular, [21] shows that classification tasks over the legal domain obtain better performance when the models are pre-trained on legal corpora. Nevertheless, in some preliminary experiments we tried BERT models pre-trained on various datasets (among them, legal ones of course), and the results do not always award models built from legal texts.</p>
        <p>Although the difference was not statistically significant, we decided to use these models anyway (from HuggingFace5):</p>
        <p>• legal-bert-base-uncased [22], consisting of 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources;</p>
        <p>• bert-base-italian-xxl-cased [23], the main Italian BERT model, consisting of a recent Wikipedia dump and various texts from the OPUS corpora collection6 and data from the Italian part of the OSCAR corpus;7</p>
        <p>• bert-base-spanish-wwm-cased [24], also called BETO, a BERT model trained on a big Spanish corpus8 that consists of 3 billion words;</p>
        <p>• camembert-base [25], a state-of-the-art language model for French based on the RoBERTa model [26].</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.4. Pre-processing</title>
        <p>The text of the laws is preprocessed using spaCy,9 a Natural Language Processing pipeline that can extract information from texts in 24 languages. In particular, we used it to perform sentence splitting, part-of-speech tagging, and named-entity recognition, used to extract content words from the text and to perform the selection of the sentences that are used in the task.</p>
      </sec>
      <sec id="sec-4-2-1">
        <title>4.5. Summarization</title>
        <p>Given that the input length for these BERT models is 512 tokens, while legislative texts are usually longer, summarizing the text by using its most important parts, to make sure it fits in the input, was seen as an important step to follow.</p>
        <p>As underlined in the Introduction, the text of a law is usually very redundant, and its most representative part often comes after a notable sequence of preambles. Since the limit of 512 tokens is very strict compared to the usual length of a legal document, we concentrate our summarization effort on reordering the sentences inside a single document, so that the most informative part of the text can be brought to the beginning and therefore included in the first 512 tokens.</p>
        <p>We use two different approaches to reach this goal: TF-IDF and centroid-based. In both cases, we perform training with the sole reordered text and with the concatenation of the title and the above text.</p>
      </sec>
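<p>The reordering step can be sketched as follows: given a score per sentence (from the TF-IDF or centroid approaches), sentences are sorted by descending score so that the most informative ones fall inside the input window. The whitespace token count here is a simplification of the BERT subword tokenizer; this is a sketch, not the released code.</p>

```python
def reorder_and_truncate(sentences, scores, max_tokens=512):
    # Sort sentences by descending informativeness score, so the most
    # informative part of the document comes first.
    ranked = sorted(zip(sentences, scores), key=lambda p: p[1], reverse=True)
    kept, budget = [], max_tokens
    for sent, _ in ranked:
        n = len(sent.split())  # naive token count; BERT uses subwords
        if n > budget:
            break
        kept.append(sent)
        budget -= n
    return " ".join(kept)
```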
      <sec id="sec-4-3">
        <title>4.3. Basic configurations</title>
        <p>The basic configurations consist of using the sole title,
the sole text, and the concatenation of the title and the
text. Note that, apart from some rare outliers, title length
is consistently less than 50 tokens.</p>
<sec id="sec-4-3-1">
          <title>4.5.1. TF-IDF</title>
          <p>TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used technique in information retrieval and text mining to quantify the importance of terms in a document within a larger collection of documents. It aims to highlight terms that are both frequent within a document and relatively rare in the overall collection, thus capturing their discriminative power.</p>
          <p>The TF-IDF score of a term in a document is calculated by multiplying two factors: the term frequency (TF) and the inverse document frequency (IDF). Let t be the term, d the document, and D the collection:</p>
          <p>tf(t, d) = f_{t,d} / ∑_{t′∈d} f_{t′,d}</p>
          <p>idf(t, D) = log( N / (1 + |{d ∈ D : t ∈ d}|) )</p>
          <p>where f_{t,d} is the frequency of term t in document d, and N = |D| is the number of documents in the set D.</p>
          <p>Beyond the usual TF-IDF, we also perform a label-based approach, which considers one document for each label, obtained by concatenating all the texts belonging to the laws having that label.</p>
          <p>Once all the documents have gone through this process, the TF-IDF matrix is calculated using TfidfVectorizer from the Python package scikit-learn10 over the content words (see Section 4.4) of the texts.</p>
          <p>After obtaining the TF-IDF matrix, the final step is to assign a score to each sentence. For each valid base form, its score is determined from the TF-IDF matrix by selecting the highest value within the corresponding column (which represents a word). These scores are then added to a list for each sentence. Once a sentence is processed, the maximum or average score is calculated (“max” and “mean” in the results). This calculated value becomes the sentence’s score. The process is repeated for all sentences in every document.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.5.2. Centroid</title>
          <p>After obtaining the embedding for a sentence, its score is computed from the cosine similarity between the centroid C and the sentence embedding E_s:</p>
          <p>sim(E_s, C) = 1 − (E_s · C) / (||E_s|| × ||C||)</p>
          <p>By using the previously described approach, every text is converted into a list of ranked sentences, each with its own score.</p>
        </sec>
        <sec id="sec-4-3-3">
          <title>4.6. Random</title>
          <p>Because of the obtained results (see Section 4.7), we also added two configurations that use a random ordering of the sentences (one concatenated with the title, the other one containing only the randomly ordered text).</p>
        </sec>
        <sec id="sec-4-3-4">
          <title>4.7. Evaluation</title>
          <p>The evaluation of our experiments is performed by using the F1 score, macro-averaged so that each label has the same weight (this metric awards models that perform better on less-represented labels). Since we are dealing with a multi-label classification task, we have to choose between always considering the same number k of results (P@k, R@k, F1@k) or keeping only the labels whose confidence is higher than a particular threshold (usually between 0 and 1). In our experiments, we chose the second approach, since the number of concepts in each document of the dataset is not constant. Given the evaluation performed on the development set, we set that threshold to 0.5.</p>
          <p>5 https://huggingface.co/ · 6 http://opus.nlpl.eu/ · 7 https://traces1.inria.fr/oscar/ · 8 https://bit.ly/big-spanish-corpora · 9 https://spacy.io/</p>
        </sec>
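<p>The TF-IDF sentence scoring can be sketched as follows; this is a simplified re-implementation of the tf and idf formulas above with “max”/“mean” aggregation, whereas the actual pipeline uses scikit-learn’s TfidfVectorizer over content words:</p>

```python
import math

def tfidf_sentence_scores(docs, mode="max"):
    # docs: list of documents, each a list of sentences (lists of words).
    N = len(docs)
    # df: number of documents containing each term.
    df = {}
    for doc in docs:
        for t in set(w for s in doc for w in s):
            df[t] = df.get(t, 0) + 1
    all_scores = []
    for doc in docs:
        words = [w for s in doc for w in s]
        total = len(words)
        tf = {}
        for w in words:
            tf[w] = tf.get(w, 0) + 1
        # tf(t, d) = f_{t,d} / sum_t' f_{t',d}; idf(t, D) = log(N / (1 + df(t)))
        tfidf = {t: (c / total) * math.log(N / (1 + df[t])) for t, c in tf.items()}
        sent_scores = []
        for s in doc:
            vals = [tfidf[w] for w in s]
            # "max" keeps the highest word score; "mean" averages them.
            score = max(vals) if mode == "max" else sum(vals) / len(vals)
            sent_scores.append(score)
        all_scores.append(sent_scores)
    return all_scores
```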
      </sec>
      <sec id="sec-4-4">
        <title>4.8. Results</title>
      </sec>
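<p>The threshold-based evaluation described in Section 4.7 can be sketched as follows: labels whose confidence exceeds 0.5 are predicted, and the F1 is macro-averaged so that each label has the same weight. This is a simplified stand-in for a standard implementation such as scikit-learn’s f1_score:</p>

```python
def macro_f1(confidences, gold, labels, threshold=0.5):
    # confidences: one dict per document mapping each label to a confidence.
    # gold: one set of gold labels per document.
    f1s = []
    for lab in labels:
        tp = fp = fn = 0
        for conf, g in zip(confidences, gold):
            pred = conf.get(lab, 0.0) > threshold
            if pred and lab in g:
                tp += 1
            elif pred:
                fp += 1
            elif lab in g:
                fn += 1
        denom = 2 * tp + fp + fn
        # Per-label F1; macro-averaging gives each label the same weight.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```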
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
<p>Results show that the best performances are reached when the title is included in the text (see the rows without “not”), with the exception of the simple use of the text without reordering. An interesting outcome is that the experiment using title+random obtains very good results when compared to the best configurations.</p>
      <p>On the contrary, using the random text without the title, or using the sole title, results in a decrease in global performance.</p>
<p>In this approach, described in [27], the centroid of the word vectors in the text is calculated; then a score is assigned to each sentence based on its cosine distance from the centroid. The closer a sentence is to the centroid, the higher the score it receives. In our approach, we use fastText [28] for word embeddings.</p>
      <p>The words used to compute the centroid are those that have been extracted as content words (see Section 4.4) and have a TF-IDF higher than a certain threshold t, which in this case was 0.3. The centroid is computed as the mean of the word embeddings of the previously selected words:</p>
      <p>C = ( ∑_{w∈W} E[idx(w)] ) / |W|</p>
      <p>where W is the set of words with tfidf(w) &gt; t.</p>
      <p>Each sentence in the document is transformed into a single embedding representation by averaging the embedding vectors of the words in the sentence:</p>
      <p>E_{s_j} = ( ∑_{w∈s_j} E[idx(w)] ) / |s_j|</p>
      <p>where s_j is the j-th sentence in the document.</p>
      <p>[Table: results for the configurations basic, basic-not, centroid, centroid-not, title-only, tfidf-max-doc, tfidf-max-lab, tfidf-mean-doc, tfidf-mean-lab, tfidf-max-doc-not, tfidf-max-lab-not, tfidf-mean-doc-not, tfidf-mean-lab-not, random, random-not.]</p>
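<p>The centroid scoring can be sketched as below; tiny hand-written vectors stand in for the fastText [28] embeddings, and sentences are ranked by cosine similarity to the centroid (higher meaning closer):</p>

```python
import math

def centroid_scores(sentences, emb, tfidf, threshold=0.3):
    # Words used for the centroid: content words with TF-IDF above the threshold.
    selected = [w for s in sentences for w in s if tfidf.get(w, 0.0) > threshold]
    dim = len(next(iter(emb.values())))
    # Centroid: mean of the word embeddings of the selected words.
    centroid = [sum(emb[w][i] for w in selected) / len(selected) for i in range(dim)]
    scores = []
    for s in sentences:
        # Sentence embedding: mean of the word vectors in the sentence.
        vec = [sum(emb[w][i] for w in s) / len(s) for i in range(dim)]
        dot = sum(a * b for a, b in zip(vec, centroid))
        norm = math.sqrt(sum(a * a for a in vec)) * math.sqrt(sum(a * a for a in centroid))
        # Cosine similarity: sentences closer to the centroid score higher.
        scores.append(dot / norm)
    return scores
```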
<p>By looking at the statistical significance,11 we find that we can split, more or less, the experiments into two big groups: the ones that in the English part of the table have a DO F1 above 0.80, and the remaining ones that stay below 0.79. The exception is the “title-only” configuration, which obtains lower accuracy in all languages and contrasts with the results obtained in a similar previous work applied to Italian laws [3], where the use of the sole title results in an increase in performance with respect to the concatenation of title and text.</p>
      <p>By listing the documents where EuroVoc labels are not extracted correctly, it seems that in European legislation it is quite common to find very generic titles. For instance, the title of the document with ID “CELEX:32011Q0624(01)” is “Rules of procedure for the appeal committee (Regulation (EU) No 182/2011)”, from which it is very hard to extract relevant information about the topic. One can find other similar documents, such as “Action brought on 2 March 2011 — Attey v Council”, the title of the law with ID “CELEX:62011TN0118”.</p>
      <p>In general, our experiments show that the classification of European laws obtains the best performance on BERT when all the possible tokens are filled, possibly using the title and some parts of the text. The high accuracy obtained in the experiments performed by randomly reordering the sentences demonstrates that the context is important per se, even when no particular strategies are used to select it.</p>
      <p>French results bring significantly lower accuracy: this is not expected and is probably due to the choice of the BERT pre-trained model.</p>
      <p>11 To calculate statistical significance, a one-tailed t-test with a significance level of .05 was applied to the scores of the five runs, with the null hypothesis that no difference is observed, and the alternative hypothesis that the score obtained with the summarized text is significantly greater than the one obtained with the normal text.</p>
      <sec id="sec-6">
        <title>6. Release</title>
        <p>The source code for all the experiments (from the retrieval of the documents to the training of the models), the data downloaded from EUR-Lex, and the models are available on the project GitHub page.12</p>
      </sec>
      <sec id="sec-7">
        <title>7. Conclusions and Future Work</title>
        <p>In this paper, we presented some approaches to perform document classification on long documents, by reordering their sentences before the fine-tuning phase. The best results are obtained when all the 512 tokens allowed in the BERT paradigm are filled, possibly including the title of the law.</p>
        <p>In the future, we want to extend this approach to other languages, trying to understand whether the same reordering algorithm leads to some improvement in the classification task. We will also investigate other summarization approaches, or new architectures that rely on Local, Sparse, and Global attention [29], so that longer texts (up to 16K tokens) can be used to train the model.</p>
      </sec>
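<p>The one-tailed paired t-test of footnote 11 can be sketched as follows; the hand-rolled function is a stand-in for scipy.stats.ttest_rel, and the critical value 2.132 assumes the one-tailed .05 threshold for four degrees of freedom (five runs):</p>

```python
import math

def paired_one_tailed_t(summarized, baseline, t_crit=2.132):
    # Paired one-tailed t-test: is the score with the summarized text
    # significantly greater than the score with the normal text?
    diffs = [a - b for a, b in zip(summarized, baseline)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    t = mean / math.sqrt(var / n)
    # Reject the null hypothesis (no difference) when t exceeds the
    # critical value for n-1 degrees of freedom.
    return t, t > t_crit
```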
<p>12 https://github.com/bocchilorenzo/AutoEuroVoc</p>
    </sec>
  </body>
  <back>
    <ref-list>
<ref id="ref1">
        <mixed-citation>[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, 2018. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[21] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, in: Proceedings of the Natural Legal Language Processing Workshop 2019, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 78-87. URL: https://aclanthology.org/W19-2209. doi:10.18653/v1/W19-2209.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[22] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898-2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[23] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[24] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[25] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, B. Sagot, CamemBERT: a tasty French language model, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[27] G. Rossiello, P. Basile, G. Semeraro, Centroid-based text summarization through compositionality of word embeddings, in: Proceedings of the MultiLing 2017 workshop on summarization and summary evaluation across source types and genres, 2017, pp. 12-21.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[28] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135-146.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[29] C. Condevaux, S. Harispe, LSG attention: Extrapolation of pretrained transformers to long sequences, in: Advances in Knowledge Discovery and Data Mining, Springer, 2023, pp. 443-454.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>