<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>the official journal of records of the Italian government</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>KEVLAR: the Complete Resource for EuroVoc Classification of Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Bocchi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilla Casula</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Palmero Aprosio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>The use of Machine Learning and Artificial Intelligence in the Public Administration (PA) has increased in recent years. In particular, recent guidelines proposed by various governments for the classification of documents released by the PA suggest using the EuroVoc thesaurus. In this paper, we present KEVLAR, an all-in-one solution for performing the above-mentioned task on acts belonging to the Public Administration. First, we create a collection of 8 million documents in 24 languages, tagged with EuroVoc labels and taken from EUR-Lex, the web portal of European Union legislation. Then, we train different pre-trained BERT-based models, comparing the performance of base models with domain-specific and multilingual ones. We release the corpus, the best-performing models, and a Docker image containing the source code of the trainer, the REST API, and the web interface. This image can be employed out-of-the-box for document classification.</p>
      </abstract>
      <kwd-group>
        <kwd>EuroVoc taxonomy</kwd>
        <kwd>multilingual text classification</kwd>
        <kwd>BERT</kwd>
        <kwd>web interface</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>EuroVoc is a multilingual and multidisciplinary thesaurus
that has seen a significant rise in its use and importance
in recent years. In particular, the taxonomy used in this
thesaurus has become crucial for a number of activities
of European Public Administrations, shaping the way
information is organized, disseminated, and accessed.</p>
      <p>Containing over 7,000 concepts, EuroVoc acts as a
reliable and efficient indexing system for a vast range of
documents, legislative texts, and reports. For this reason, a
growing number of governmental institutions around
Europe have begun to use it internally for document
categorization.</p>
      <p>
        The Spanish government, for instance, has recommended
the adoption of EuroVoc since 2014 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and has more
recently started using it regularly in its official open
data portal,1 and in the Portal de la Administración
Electrónica website.2 Similarly, the German and French public
administrations are following the same strategy, in the
DCAT-AP.de3 and data.gouv.fr4 portals respectively.
      </p>
      <p>
        Furthermore, Rovera et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] presented a preliminary
study on the classification of Italian legislative texts.
The contributions of this paper are the following:
1. First, we release a collection of more than 8
million documents from EUR-Lex, the European
Union’s official web portal, which gives
comprehensive access to EU legal documents, spanning
more than 70 years of EU legislation (1948-2022)
and covering 24 languages. Over half of these
texts are already tagged with the corresponding
EuroVoc concepts.
2. Secondly, we perform a series of experiments for
automatic tagging of the documents using the
EuroVoc taxonomy, comparing different approaches
and language models.
3. Finally, we develop a web interface (see Figure 1)
and a REST API that anyone (citizen or public
administration) could use both to easily try
automatic classification of documents and to integrate
such categorization in any systems that might
need it.
      </p>
      <p>The models used for the web demo and the release
are the best-performing ones we found, as described in
Section 5. All the data and tools (the set of documents
labeled with EuroVoc labels, the models, and the demo
code) are freely available for download.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Several investigations have delved into the categorization
of European legislation using EuroVoc labels. Notably,
the task can be regarded as Extreme Multilabel
Classification, as recognized in Liu et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
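Framed as Extreme Multilabel Classification, each document receives an independent score for every one of the thousands of EuroVoc concepts, and all labels above a threshold are kept. A minimal sketch of that prediction step (the label names and logit values below are illustrative, not taken from the paper):

```python
import math

def predict_labels(logits, threshold=0.5):
    """Multi-label prediction: an independent sigmoid per label,
    keeping every label whose probability exceeds the threshold."""
    probs = {label: 1 / (1 + math.exp(-z)) for label, z in logits.items()}
    return sorted(label for label, p in probs.items() if p > threshold)

# Illustrative logits for three EuroVoc-style concepts.
scores = {"data protection": 2.1, "fishery": -3.0, "confidentiality": 0.8}
print(predict_labels(scores))  # ['confidentiality', 'data protection']
```

Unlike single-label classification, no softmax competition is imposed between concepts: a document may legitimately carry several EuroVoc descriptors at once.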
      <p>
        The JRC EuroVoc Indexer, detailed in Steinberger et al.
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], stands as a tool facilitating document categorization
through EuroVoc classifiers across 22 languages.
However, the dataset used for this tool [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is limited to documents up to 2006. Their method entails the creation
of lemma frequencies and associated weights, linked to
specific descriptors referred to as associates or topic
signatures in the research. When classifying a new document,
the algorithm selects descriptors from the topic
signatures exhibiting the highest resemblance to the lemma
frequency list of the new document.
      </p>
      <p>
        Later, You et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explored the application of Recurrent Neural Networks (RNNs)
to extreme multi-label classification datasets, encompassing RCV1 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Amazon-13K
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Wiki-30K, Wiki-500K [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and an older EUR-Lex dataset from 2007 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Attention-based RNNs proved to
be particularly effective, outperforming other methods
in 4 out of 5 datasets.
      </p>
      <p>
        Chalkidis et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] explored diverse deep learning
architectures for this task. Among these, a fine-tuned
BERT-base model [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] showed the highest performance,
achieving a micro-averaged F1 score of 0.732 (considering
all labels). Furthermore, they released a dataset consisting
of 57,000 tagged documents from EUR-Lex.5
5https://bit.ly/eurlex57k
      </p>
      <p>One of the most complete contributions to document
classification using EuroVoc is PyEuroVoc, outlined in
Avram et al. [15]. This study employs various pre-trained
BERT models in 22 different languages, which were
fine-tuned for the task. The source code in Python is publicly
released, but it cannot be used out-of-the-box, and a known
bug6 may have led to unreliable results.
6https://bit.ly/pyeurovoc-bug</p>
      <p>Some similar recent works on multi-language
classification are described in Chalkidis et al. [16], Shaheen
et al. [17], and Wang et al. [18]. Outside of the EuroVoc
ecosystem, two large-sized legal datasets were released
by Niklaus et al. [19, 20] for language model creation.</p>
      <p>3. Dataset description</p>
      <p>3.1. EUR-Lex</p>
      <p>The reference for European legislation is EUR-Lex,7 a web
portal that grants users comprehensive access to EU legal
documents. It is available in all of the European Union’s
24 official languages and is updated daily by its
Publications Office. Most of the documents present in EUR-Lex
are manually categorized using EuroVoc concepts.
7https://eur-lex.europa.eu/</p>
      <p>3.2. EuroVoc</p>
      <p>EuroVoc’s hierarchical structure is organized into three
different layers: Thesaurus Concept (TC), Micro
Thesaurus (MT, previously referred to as “sub-sector” level),
and Domain (DO, previously referred to as “main
sector” level). The TC level is the base level, where all the
key concepts are found. The documents on EUR-Lex are
tagged with labels from this level. Every TC is assigned
to an MT, which in turn is part of a specific DO. For
example, the label “Confidentiality”8 is assigned to the
MT “Information and information processing”, which
belongs to the DO concept “Education and communication”.
Figure 2 shows a small subset of the EuroVoc taxonomy.</p>
      <p>The experiments of this work have been launched on
version 4.17 of EuroVoc, which contains 7,382 TCs, 127 MTs,
and 21 DOs.</p>
    </sec>
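The three-layer hierarchy can be modeled as a simple mapping from each TC to its MT and from each MT to its DO. A minimal sketch, seeded with the "Confidentiality" example from the text (the dictionaries would in practice hold all 7,382 TCs and 127 MTs):

```python
# TC -> MT and MT -> DO mappings, using the example from the text.
TC_TO_MT = {"Confidentiality": "Information and information processing"}
MT_TO_DO = {"Information and information processing": "Education and communication"}

def resolve(tc):
    """Return the full (TC, MT, DO) chain for a Thesaurus Concept."""
    mt = TC_TO_MT[tc]
    return tc, mt, MT_TO_DO[mt]

print(resolve("Confidentiality"))
```

Since documents are only tagged at the TC level, MT- and DO-level labels can always be derived by this kind of upward traversal.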
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <p>3.3. Dataset collection</p>
      <p>KEVLAR was collected by downloading the documents
from EUR-Lex. We built a set of tools written in Python
that can be customized to obtain different subsets of the
data (year, language, etc.).</p>
      <p>In total, 8,368,328 documents were collected in 24
languages, 5,158,438 of which are annotated with EuroVoc
descriptors, for a total of 32,021,783 tags. On average, 6.2
tags are associated with each document.</p>
      <p>After filtering out these documents,9 around 1.1 million
texts with EuroVoc labels are collected.</p>
      <p>Figure 3 shows the number of documents per year in
English. The blue bars show the total number of documents
retrieved for the year, while the orange bars show
the number of documents that were labelled and have
full text. The reduction is quite significant, especially
before the year 2000.
8http://eurovoc.europa.eu/92
9Laws without any EuroVoc concept associated are not useful for
our study. Regarding documents available in PDF format only, one
could extract the text from them using OCR: this could be done in
future work.
10https://bit.ly/eurovoc-handbook</p>
      <p>In this section we provide a detailed account of the
experiments conducted on document classification with
respect to the EuroVoc taxonomy.</p>
      <p>4.1. Deprecated labels and label frequency</p>
      <p>The EuroVoc thesaurus was initially developed in the
1980s and has constantly been updated and revised. Some
labels started being used much earlier than others, and
some are even deprecated for modern use but are still
present in older documents.10 This means that certain
topics could stop being used in the future, potentially
resulting in concepts being replaced or merged with other
existing concepts in future releases of EuroVoc.</p>
      <p>Figure 4 shows the total occurrences of deprecated
labels on a yearly basis. The result shows that from
2010 the usage of these labels decreased dramatically
compared to the previous decade.</p>
      <p>In addition to this, there is a strong imbalance in the
assignment of EuroVoc labels. For example, the most
frequent label in the Italian documents, “economic
concentration” with ID 69, is used more than 13,000 times,
while the least frequent ones were assigned to just one
document.</p>
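The label imbalance described above can be measured directly from the per-document tag lists. A minimal sketch with `collections.Counter` (the toy corpus below is illustrative, not the actual KEVLAR data):

```python
from collections import Counter

# Toy corpus: each document carries its list of EuroVoc tags.
documents = [
    ["economic concentration"],
    ["economic concentration", "fishery"],
    ["economic concentration", "data protection"],
]

# Count how often each label is assigned across the whole corpus.
freq = Counter(tag for doc in documents for tag in doc)
print(freq.most_common(1))  # [('economic concentration', 3)]
```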
      <p>Each partition into train/dev/test is done using
Iterative Stratification [27, 28], in order to preserve the
concept balance.</p>
      <p>Unless differently specified, all the results in the rest
of the paper refer to the average of the values obtained
by our experiments on the three seeds.</p>
      <p>4.4. Training</p>
      <p>To keep our experiments consistent with previous similar
approaches (e.g. Avram et al. [15]), we split the data into
train, dev, and test sets with an approximate ratio of
80/10/10, respectively.</p>
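A seeded shuffle makes such a partition reproducible. A minimal sketch of one 80/10/10 cut (plain random splitting for brevity; the paper itself uses iterative stratification to preserve label balance):

```python
import random

def split(ids, seed):
    """Shuffle document ids with a fixed seed, then cut 80/10/10."""
    rng = random.Random(seed)   # pseudorandom generator seeded explicitly
    ids = list(ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train, n_dev = int(n * 0.8), int(n * 0.1)
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

train, dev, test = split(range(100), seed=42)
print(len(train), len(dev), len(test))  # 80 10 10
```

Re-running with the same seed reproduces the identical split; repeating with three seeds, as in the paper, averages out a single (un)lucky extraction.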
      <p>In order to make the training reproducible and to avoid
a single random extraction that could be too (un)lucky, we
repeat the split using three different seeds and a
pseudorandom number generator.
11joelniklaus/legal-swiss-roberta-large</p>
      <p>Base and legal models used for each language:
en: bert-base-uncased (base), nlpaueb/legal-bert-base-uncased (legal)
fr: flaubert/flaubert_base_uncased (base), joelniklaus/legal-french-roberta-base (legal)
it: dbmdz/bert-base-italian-cased (base), dlicari/Italian-Legal-BERT (legal)
es: dccuchile/bert-base-spanish-wwm-cased (base), joelniklaus/legal-spanish-roberta-base (legal)
de: bert-base-german-cased (base), joelniklaus/legal-german-roberta-base (legal)</p>
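The identifiers above can be organized as a per-language registry so the right checkpoint is picked at run time. A sketch mirroring the table (actually loading a checkpoint would then go through a library such as Hugging Face transformers, which is not shown here):

```python
# Per-language model registry mirroring the table above.
MODELS = {
    "en": {"base": "bert-base-uncased", "legal": "nlpaueb/legal-bert-base-uncased"},
    "fr": {"base": "flaubert/flaubert_base_uncased", "legal": "joelniklaus/legal-french-roberta-base"},
    "it": {"base": "dbmdz/bert-base-italian-cased", "legal": "dlicari/Italian-Legal-BERT"},
    "es": {"base": "dccuchile/bert-base-spanish-wwm-cased", "legal": "joelniklaus/legal-spanish-roberta-base"},
    "de": {"base": "bert-base-german-cased", "legal": "joelniklaus/legal-german-roberta-base"},
}

def checkpoint(lang, flavor="base"):
    """Return the checkpoint identifier for a language/flavor pair."""
    return MODELS[lang][flavor]

print(checkpoint("it", "legal"))  # dlicari/Italian-Legal-BERT
```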
    </sec>
    <sec id="sec-4">
      <title>5. Discussion</title>
    </sec>
    <sec id="sec-5">
      <title>6. Release and demo</title>
      <p>All the data12 and models13 described in this paper are
available for download under the CC BY 4.0 license.</p>
      <p>In addition to the documents, we also release on
GitHub the code used to train and evaluate the models.14</p>
      <p>Given that one of the main objectives of our research
is to offer a comprehensive solution for aiding public
administrations in document classification, we have also
shared the source code for a REST API and a
demonstration interface (see Figure 1), alongside a Docker
image for effortless deployment.</p>
      <p>While the training phase requires GPUs for optimal
performance, the models discussed in this article –
accessible through package installation via Docker – can
be utilized efficiently with CPU processing. Upon tool
installation, users have the flexibility to select the
desired languages, allowing only the necessary models to be
downloaded and loaded into memory.
12https://bit.ly/kevlar-2024
13https://dh.fbk.eu/software/kevlar-models
14https://github.com/dhfbk/kevlar</p>
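The select-only-needed-languages behavior can be sketched as a lazy registry that loads a model the first time its language is requested (the class and loader below are hypothetical illustrations, not the actual KEVLAR code):

```python
class LazyModels:
    """Load a model only the first time its language is requested."""

    def __init__(self, loader, languages):
        self._loader = loader           # callable: lang -> model object
        self._allowed = set(languages)  # languages selected at install time
        self._cache = {}                # models already resident in memory

    def get(self, lang):
        if lang not in self._allowed:
            raise KeyError(f"language {lang!r} was not selected")
        if lang not in self._cache:     # first use: download/load, then keep
            self._cache[lang] = self._loader(lang)
        return self._cache[lang]

# Toy loader standing in for an actual model download.
models = LazyModels(loader=lambda lang: f"model-{lang}", languages=["en", "it"])
print(models.get("it"))  # model-it
```

Only the selected languages ever occupy memory, which matches the CPU-friendly deployment described above.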
      <p>A running instance of the API and the web demo is
available for testing purposes.15</p>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusions and Future Work</title>
      <p>In this paper, we release KEVLAR, an all-in-one solution
for performing the document classification task on acts
belonging to the Public Administration. We collected
more than 8 million documents in 24 languages,
compared different BERT- and RoBERTa-based models on the
classification of documents with respect to the EuroVoc
taxonomy, and built an out-of-the-box tool for easily
applying the classification to any text.</p>
      <p>In the future, we will continue the exploration of novel
methods to address this task with potentially better
performance, for example using better-performing models
or exploiting generation-based solutions.
15https://dh-server.fbk.eu/kevlar-ui/</p>
      <p>Results per language and model type (base, legal, legal-ml); each row reports three scores:
en (base) 0,455 0,714 0,800
en (legal) 0,484 0,729 0,812
en (legal-ml) 0,544 0,769 0,842
it (base) 0,450 0,709 0,798
it (legal) 0,330 0,619 0,736
it (legal-ml) 0,487 0,735 0,818
fr (base) 0,529 0,750 0,827
fr (legal) 0,461 0,719 0,808
fr (legal-ml) 0,495 0,737 0,822
de (base) 0,435 0,689 0,786
de (legal) 0,371 0,656 0,766
de (legal-ml) 0,514 0,738 0,823
es (base) 0,485 0,730 0,812
es (legal) 0,408 0,686 0,783
es (legal-ml) 0,523 0,754 0,830
nl (legal-ml) 0,400 0,669 0,774
cs (legal-ml) 0,406 0,675 0,778
da (legal-ml) 0,359 0,633 0,746
et (legal-ml) 0,413 0,677 0,775
fi (legal-ml) 0,412 0,672 0,772
pt (legal-ml) 0,385 0,662 0,769
hu (legal-ml) 0,438 0,695 0,792
lt (legal-ml) 0,302 0,608 0,732
sv (legal-ml) 0,429 0,684 0,783
bg (legal-ml) 0,399 0,669 0,771
el (legal-ml) 0,414 0,680 0,782
ga (legal-ml) 0,213 0,298 0,494
hr (legal-ml) 0,386 0,660 0,770
lv (legal-ml) 0,299 0,600 0,727
mt (legal-ml) 0,371 0,646 0,756
pl (legal-ml) 0,434 0,688 0,786
ro (legal-ml) 0,417 0,680 0,781
sk (legal-ml) 0,390 0,665 0,770
sl (legal-ml) 0,391 0,663 0,768</p>
      <p>[15] A. Avram, V. F. Pais, D. Tufis, PyEuroVoc: A tool for multilingual legal document classification with EuroVoc descriptors, CoRR abs/2108.01139 (2021). URL: https://arxiv.org/abs/2108.01139. arXiv:2108.01139.
[16] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, MultiEURLEX - a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 6974–6996. URL: https://aclanthology.org/2021.emnlp-main.559. doi:10.18653/v1/2021.emnlp-main.559.
[17] Z. Shaheen, G. Wohlgenannt, E. Filtz, Large scale legal text classification using transformer models, 2020. arXiv:2010.12871.
[18] L. Wang, Y. W. Teh, M. A. Al-Garadi, Adopting the multi-answer questioning task with an auxiliary metric for extreme multi-label text classification utilizing the label hierarchy, 2023. arXiv:2303.01064.
[19] J. Niklaus, V. Matoshi, M. Stürmer, I. Chalkidis, D. E. Ho, MultiLegalPile: A 689GB multilingual legal corpus, 2023. arXiv:2306.02069.
[20] J. Niklaus, V. Matoshi, P. Rani, A. Galassi, M. Stürmer, I. Chalkidis, LEXTREME: A multi-lingual and multi-task benchmark for the legal domain, 2023. arXiv:2301.13126.
[21] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier, D. Schwab, FlauBERT: Unsupervised language model pre-training for French, in: Proceedings of The 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 2479–2490. URL: https://www.aclweb.org/anthology/2020.lrec-1.302.
[22] S. Schweter, Italian BERT and ELECTRA models, 2020. URL: https://doi.org/10.5281/zenodo.4263142. doi:10.5281/zenodo.4263142.
[23] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[24] B. Chan, S. Schweter, T. Möller, German's next language model, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 6788–6796. URL: https://aclanthology.org/2020.coling-main.598. doi:10.18653/v1/2020.coling-main.598.
[25] D. Licari, G. Comandè, ITALIAN-LEGAL-BERT: A pre-trained transformer language model for Italian law, in: D. Symeonidou, R. Yu, D. Ceolin, M. Poveda-Villalón, D. Audrito, L. D. Caro, F. Grasso, R. Nai, E. Sulis, F. J. Ekaputra, O. Kutz, N. Troquard (Eds.), Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, volume 3256 of CEUR Workshop Proceedings, CEUR, Bozen-Bolzano, Italy, 2022. URL: https://ceur-ws.org/Vol-3256/#km4law3. ISSN: 1613-0073.
[26] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.
[27] K. Sechidis, G. Tsoumakas, I. Vlahavas, On the stratification of multi-label data, Machine Learning and Knowledge Discovery in Databases (2011) 145–158.
[28] P. Szymański, T. Kajdanowicz, A network perspective on stratification of multi-label data, in: L. Torgo, B. Krawczyk, P. Branco, N. Moniz (Eds.), Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, volume 74 of Proceedings of Machine Learning Research, PMLR, ECML-PKDD, Skopje, Macedonia, 2017, pp. 22–35.
[29] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[30] I. Chalkidis, A. Jana, D. Hartung, M. Bommarito, I. Androutsopoulos, D. Katz, N. Aletras, LexGLUE: A benchmark dataset for legal language understanding in English, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 4310–4330. URL: https://aclanthology.org/2022.acl-long.297. doi:10.18653/v1/2022.acl-long.297.
[31] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).
[32] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al., Big Bird: Transformers for longer sequences, Advances in Neural Information Processing Systems 33 (2020) 17283–17297.
[33] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, in: Proceedings of the Natural Legal Language Processing Workshop 2019, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 78–87. URL: https://aclanthology.org/W19-2209. doi:10.18653/v1/W19-2209.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.-J.</given-names>
            <surname>Martínez-Méndez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>López-Carreño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-A.</given-names>
            <surname>Pastor-Sánchez</surname>
          </string-name>
          ,
          <article-title>Open data en las administraciones públicas españolas: categorías temáticas y apps</article-title>
          ,
          <source>Profesional de la información 23</source>
          (
          <year>2014</year>
          )
          <fpage>415</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rovera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Aprosio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Greco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lucchese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tonelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Antetomaso</surname>
          </string-name>
          ,
          <article-title>Italian legislative text classification for Gazzetta Ufficiale</article-title>
          , AI per la Pubblica Amministrazione workshop at Ital-IA (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Prekpalaj</surname>
          </string-name>
          ,
          <article-title>The role of key words and the use of the multilingual eurovoc thesaurus when searching for legal regulations of the republic of croatia - research results</article-title>
          ,
          <source>in: 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>1470</fpage>
          -
          <lpage>1475</lpage>
          . doi:10.23919/MIPRO52101.2021.9597043.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Caled</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Won</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <article-title>A hierarchical label network for multi-label eurovoc classification of legislative contents, in: Digital Libraries for Open Knowledge: 23rd Interna-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Deep learning for extreme multi-label text classification</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '17, Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>124</lpage>
          . URL: https://doi.org/10.1145/3077136.3080834. doi:10.1145/3077136.3080834.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Steinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebrahim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchi</surname>
          </string-name>
          ,
          <article-title>Jrc eurovoc indexer jex-a freely available multi-label categorisation tool</article-title>
          , arXiv preprint arXiv:1309.5223 (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Steinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pouliquen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Widiger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ignat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tufiş</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Varga</surname>
          </string-name>
          ,
          <article-title>The JRCAcquis: A multilingual aligned parallel corpus with 20+ languages</article-title>
          ,
          <source>in: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          , Genoa, Italy,
          <year>2006</year>
          . URL: http://www.lrec-conf.org/proceedings/ lrec2006/pdf/340_pdf.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mamitsuka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Attentionxml: Label tree-based attentionaware deep model for high-performance extreme multi-label text classification</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Rcv1: A new benchmark collection for text categorization research</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>5</volume>
          (
          <year>2004</year>
          )
          <fpage>361</fpage>
          -
          <lpage>397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <article-title>Hidden factors and hidden topics: Understanding rating dimensions with review text</article-title>
          ,
          <source>in: Proceedings of the 7th ACM Conference on Recommender Systems</source>
          , RecSys '13,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2013</year>
          , p.
          <fpage>165</fpage>
          -
          <lpage>172</lpage>
          . URL: https://doi.org/10.1145/2507157.2507163. doi:10.1145/2507157.2507163.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Enhancing navigation on wikipedia with social tags</article-title>
          ,
          <source>arXiv preprint arXiv:1202.5469</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Loza Mencía</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          ,
          <article-title>Eficient multilabel classification algorithms for large-scale problems in the legal domain</article-title>
          ,
          <year>2010</year>
          . URL: http://dx.doi.org/10.1007/978-3-642-12837-0_11. doi:10.1007/978-3-642-12837-0_11.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>Large-scale multi-label text classification on EU legislation</article-title>
          , arXiv preprint arXiv:1906.02192 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>