Extreme Classification of European Union Law Documents driven by Entity Embeddings

Irene Benedetto1,2,*, Luca Cagliero1 and Francesco Tarasconi2

1 Politecnico di Torino, Dipartimento di Automatica e Informatica, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
2 MAIZE, Via San Quintino 31, 10121 Torino, Italy

Abstract
Extreme Multi-label Classification (XMC) is the task of labeling documents with one or more labels from a large set of classes. In the context of Legal Artificial Intelligence, XMC is relevant to the automatic categorization of documents, as they commonly address several orthogonal categorization schemes. Since retrieving a sufficient number of training document examples per class is challenging, XMC models are expected to be particularly effective in zero-shot learning scenarios. Existing approaches rely on transformer-based classification models, which leverage the attention mechanism to attend to specific textual units. However, classical attention scores are unable to differentiate between domain-specific and generic textual units. In this paper, we propose a legal entity-aware approach to zero-shot XMC of European Union law documents. By integrating information about domain-specific legal entities, we ease the detection of label-sensitive information and prevent XMC models from attending to irrelevant or wrong text spans. The results achieved on the law documents available in the EURLex benchmark show that our approach is superior to both previous transformer-based approaches and open-source Large Language Models.

Keywords
Legal Artificial Intelligence, Extreme Multi-label Classification, Language Models, Law Documents

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy
* Corresponding author.
irene.benedetto@{polito.it,maize.io} (I. Benedetto); luca.cagliero@polito.it (L. Cagliero); francesco.tarasconi@maize.io (F. Tarasconi)
ORCID: 0000-0001-7086-7898 (I. Benedetto); 0000-0002-7185-5247 (L. Cagliero)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

The task of eXtreme Multi-label Classification (XMC) aims at assigning to a given text one or more pertinent labels shortlisted from a very large set of classes. Since some of the target classes are likely to be underrepresented or even absent in the training data, classifiers used for XMC are expected to be particularly effective in zero-shot learning scenarios [1, 2].

Transformer-based architectures have proven to be particularly effective in tackling XMC [1] in various application domains such as e-commerce [3], medical diagnosis [4], and legal AI [5]. This paper focuses on solving the XMC task in a particular legal sub-domain, i.e., the automatic classification of law documents.

Legal documents such as laws have peculiar characteristics that make the classification task inherently complex. Firstly, the vocabulary used is very technical and rich in domain-specific expressions and entities [6]. Secondly, legal documents typically have a peculiar structure that makes content retrieval and ranking particularly challenging [7]. Lastly, the contained text is often verbose, as it usually contains many preliminaries and repetitions [8].

Benchmark datasets for law classification such as EURLex [5, 9] contain acts and proposals of the European legislation. To support their retrieval and exploration, law documents are often annotated by Publication Offices with a very large number of labels (e.g., 4,271 labels in EURLex), which encompass frequent labels as well as few- and zero-shot ones. Therefore, automating the process of law document classification requires accurate XMC models.

In this paper we aim at overcoming the main limitations of existing transformer-based approaches to law document classification (e.g., [10]), which leverage the attention mechanism to attend to the most salient textual units. Since attention scores do not differentiate between legal and general-purpose textual units, the ability of transformers to correctly assign law document categories can be limited, particularly in zero-shot learning contexts. To overcome this issue, we propose to adopt an entity-aware attention mechanism based on the LUKE transformer [11], which exploits the semantic characteristics of the domain by means of entity embeddings to enhance zero-shot classification. The key idea is to mainly consider the textual dependencies with the tokens associated with entities, as they are most likely to be discriminating in law document classification.

The experiments carried out on the EURLex benchmark dataset [9] confirm the effectiveness of entity embeddings in enhancing zero-shot XMC performance. Notably, the proposed approach not only performs better than existing transformer-based methods but also turns out to be more effective than an open-source Large Language Model with a larger number of parameters, i.e., Llama 2 7B [12].

The remainder of this work is organized as follows. Section 2 reviews the existing literature, Section 3 describes the methodology, Section 4 presents the main experimental results, whereas Section 5 draws the conclusions of this work.

2. Related work

Legal document classification. The most common case of document classification in the legal domain is the automatic categorization of court cases, where the goal is to predict the law area of the given case. Existing related works mainly focused on employing machine learning and deep learning solutions [13, 14, 15, 16]. Parallel studies have delved into the automatic text classification of legislation to discern the law topic, with a particular emphasis on monolingual datasets [10, 17, 18, 19, 20, 21, 22]. A more limited body of work has explored multi-lingual datasets of legislations [9]. Specifically, the work presented in [21] investigates the semantic relationship between each document and its labels. However, its performance on English documents is limited. Conversely, the transformer-based approaches proposed in [9, 10, 18] are, to the best of our knowledge, state-of-the-art on English-written law documents. Unlike [9, 10, 18], our work focuses on leveraging entity information in law classification. To the best of our knowledge, the idea of boosting the performance of transformer-based approaches to law document classification using entity embeddings has not been addressed in the literature so far.
Transformers in Legal Artificial Intelligence. Transformer-based models have demonstrated promising results in several areas of legal AI. Specifically, pre-trained language models have proved effective in tackling various downstream tasks [18, 23, 24], which encompass legal entity recognition [6], legal question answering [7], and legal document summarization [8]. Language Models have been designed and fine-tuned for the legal domain as well, mainly on Chinese documents. For example, LaWGPT [25] is pre-trained on a large-scale Chinese legal text database. Lawyer LLaMA [26] is a Chinese Legal Large Language Model (LLM) trained on a substantial legal dataset; it is capable of offering legal advice, analyzing legal cases, and generating legal articles. ChatLaw [27] comprises a collection of open-source legal LLMs in Chinese, including models like ChatLaw-13B and ChatLaw-33B, trained on a vast dataset encompassing legal news, forums, and judicial interpretations. Existing legal LLMs are suited to Chinese documents only and are not specifically designed to tackle the eXtreme Multi-label Classification task.

3. Methodology

In this section, we describe the proposed methodology for eXtreme Multi-label Classification (XMC) of law documents. Our purpose is to tackle XMC in a zero-shot setting, i.e., in the absence of ad hoc training examples. To address this issue, we propose to recognize and use entity embeddings in the document text. Specifically, we leverage the pre-trained LUKE model [11] for the classification task by replacing the original classification layer with one trained from scratch on the benchmark dataset. LUKE is a pre-trained contextualized representation of words and entities based on the transformer architecture. It produces contextualized representations of both words and entities thanks to the entity-aware self-attention mechanism, an extension of the self-attention mechanism that distinguishes token types when computing attention scores.

Given a sequence of input vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$, where $\mathbf{x}_i \in \mathbb{R}^D$, the attention score $e_{ij}$ is computed as follows:

$$
e_{ij} =
\begin{cases}
\mathbf{K}\mathbf{x}_j^{\top}\,\mathbf{Q}\mathbf{x}_i, & \text{if both } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are words}\\
\mathbf{K}\mathbf{x}_j^{\top}\,\mathbf{Q}_{w2e}\mathbf{x}_i, & \text{if } \mathbf{x}_i \text{ is a word and } \mathbf{x}_j \text{ is an entity}\\
\mathbf{K}\mathbf{x}_j^{\top}\,\mathbf{Q}_{e2w}\mathbf{x}_i, & \text{if } \mathbf{x}_i \text{ is an entity and } \mathbf{x}_j \text{ is a word}\\
\mathbf{K}\mathbf{x}_j^{\top}\,\mathbf{Q}_{e2e}\mathbf{x}_i, & \text{if both } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are entities}
\end{cases}
$$

where $\mathbf{Q}_{w2e}, \mathbf{Q}_{e2w}, \mathbf{Q}_{e2e} \in \mathbb{R}^{L \times D}$ are query matrices and $\mathbf{K} \in \mathbb{R}^{L \times D}$ is the key matrix.
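To make the four-way score computation concrete, the following minimal NumPy sketch (our illustration, not the authors' implementation) applies the type-dependent query matrices of the equation above for a single attention head, omitting the scaling and softmax steps; `X`, `is_entity`, and the parameter matrices are assumed inputs.

```python
import numpy as np

def entity_aware_attention_scores(X, is_entity, K, Q, Q_w2e, Q_e2w, Q_e2e):
    """Raw attention scores e_ij with token-type-dependent query matrices.

    X:         (n, D) input vectors x_1..x_n
    is_entity: (n,) boolean mask, True where x_i is an entity token
    K:         (L, D) shared key matrix
    Q*:        (L, D) query matrices for the four token-type pairs
    """
    n = X.shape[0]
    keys = X @ K.T                 # (n, L): K x_j for every position j
    scores = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if not is_entity[i] and not is_entity[j]:
                Q_sel = Q          # word  -> word
            elif not is_entity[i] and is_entity[j]:
                Q_sel = Q_w2e      # x_i word, x_j entity
            elif is_entity[i] and not is_entity[j]:
                Q_sel = Q_e2w      # x_i entity, x_j word
            else:
                Q_sel = Q_e2e      # both entities
            scores[i, j] = keys[j] @ (Q_sel @ X[i])   # (K x_j)^T Q x_i
    return scores
```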
4. Experiments

Dataset. In our experiments, we consider the English portion of the EURLEX dataset [9], a multi-label legal document classification dataset. It consists of 65k European Union (EU) laws annotated with EUROVOC taxonomy labels. The EUROVOC taxonomy is a multilingual classification and thesaurus system used by the European Union; it is designed to organize and categorize the concepts and terms used in official EU documents, facilitating research and access to information. Each European act in the EURLEX dataset is associated with one or more EUROVOC concepts. Similar to [9], we focus on third-level labels. For training and testing our models, we follow the dataset split provided by the respective authors.

Competitors. We compare our methodology with:

• Logistic Regression: a baseline consisting of a Term Frequency-Inverse Document Frequency (TF-IDF) encoder, which accounts for both local and global frequencies of occurrence of the input tokens, and a logistic regression model trained on top of the encoded text (a minimal sketch is given at the end of this section).
• RoBERTa [9]: builds on BERT [28], removing the next-sentence pre-training objective and training with much larger mini-batches and learning rates.
• LLama 2 7B [12]: a pre-trained Large Language Model with approximately 7 billion parameters that showcases remarkable performance in both few-shot and zero-shot scenarios. Analogously to [29], to compare with LLMs we treated the XMC task as a generative problem.

Experimental setting. We fine-tuned the base version of the LUKE model (studio-ousia/luke-base) for 10 epochs. The model was trained with the AdamW optimizer [30] with a weight decay of 0.01 and a learning rate of 1e-5. During training, we applied a dropout probability of 0.1 on the classification layer.
For the sake of fairness, LLama 2 7B was trained with Parameter-Efficient Fine-Tuning (PEFT) [31] via LoRA [32], which freezes the pre-trained model weights and introduces trainable rank-decomposition matrices into each layer of the model architecture. We trained the 8-bit quantized version of this model for a maximum of 3 epochs, with a learning rate of 1.4e-5, LoRA α = 16, and r = 64 (see the sketches below).

Metrics. Here we describe the metrics used to evaluate the performance of the models in our study.

• R@5 and P@5: precision and recall at k predictions, with k equal to 5 in our dataset; this corresponds to the mean number of labels per document in the training set:

$$\mathrm{Precision@}k = \frac{TP_k}{TP_k + FP_k} \qquad \mathrm{Recall@}k = \frac{TP_k}{TP_k + FN_k}$$

• mRP: for each document, the metric ranks the labels selected by the model by decreasing confidence, computes Precision@k, where k is the document's number of gold labels, and then averages the results over documents.

Hardware. We conducted all the experiments on a single NVidia® Tesla® V100 GPU with 16 GB of memory, running on Ubuntu 22.04 LTS.
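For illustration, the Logistic Regression baseline described above could be reproduced with a scikit-learn pipeline along these lines; the hyper-parameters shown (e.g., `min_df`) and the `train_texts`/`train_labels`/`test_texts` variables are illustrative assumptions, not values reported in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# train_labels: list of sets of EUROVOC concept ids per document (assumed)
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_labels)

clf = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, min_df=5),   # local + global term weighting
    OneVsRestClassifier(LogisticRegression(max_iter=1000), n_jobs=-1),
)
clf.fit(train_texts, Y)

scores = clf.predict_proba(test_texts)              # one confidence per label
top5 = scores.argsort(axis=1)[:, -5:]               # 5 highest-scoring labels per doc
```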
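A minimal sketch of the LUKE fine-tuning setup, assuming the Hugging Face checkpoint named above; the pooling strategy, data pipeline, and training loop are omitted, and `num_labels` is an assumed variable. The last lines show the "first 9 blocks frozen" ablation used in Section 4.1.

```python
import torch
from transformers import LukeModel, LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
encoder = LukeModel.from_pretrained("studio-ousia/luke-base")

# The original classification layer is replaced by a head trained from scratch,
# with dropout p=0.1 applied on the classification layer as reported above.
classifier = torch.nn.Sequential(
    torch.nn.Dropout(p=0.1),
    torch.nn.Linear(encoder.config.hidden_size, num_labels),  # num_labels: assumed
)
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(classifier.parameters()),
    lr=1e-5, weight_decay=0.01,
)

# "First 9 blocks frozen" variant: disable gradients for the early layers.
for block in encoder.encoder.layer[:9]:
    for p in block.parameters():
        p.requires_grad = False
```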
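The Llama 2 PEFT setting could be configured with the `peft` library as sketched below, using the reported hyper-parameters (r = 64, α = 16, 8-bit quantization, learning rate 1.4e-5). The checkpoint name is an assumption, and the generative prompt format used to cast XMC as label-text generation is omitted.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base weights stay frozen; LoRA injects trainable rank-r update matrices.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",     # assumed checkpoint id
    load_in_8bit=True,              # 8-bit quantized version, as in the paper
)
lora_cfg = LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

optimizer = torch.optim.AdamW(model.parameters(), lr=1.4e-5)
```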
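For reference, the metrics above admit a direct computation such as the following sketch, where `y_true` is a dense binary gold-label matrix and `y_score` a matrix of model confidences (both assumed shapes `(n_docs, n_labels)`).

```python
import numpy as np

def precision_recall_at_k(y_true, y_score, k=5):
    """P@k and R@k averaged over documents."""
    topk = np.argsort(-y_score, axis=1)[:, :k]
    hits = np.take_along_axis(y_true, topk, axis=1)   # 1 where a top-k label is gold
    tp = hits.sum(axis=1)
    precision = (tp / k).mean()                       # FP_k = k - TP_k per document
    recall = (tp / np.maximum(y_true.sum(axis=1), 1)).mean()
    return precision, recall

def mean_r_precision(y_true, y_score):
    """mRP: P@k with k set to each document's number of gold labels."""
    vals = []
    for t, s in zip(y_true, y_score):
        k = int(t.sum())
        if k == 0:
            continue
        topk = np.argsort(-s)[:k]
        vals.append(t[topk].sum() / k)
    return float(np.mean(vals))
```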
4.1. Results

Performance comparison with different training strategies. We conducted experiments with different training procedures in order to test the performance of the proposed methodology and to compare it with that of different architectures. To this end, we first froze the first 9 attention blocks and fine-tuned the classification layer to assess the quality of the hidden representations of our model. Secondly, we performed an end-to-end evaluation of the proposed model to fully assess its potential.

Table 1: Models comparison
Model                                          mRP
Logistic Regression                            0.21
State-of-the-art [9] (first 9 blocks frozen)   0.27
Our approach (first 9 blocks frozen)           0.33
State-of-the-art [9] (end-to-end training)     0.67
Our approach (end-to-end training)             0.68
LLama 2 7B [12]                                0.65

Table 1 reports the overall performance of our model with different training strategies. Our results show that the proposed method performs better than both the state-of-the-art model and the Large Language Model Llama 2. Notably, the model attains superior performance compared to the state-of-the-art competitor even when the first 9 attention blocks are kept fixed. This suggests the efficacy of our model in generating highly informative hidden representations that enhance the classification task.

Zero-shot performance comparison. We conducted a comparative analysis of the performance of our model and its competitors on zero-shot labels (i.e., labels not present in the training set). In this case, we trained all models without freezing any model layers. We report the results in Table 2 in terms of Precision@5 and Recall@5. Our evaluation focuses on the model's ability to retrieve all relevant results without any prior knowledge about the labels. The number of predictions considered is always five, in compliance with [20].

Table 2: Comparison in zero-shot learning context
Model                    R@5      P@5
Logistic Regression      0.001    0.001
State-of-the-art [9]     0.028    0.006
Our approach             0.087    0.164
LLama 2 7B [12]          0.253    0.056

Our results indicate that the baseline model performs poorly in this zero-shot learning context, with very low scores for both Precision@5 and Recall@5. The state-of-the-art model exhibits slightly higher scores, but still performs worse than the model proposed in this work. The proposed method achieves significantly higher Precision@5 scores, indicating its superiority over the other two models in this zero-shot learning context. These results demonstrate the accuracy of our proposed model and the completeness of its predictions. Interestingly, the LLM demonstrates superior Recall@5 performance, even though its overall results are worse.

Comparison between models' attention. To further support the efficacy of the entity-aware self-attention mechanism for the given task, we examine the attention scores obtained by the best overall models according to the results in Table 1. For each class, we compute the mean token attention scores assigned by the state-of-the-art and LUKE models, considering the last attention layer.¹ We sorted the results in decreasing order, ranking the tokens according to the attention given by the model. Then, separately for each class $c \in C$, the Mean Reciprocal Rank (MRR) of model $m_i$ with respect to the $k$ most frequent tokens of class $c$ was computed, i.e.:

$$\mathrm{MRR}_{m_i,c,k} = \mathrm{MRR}(R_{a(m_i)}, k_c) \quad (1)$$

where $R_{a(m_i)}$ denotes the attention ranking positions, according to model $m_i$, of the $k$ most frequent tokens of class $c \in C$. We then compute the MRR difference between our model and the state-of-the-art model for different values of $k$:

$$\mathrm{MRR}_k = \frac{1}{|C|} \sum_{c \in C} \left( \mathrm{MRR}_{\mathrm{LUKE},c,k} - \mathrm{MRR}_{\mathrm{SOTA},c,k} \right) \quad (2)$$

where:

• $\mathrm{MRR}_{\mathrm{LUKE},c,k}$ is the Mean Reciprocal Rank computed with the LUKE model ranking, for class $c \in C$, considering the $k$ most frequent terms;
• $\mathrm{MRR}_{\mathrm{SOTA},c,k}$ is the Mean Reciprocal Rank computed with the state-of-the-art model ranking, for class $c \in C$, considering the $k$ most frequent terms.

These values are reported in Figures 1 and 2, which consider the frequent and zero-shot labels, respectively. Scores above zero indicate that, on average, our model gives more attention to the most frequent terms of the classes. These results reveal that our model gives more attention to the terms that appear more frequently in each class, especially in correspondence with zero-shot labels, although the differences decrease as $k$ increases.

¹ We consider the last attention layer because it is the closest to the classification layer.

[Figure 1: Comparison of Mean Reciprocal Rank (MRR) differences in token attention scores between our proposed model and the state-of-the-art model for various values of k, computed considering frequent labels. Positive scores indicate that our model assigns higher attention to the most frequent terms of each class.]

[Figure 2: Comparison of Mean Reciprocal Rank (MRR) differences in token attention scores between our proposed model and the state-of-the-art model for various values of k, computed considering only zero-shot labels. Positive scores indicate that our model assigns higher attention to the most frequent terms of each class.]
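As an illustration of Equations (1) and (2), the MRR gap could be computed as in the following sketch; the helper names and input structures (token-to-rank mappings with 1-based ranks, and a per-class list of the k most frequent tokens) are our assumptions about the analysis, not the authors' code.

```python
import numpy as np

def mrr(rank_positions):
    """Mean reciprocal rank of target tokens, given their 1-based positions
    in a model's attention-sorted token ranking (Eq. 1)."""
    return float(np.mean([1.0 / r for r in rank_positions]))

def mrr_difference(rank_luke, rank_sota, top_tokens_per_class):
    """Average per-class MRR gap between the two models (Eq. 2).

    rank_luke, rank_sota:  dict token -> 1-based rank in each model's ranking
    top_tokens_per_class:  dict class -> k most frequent tokens of that class
    """
    diffs = []
    for c, tokens in top_tokens_per_class.items():
        mrr_luke = mrr([rank_luke[t] for t in tokens])
        mrr_sota = mrr([rank_sota[t] for t in tokens])
        diffs.append(mrr_luke - mrr_sota)       # > 0: LUKE ranks the terms higher
    return float(np.mean(diffs))
```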
5. Conclusion and future work

In this paper we explored the use of an entity-aware attention-based method for eXtreme Multi-label Classification of law documents. We showed that attending to entity-related tokens enhances the capability of the transformer to attend to class-related pieces of text. The proposed method shows performance superior to both state-of-the-art transformers and Large Language Models, achieving higher precision and recall scores, especially in the most challenging zero-shot learning context. The experiments also highlight the impact of different training strategies and the effectiveness of the proposed model in generating informative hidden representations.

Based on these preliminary results, we envision the following future research directions:

• Cross-lingual Transfer: we plan to study the models' performance in the zero-shot cross-lingual transfer scenario for legal text classification in languages other than English.
• LLM Fine-tuning Strategies: another line of research will be the exploration of additional LLM fine-tuning strategies that incorporate hierarchical clustering [29].
Acknowledgments

The research leading to these results has been partially supported by the SmartData@PoliTO Center for Big Data Technologies. This study was partially carried out within the MICS (Made in Italy – Circular and Sustainable) Extended Partnership, which received funding from Next-GenerationEU (Italian PNRR – M4 C2, Invest 1.3 – D.D. 1551.11-10-2022, PE00000004), and within the FAIR – Future Artificial Intelligence Research – partnership, which received funding from the European Union Next-GenerationEU (PNRR MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 D.D. 1555 11/10/2022, PE00000013). This paper reflects only the authors' views and opinions; neither the European Union nor the European Commission can be considered responsible for them.
References

[1] H.-F. Yu, K. Zhong, I. S. Dhillon, W.-C. Wang, Y. Yang, X-BERT: extreme multi-label text classification using bidirectional encoder representations from transformers, in: NeurIPS 2019 Workshop on Science Meets Engineering of Deep Learning, 2019.
[2] W. Chang, H. Yu, K. Zhong, Y. Yang, I. S. Dhillon, A modular deep learning approach for extreme multi-label text classification, CoRR abs/1905.02331 (2019). URL: http://arxiv.org/abs/1905.02331.
[3] R. Agrawal, A. Gupta, Y. Prabhu, M. Varma, Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages, in: Proceedings of the 22nd International Conference on World Wide Web, WWW '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 13–24. doi:10.1145/2488388.2488391.
[4] A. Johnson, T. Pollard, L. Shen, L.-w. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Celi, R. Mark, MIMIC-III, a freely accessible critical care database, Scientific Data 3 (2016) 160035. doi:10.1038/sdata.2016.35.
[5] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-scale multi-label text classification on EU legislation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 6314–6322. doi:10.18653/v1/P19-1636.
[6] I. Angelidis, I. Chalkidis, M. Koubarakis, Named entity recognition, linking and generation for Greek legislation, in: JURIX, 2018.
[7] D. Hendrycks, C. Burns, A. Chen, S. Ball, CUAD: an expert-annotated NLP dataset for legal contract review, CoRR abs/2103.06268 (2021). URL: https://arxiv.org/abs/2103.06268.
[8] D. Jain, M. D. Borah, A. Biswas, Summarization of legal documents: Where are we now and the way forward, Computer Science Review 40 (2021) 100388. doi:10.1016/j.cosrev.2021.100388.
[9] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, MultiEURLEX – a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, 2021. URL: https://arxiv.org/abs/2109.00904. doi:10.48550/ARXIV.2109.00904.
[10] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, in: Proceedings of the Natural Legal Language Processing Workshop 2019, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 78–87. doi:10.18653/v1/W19-2209.
[11] I. Yamada, A. Asai, H. Shindo, H. Takeda, Y. Matsumoto, LUKE: Deep contextualized entity representations with entity-aware self-attention, in: EMNLP, 2020.
[12] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[13] O. Sulea, M. Zampieri, S. Malmasi, M. Vela, L. P. Dinu, J. van Genabith, Exploring the use of text classification in the legal domain, CoRR abs/1710.09306 (2017). URL: http://arxiv.org/abs/1710.09306.
[14] J. Gao, H. Ning, Z. Han, L. Kong, H. Qi, Legal text classification model based on text statistical features and deep semantic features, in: Working Notes of FIRE 2020 – Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 35–41. URL: http://ceur-ws.org/Vol-2826/T1-7.pdf.
[15] H. Chen, L. Wu, J. Chen, W. Lu, J. Ding, A comparative study of automated legal text classification using random forests and deep learning, Information Processing & Management 59 (2022) 102798. doi:10.1016/j.ipm.2021.102798.
[16] A. Aguiar, R. Silveira, V. Pinheiro, V. Furtado, J. A. Neto, Text classification in legal documents extracted from lawsuits in Brazilian courts, in: A. Britto, K. Valdivia Delgado (Eds.), Intelligent Systems, Springer International Publishing, Cham, 2021, pp. 586–600.
[17] E. Loza Mencía, J. Fürnkranz, Efficient Multilabel Classification Algorithms for Large-Scale Problems in the Legal Domain, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 192–215.
[18] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. doi:10.18653/v1/2020.findings-emnlp.261.
[19] C. Papaloukas, I. Chalkidis, K. Athinaios, D. Pantazi, M. Koubarakis, Multi-granular legal topic classification on Greek legislation, CoRR abs/2109.15298 (2021). URL: https://arxiv.org/abs/2109.15298.
[20] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, MultiEURLEX – a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, CoRR abs/2109.00904 (2021). URL: https://arxiv.org/abs/2109.00904.
[21] X. Huang, B. Chen, L. Xiao, L. Jing, Label-aware document representation via hybrid attention for extreme multi-label text classification, CoRR abs/1905.10070 (2019). URL: http://arxiv.org/abs/1905.10070.
[22] W. Zhao, H. Peng, S. Eger, E. Cambria, M. Yang, Towards scalable and reliable capsule networks for challenging NLP applications, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 2019, pp. 1549–1559. doi:10.18653/v1/P19-1150.
[23] P. Henderson, M. S. Krass, L. Zheng, N. Guha, C. D. Manning, D. Jurafsky, D. E. Ho, Pile of Law: Learning responsible data filtering from the law and a 256GB open-source legal dataset, in: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL: https://openreview.net/forum?id=3HCT3xfNm9r.
[24] S. Paul, A. Mandal, P. Goyal, S. Ghosh, Pre-training transformers on Indian legal text, arXiv preprint arXiv:2209.06049 (2022). URL: https://arxiv.org/abs/2209.06049.
[25] H. Nguyen, A brief report on LawGPT 1.0: A virtual legal assistant based on GPT-3, arXiv preprint arXiv:2302.05729 (2023).
[26] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, Z. Wu, Y. Feng, Lawyer LLaMA technical report, arXiv preprint arXiv:2305.15062 (2023).
[27] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, ChatLaw: Open-source legal large language model with integrated external knowledge bases, arXiv preprint arXiv:2306.16092 (2023).
[28] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
[29] T. Jung, J.-K. Kim, S. Lee, D. Kang, Cluster-guided label generation in extreme multi-label classification, in: EACL 2023, 2023.
[30] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2017.
[31] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, C. Raffel, Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022. arXiv:2205.05638.
[32] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/2106.09685.