=Paper=
{{Paper
|id=Vol-3651/DARLI-AP_paper5
|storemode=property
|title=Extreme Classification of European Union Law Documents Driven by Entity Embeddings
|pdfUrl=https://ceur-ws.org/Vol-3651/DARLI-AP-5.pdf
|volume=Vol-3651
|authors=Irene Benedetto,Luca Cagliero,Francesco Tarasconi
|dblpUrl=https://dblp.org/rec/conf/edbt/BenedettoCT24
}}
==Extreme Classification of European Union Law Documents Driven by Entity Embeddings==
Irene Benedetto¹,²,*, Luca Cagliero¹ and Francesco Tarasconi²

¹ Politecnico di Torino, Dipartimento di Automatica e Informatica, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
² MAIZE, Via San Quintino 31, 10121 Torino, Italy
Abstract

Extreme Multi-label Classification (XMC) is the task of labeling documents with one or more labels from a large set of classes. In the context of Legal Artificial Intelligence, XMC is relevant to the automatic categorization of documents, as they commonly address several orthogonal categorization schemes. Since retrieving a sufficient number of training document examples per class is challenging, XMC models are expected to be particularly effective in zero-shot learning scenarios. Existing approaches rely on transformer-based classification models, which leverage the attention mechanism to attend to specific textual units. However, classical attention scores cannot differentiate between domain-specific and generic textual units. In this paper, we propose a legal entity-aware approach to zero-shot XMC of European Union law documents. By integrating information about domain-specific legal entities, we ease the detection of label-sensitive information and prevent XMC models from attending to irrelevant or wrong text spans. The results achieved on the law documents available in the EURLex benchmark show that our approach is superior to both previous transformer-based approaches and open-source Large Language Models.

Keywords

Legal Artificial Intelligence, Extreme Multi-label Classification, Language Models, Law Documents
Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy.
* Corresponding author: irene.benedetto@{polito.it,maize.io} (I. Benedetto); luca.cagliero@polito.it (L. Cagliero); francesco.tarasconi@maize.io (F. Tarasconi).
ORCID: 0000-0001-7086-7898 (I. Benedetto); 0000-0002-7185-5247 (L. Cagliero).
© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The task of eXtreme Multi-label Classification (XMC) aims at assigning to a given text one or more pertinent labels shortlisted from a very large set of classes. Since some of the target classes are likely to be underrepresented or even absent in the training data, classifiers used for XMC are expected to be particularly effective in zero-shot learning scenarios [1, 2].

Transformer-based architectures have proved to be particularly effective in tackling XMC [1] in various application domains such as e-commerce [3], medical diagnosis [4], and legal AI [5]. This paper focuses on solving the XMC task in a particular legal sub-domain, i.e., the automatic classification of law documents.

Legal documents such as laws have peculiar characteristics that make the classification task inherently complex. Firstly, the vocabulary used is very technical and rich in domain-specific expressions and entities [6]. Secondly, legal documents typically have a peculiar structure, making content retrieval and ranking particularly challenging [7]. Lastly, the contained text is often verbose, as it usually contains many preliminaries and repetitions [8].

Benchmark datasets for law classification such as EURLex [5, 9] contain acts and proposals of the European legislation. To support their retrieval and exploration, law documents are often annotated by Publication Offices with a very large number of labels (e.g., 4,271 labels in EURLex), which encompass frequent labels as well as few- and zero-shot ones. Therefore, automating the process of law document classification requires the use of accurate XMC models.

In this paper we aim at overcoming the main limitations of existing transformer-based approaches to law document classification (e.g., [10]), which leverage the attention mechanism to attend to the most salient textual units. Since attention scores do not differentiate between legal and general-purpose textual units, the capabilities of transformers to correctly assign law document categories can be limited, particularly in zero-shot learning contexts. To overcome this issue, we propose to adopt an entity-aware attention mechanism based on the LUKE transformer [11], which exploits the semantic characteristics of the domain by means of entity embeddings to enhance zero-shot classification. The key idea is to mainly consider the textual dependencies with the tokens associated with entities, as they are most likely to be discriminating in law document classification. The experiments carried out on the EURLex benchmark dataset [9] confirm the effectiveness of entity embeddings in enhancing zero-shot XMC performance. Notably, the proposed approach not only performs better than existing transformer-based methods but also turns out to be more effective than an open-source Large Language Model with a larger number of parameters, i.e., Llama 2 7B [12].

The remainder of this work is organized as follows. Section 2 reviews the existing literature, Section 3 describes the methodology, Section 4 presents the main experimental results, and Section 5 draws the conclusions of this work.

2. Related work

Legal document classification. The most common case of document classification in the legal domain is the automatic categorization of court cases, where the goal is to predict the law area of the given case. Existing related works mainly focused on employing machine learning and deep learning solutions [13, 14, 15, 16]. Parallel studies have delved into the automatic text classification of legislation to discern the law topic, with a particular emphasis on monolingual datasets [10, 17, 18, 19, 20, 21, 22]. A more limited body of work has explored multi-lingual datasets of legislations [9]. Specifically, the work presented in [21] investigates the semantic relationship between each document and the labels. However, their performance on English documents is limited. Conversely, the transformer-based approaches proposed in [9, 10, 18] are, to the best of our knowledge, state-of-the-art on English-written law documents. Unlike [9, 10, 18], our work focuses on leveraging entity information in law classification. To the best of our knowledge, the idea to boost the performance of transformer-based approaches to law document classification using entity embeddings has not been addressed in the literature so far.

Transformers in Legal Artificial Intelligence. Transformer-based models have demonstrated promising results in several areas of legal AI. In particular, pre-trained language models have proved to be effective in tackling various downstream tasks [18, 23, 24], encompassing legal entity recognition [6], legal question answering [7], and legal document summarization [8]. Language models have been designed and fine-tuned for the legal domain as well, mainly on Chinese documents. For example, LaWGPT [25] is pre-trained using a large-scale Chinese legal text database. Lawyer LLaMA [26] is a Chinese Legal Large Language Model (LLM) that undergoes training on a substantial legal dataset. This model is capable of offering legal advice, analyzing legal cases, and generating legal articles. ChatLaw [27] comprises a collection of open-source legal LLMs in Chinese, including models like ChatLaw-13B and ChatLaw-33B. These models are trained on a vast dataset encompassing legal news, forums, and judicial interpretations. Existing legal LLMs are suited to Chinese documents only and are not specifically designed to tackle the eXtreme Multi-label Classification task.

3. Methodology

In this section, we describe the proposed methodology for eXtreme Multi-label Classification (XMC) of law documents. Our purpose is to tackle XMC in a zero-shot setting, i.e., in the absence of ad hoc training examples. To address this issue, we propose to recognize and use entity embeddings in the document text. Specifically, we leverage the pre-trained LUKE model [11] for the classification task by replacing the original classification layer with one trained from scratch on the benchmark dataset. LUKE is a pre-trained contextualized representation of words and entities based on the transformer architecture. It produces contextualized representations of both words and entities thanks to the entity-aware self-attention mechanism, an extension of the self-attention mechanism when computing attention scores.

Given a sequence of input vectors x₁, x₂, ..., x_k, where x_i ∈ R^D, the attention score e_ij is computed as follows:

    e_ij = Kx_j⊤ Q x_i,       if both x_i and x_j are words
    e_ij = Kx_j⊤ Q_w2e x_i,   if x_i is a word and x_j is an entity
    e_ij = Kx_j⊤ Q_e2w x_i,   if x_i is an entity and x_j is a word
    e_ij = Kx_j⊤ Q_e2e x_i,   if both x_i and x_j are entities

where Q, Q_w2e, Q_e2w, Q_e2e ∈ R^{L×D} are query matrices and K ∈ R^{L×D} is the key matrix.

4. Experiments

Dataset. In our experiments, we consider the English portion of the EURLEX dataset [9], a multi-label legal document classification dataset. It consists of 65k European Union (EU) laws annotated with the EUROVOC taxonomy labels. The EUROVOC taxonomy is a multilingual classification and thesaurus system used by the European Union. This tool is designed to organize and categorize concepts and terms used in official EU documents, facilitating research and access to information. Each European act in the EURLEX dataset is associated with one or more EUROVOC concepts. Similarly to [9], we focus on third-level labels. For training and testing our models, we follow the dataset split provided by the respective authors.

Competitors. We compare our methodology with:

• Logistic Regression: a baseline consisting of a Term Frequency-Inverse Document Frequency (TF-IDF) encoder, counting both local and global frequencies of occurrence of the input tokens, and a logistic regression model trained on top of the encoded text.
• RoBERTa [9]: builds on BERT [28], removing the next-sentence pre-training objective and training with much larger mini-batches and learning rates.
• Llama 2 7B [12]: a pre-trained Large Language Model with approximately 7 billion parameters that showcases remarkable performance in both few-shot and zero-shot scenarios. Analogously to [29], to compare with LLMs we treated the XMC task as a generative problem.

Experimental setting. We fine-tuned the base version of the LUKE model (studio-ousia/luke-base) for 10 epochs. This model was trained with the AdamW optimizer [30], with a weight decay of 0.01 and a learning rate of 1e-5. During training, we applied a 0.1 probability of dropout on the classification layer.

For the sake of fairness, Llama 2 7B has been trained with Parameter-Efficient Fine-Tuning (PEFT) [31], specifically LoRA [32], which freezes the pre-trained model weights and introduces trainable rank decomposition matrices into each layer of the model's architecture. We trained the 8-bit quantized version of this model for a maximum of 3 epochs, with a learning rate of 1.4e-5, LoRA α = 16, and r = 64.

Metrics. Here we describe the metrics used to evaluate the performance of the models in our study.

• R@5 and P@5: precision and recall at k predictions, where k is equal to 5 in our dataset; this corresponds to the mean number of labels per document in the training set.

    Precision@k = TP_k / (TP_k + FP_k)
    Recall@k = TP_k / (TP_k + FN_k)

• mRP: for each document, the metric ranks the labels selected by the model by decreasing confidence, computes Precision@k, where k is the document's number of gold labels, and then averages the results over documents.

Hardware. We conducted all the experiments on a single NVidia® Tesla® V100 GPU with 16 GB of memory, running on Ubuntu 22.04 LTS.
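To make the entity-aware attention score of Section 3 concrete, the sketch below computes the four-case score e_ij = (Kx_j)⊤(Q_type x_i) with NumPy. This is a minimal toy illustration, not LUKE's actual implementation: the dimensions, random weights, and the word/entity split of the sequence are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

D, L_DIM = 8, 8   # toy hidden size D and projection size L (assumed values)
k = 5             # sequence length: 3 word tokens + 2 entity tokens

is_entity = [False, False, False, True, True]  # token type per position
X = rng.normal(size=(k, D))                    # input vectors x_1 ... x_k

# One shared key matrix K and four query matrices, one per token-type pair,
# mirroring Q, Q_w2e, Q_e2w, Q_e2e in the equation above.
K = rng.normal(size=(L_DIM, D))
Q = {
    (False, False): rng.normal(size=(L_DIM, D)),  # word attends to word (Q)
    (False, True):  rng.normal(size=(L_DIM, D)),  # word -> entity (Q_w2e)
    (True, False):  rng.normal(size=(L_DIM, D)),  # entity -> word (Q_e2w)
    (True, True):   rng.normal(size=(L_DIM, D)),  # entity -> entity (Q_e2e)
}

def entity_aware_scores(X, is_entity):
    """Raw (unnormalized) scores e_ij = (K x_j)^T (Q_type x_i)."""
    n = X.shape[0]
    e = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            q_i = Q[(is_entity[i], is_entity[j])] @ X[i]  # query picked by type pair
            e[i, j] = (K @ X[j]) @ q_i                    # dot with key of x_j
    return e

scores = entity_aware_scores(X, is_entity)
print(scores.shape)  # (5, 5): one raw score per (i, j) pair
```

The design point the equation encodes is that only the query projection is type-dependent while the key projection is shared, so the model can learn to weight word-entity interactions differently from word-word ones; a row-wise softmax over `scores` would then yield attention weights as in standard self-attention.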
4.1. Results

Performance comparison with different training strategies. We conducted experiments with different training procedures in order to test the performance of the proposed methodology and to compare it with that of different architectures. To this end, we first froze the first 9 attention blocks and fine-tuned the classification layer to test the goodness of the hidden representations of our model. Secondly, we performed an end-to-end evaluation of the proposed model to fully assess its potential.

Table 1
Models comparison

    Models                                         mRP
    Logistic Regression                            0.21
    State-of-the-art [9] (first 9 blocks frozen)   0.27
    Our approach (first 9 blocks frozen)           0.33
    State-of-the-art [9] (end-to-end training)     0.67
    Our approach (end-to-end training)             0.68
    Llama 2 7B [12]                                0.65

Table 1 reports the overall performance of our model with different training strategies. Our results show that the proposed method performs better than both the state-of-the-art model and the Large Language Model Llama 2. Notably, the model attains superior performance compared to the state-of-the-art competitor even when the first 9 attention blocks are kept fixed. This suggests the efficacy of our model in generating highly informative hidden representations that enhance the classification task.

Zero-shot performance comparison. We conducted a comparative analysis of the performance of our model and its competitors on zero-shot labels (i.e., labels not present in the training set). In this case, we trained all models without employing any freezing of model layers.

Table 2
Comparison in the zero-shot learning context

    Models                  R@5     P@5
    Logistic Regression     0.001   0.001
    State-of-the-art [9]    0.028   0.006
    Our approach            0.087   0.164
    Llama 2 7B [12]         0.253   0.056

We report the results in Table 2 in terms of Precision@5 and Recall@5. Our evaluation focuses on the model's ability to retrieve all relevant results without any knowledge about the labels. The number of predictions considered is always five, in compliance with [20].

Our results indicate that the baseline model performs poorly in this zero-shot learning context, with very low scores for both Precision@5 and Recall@5. The state-of-the-art model exhibits slightly higher scores, but still performs worse than the model proposed in this work. The proposed method achieves significantly higher Precision@5 and Recall@5 scores, indicating its superiority over the other two models in this zero-shot learning context. These results demonstrate the accuracy of our proposed model and the completeness of the model's predictions. Interestingly, LLMs demonstrate superior Recall@5 performance, even though their overall results are worse.

Comparison between models' attention. To further support the efficacy of the entity-aware self-attention mechanism for the given task, we examine the attention scores obtained by the best overall models according to the results in Table 1. For each class, we compute the mean token attention score assigned by the state-of-the-art and LUKE models, considering the last attention layer¹. We sorted the results in decreasing order, ranking the tokens according to the attention given by the model. Then, separately for each class c ∈ C, the Mean Reciprocal Rank (MRR) of model m_i with the most frequent k tokens of class c was computed, i.e.:

    MRR_{m_i,c,k} = MRR(R_{a(m_i)}, k_c)    (1)

where R_{a(m_i)} is the attention ranking position, according to model m_i, of the k most frequent tokens of class c ∈ C.

We then compute the MRR difference between our model and the state-of-the-art model for different values of k:

    MRR_k = (1/|C|) Σ_{c∈C} (MRR_{LUKE,c,k} − MRR_{SOTA,c,k})    (2)

where

• MRR_{LUKE,c,k} is the Mean Reciprocal Rank computed with the LUKE model ranking, for class c ∈ C, considering the k most frequent terms.
• MRR_{SOTA,c,k} is the Mean Reciprocal Rank computed with the state-of-the-art model ranking, for class c ∈ C, considering the k most frequent terms.

These values are reported in Figures 1 and 2, which consider the frequent and zero-shot labels, respectively. Scores above zero indicate that, on average, our model gives more attention to the most frequent terms of the classes. These results reveal that our model gives more attention to the terms that appear more frequently in each class, especially for the zero-shot labels, although the differences decrease as k increases.

Figure 1: Comparison of Mean Reciprocal Rank (MRR) differences in token attention scores between our proposed model and the state-of-the-art model for various values of k, computed considering frequent labels. Positive scores indicate that our model assigns higher attention to the most frequent terms of each class.

¹ We consider the last attention head because it is the closest to the classification layer.
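To make Equations (1) and (2) concrete, the sketch below computes the MRR of a class's top-k tokens inside a model's attention ranking and averages the LUKE-vs-SOTA difference over classes. The class names, token lists, and rankings are toy assumptions for illustration, not data from the paper.

```python
from statistics import mean

def mrr(ranking, tokens):
    """Mean reciprocal rank of `tokens` within `ranking`, where
    `ranking` lists tokens by decreasing attention (ranks are 1-based)."""
    pos = {tok: r + 1 for r, tok in enumerate(ranking)}
    return mean(1.0 / pos[t] for t in tokens if t in pos)

def mrr_difference(luke_rank, sota_rank, top_tokens, k):
    """Eq. (2): mean over classes of MRR_LUKE,c,k - MRR_SOTA,c,k,
    each term computed on the k most frequent tokens of class c (Eq. (1))."""
    return mean(
        mrr(luke_rank[c], toks[:k]) - mrr(sota_rank[c], toks[:k])
        for c, toks in top_tokens.items()
    )

# Toy example with a single class; tokens are ordered by decreasing attention.
top_tokens = {"agriculture": ["farm", "crop", "subsidy"]}
luke = {"agriculture": ["farm", "crop", "tax", "subsidy"]}
sota = {"agriculture": ["tax", "farm", "subsidy", "crop"]}
print(mrr_difference(luke, sota, top_tokens, k=2))  # 0.375
```

A positive value, as in Figures 1 and 2, means the entity-aware model ranks the class-characteristic terms higher in its attention ordering than the state-of-the-art model does.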
Figure 2: Comparison of Mean Reciprocal Rank (MRR) differences in token attention scores between our proposed model and the state-of-the-art model for various values of k, computed considering only zero-shot labels. Positive scores indicate that our model assigns higher attention to the most frequent terms of each class.

5. Conclusion and future work

In this paper we explored the use of an entity-aware attention-based method for eXtreme Multi-label Classification of law documents. We showed that attending to entity-related tokens enhances the capability of the transformer to attend to class-related pieces of text. The proposed method shows performance superior to both state-of-the-art transformers and Large Language Models, achieving higher precision and recall scores, especially in the most challenging zero-shot learning context. The experiments also highlight the impact of different training strategies and the effectiveness of the proposed model in generating informative hidden representations.

Based on the preliminary results, we envision the following future research directions:

• Cross-lingual transfer: we plan to study the models' performance in the zero-shot cross-lingual transfer scenario for legal text classification in languages other than English.
• LLM fine-tuning strategies: another line of research will be the exploration of additional LLM fine-tuning strategies that incorporate hierarchical clustering [29].

Acknowledgments

The research leading to these results has been partially supported by the SmartData@PoliTO Center for Big Data Technologies. This study was partially carried out within the MICS (Made in Italy – Circular and Sustainable) Extended Partnership and received funding from Next-GenerationEU (Italian PNRR – M4 C2, Invest 1.3 – D.D. 1551.11-10-2022, PE00000004), and within the FAIR – Future Artificial Intelligence Research – partnership, which received funding from the European Union Next-GenerationEU (PNRR MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 D.D. 1555 11/10/2022, PE00000013). This paper reflects only the authors' views and opinions; neither the European Union nor the European Commission can be considered responsible for them.

References

[1] H.-F. Yu, K. Zhong, I. S. Dhillon, W.-C. Wang, Y. Yang, X-BERT: extreme multi-label text classification using bidirectional encoder representations from transformers, in: NeurIPS 2019 Workshop on Science Meets Engineering of Deep Learning, 2019.
[2] W. Chang, H. Yu, K. Zhong, Y. Yang, I. S. Dhillon, A modular deep learning approach for extreme multi-label text classification, CoRR abs/1905.02331 (2019). URL: http://arxiv.org/abs/1905.02331.
[3] R. Agrawal, A. Gupta, Y. Prabhu, M. Varma, Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages, in: Proceedings of the 22nd International Conference on World Wide Web, WWW '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 13–24. doi:10.1145/2488388.2488391.
[4] A. Johnson, T. Pollard, L. Shen, L.-w. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Celi, R. Mark, MIMIC-III, a freely accessible critical care database, Scientific Data 3 (2016) 160035. doi:10.1038/sdata.2016.35.
[5] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-scale multi-label text classification on EU legislation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 6314–6322. doi:10.18653/v1/P19-1636.
[6] I. Angelidis, I. Chalkidis, M. Koubarakis, Named entity recognition, linking and generation for Greek legislation, in: JURIX, 2018.
[7] D. Hendrycks, C. Burns, A. Chen, S. Ball, CUAD: an expert-annotated NLP dataset for legal contract review, CoRR abs/2103.06268 (2021). URL: https://arxiv.org/abs/2103.06268.
[8] D. Jain, M. D. Borah, A. Biswas, Summarization of legal documents: Where are we now and the way forward, Computer Science Review 40 (2021) 100388. doi:10.1016/j.cosrev.2021.100388.
[9] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, MultiEURLEX – a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, 2021. URL: https://arxiv.org/abs/2109.00904. doi:10.48550/ARXIV.2109.00904.
[10] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, in: Proceedings of the Natural Legal Language Processing Workshop 2019, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 78–87. doi:10.18653/v1/W19-2209.
[11] I. Yamada, A. Asai, H. Shindo, H. Takeda, Y. Matsumoto, LUKE: Deep contextualized entity representations with entity-aware self-attention, in: EMNLP, 2020.
[12] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[13] O. Sulea, M. Zampieri, S. Malmasi, M. Vela, L. P. Dinu, J. van Genabith, Exploring the use of text classification in the legal domain, CoRR abs/1710.09306 (2017). URL: http://arxiv.org/abs/1710.09306.
[14] J. Gao, H. Ning, Z. Han, L. Kong, H. Qi, Legal text classification model based on text statistical features and deep semantic features, in: Working Notes of FIRE 2020 – Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 35–41. URL: http://ceur-ws.org/Vol-2826/T1-7.pdf.
[15] H. Chen, L. Wu, J. Chen, W. Lu, J. Ding, A comparative study of automated legal text classification using random forests and deep learning, Information Processing & Management 59 (2022) 102798. doi:10.1016/j.ipm.2021.102798.
[16] A. Aguiar, R. Silveira, V. Pinheiro, V. Furtado, J. A. Neto, Text classification in legal documents extracted from lawsuits in Brazilian courts, in: A. Britto, K. Valdivia Delgado (Eds.), Intelligent Systems, Springer International Publishing, Cham, 2021, pp. 586–600.
[17] E. Loza Mencía, J. Fürnkranz, Efficient Multilabel Classification Algorithms for Large-Scale Problems in the Legal Domain, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 192–215.
[18] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. doi:10.18653/v1/2020.findings-emnlp.261.
[19] C. Papaloukas, I. Chalkidis, K. Athinaios, D. Pantazi, M. Koubarakis, Multi-granular legal topic classification on Greek legislation, CoRR abs/2109.15298 (2021). URL: https://arxiv.org/abs/2109.15298.
[20] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, MultiEURLEX – a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, CoRR abs/2109.00904 (2021). URL: https://arxiv.org/abs/2109.00904.
[21] X. Huang, B. Chen, L. Xiao, L. Jing, Label-aware document representation via hybrid attention for extreme multi-label text classification, CoRR abs/1905.10070 (2019). URL: http://arxiv.org/abs/1905.10070.
[22] W. Zhao, H. Peng, S. Eger, E. Cambria, M. Yang, Towards scalable and reliable capsule networks for challenging NLP applications, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1549–1559. doi:10.18653/v1/P19-1150.
[23] P. Henderson, M. S. Krass, L. Zheng, N. Guha, C. D. Manning, D. Jurafsky, D. E. Ho, Pile of Law: Learning responsible data filtering from the law and a 256GB open-source legal dataset, in: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL: https://openreview.net/forum?id=3HCT3xfNm9r.
[24] S. Paul, A. Mandal, P. Goyal, S. Ghosh, Pre-training transformers on Indian legal text, arXiv preprint arXiv:2209.06049 (2022). URL: https://arxiv.org/abs/2209.06049.
[25] H. Nguyen, A brief report on LawGPT 1.0: A virtual legal assistant based on GPT-3, arXiv preprint arXiv:2302.05729 (2023).
[26] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, Z. Wu, Y. Feng, Lawyer LLaMA technical report, arXiv preprint arXiv:2305.15062 (2023).
[27] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, ChatLaw: Open-source legal large language model with integrated external knowledge bases, arXiv preprint arXiv:2306.16092 (2023).
[28] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/v1/n19-1423.
[29] T. Jung, J.-K. Kim, S. Lee, D. Kang, Cluster-guided label generation in extreme multi-label classification, in: EACL 2023, 2023.
[30] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2017.
[31] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, C. Raffel, Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022. arXiv:2205.05638.
[32] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/2106.09685.