Extreme Classification of European Union Law Documents Driven by Entity Embeddings

Irene Benedetto¹,²,*, Luca Cagliero¹ and Francesco Tarasconi²

¹ Politecnico di Torino, Dipartimento di Automatica e Informatica, Corso Duca degli Abruzzi 24, 10129 Torino, Italy
² MAIZE, Via San Quintino 31, 10121 Torino, Italy

Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 Joint Conference (March 25-28, 2024), Paestum, Italy
* Corresponding author.
irene.benedetto@{polito.it,maize.io} (I. Benedetto); luca.cagliero@polito.it (L. Cagliero); francesco.tarasconi@maize.io (F. Tarasconi)
ORCID: 0000-0001-7086-7898 (I. Benedetto); 0000-0002-7185-5247 (L. Cagliero)
© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Abstract
Extreme Multi-label Classification (XMC) is the task of labeling documents with one or more labels from a large set of classes. In the context of Legal Artificial Intelligence, XMC is relevant to the automatic categorization of documents, as they commonly address several orthogonal categorization schemes. Since retrieving a sufficient number of training document examples per class is challenging, XMC models are expected to be particularly effective in zero-shot learning scenarios. Existing approaches rely on transformer-based classification models, which leverage the attention mechanism to attend to specific textual units. However, classical attention scores are not able to differentiate between domain-specific and generic textual units. In this paper, we propose a legal entity-aware approach to zero-shot XMC of European Union law documents. By integrating information about domain-specific legal entities, we ease the detection of label-sensitive information and prevent XMC models from attending to irrelevant or wrong text spans. The results achieved on the law documents available in the EURLex benchmark show that our approach is superior to both previous transformer-based approaches and open-source Large Language Models.

Keywords
Legal Artificial Intelligence, Extreme Multi-label Classification, Language Models, Law Documents



1. Introduction

The task of eXtreme Multi-label Classification (XMC) aims at assigning to a given text one or more pertinent labels shortlisted from a very large set of classes. Since some of the target classes are likely to be underrepresented or even absent in the training data, classifiers used for XMC are expected to be particularly effective in zero-shot learning scenarios [1, 2].
   Transformer-based architectures have proven particularly effective in tackling XMC [1] in various application domains such as e-commerce [3], medical diagnosis [4] and legal AI [5]. This paper focuses on solving the XMC task in a particular legal sub-domain, i.e., the automatic classification of law documents.
   Legal documents such as laws have peculiar characteristics that make the classification task inherently complex. Firstly, the vocabulary used is very technical and rich in domain-specific expressions and entities [6]. Secondly, legal documents are likely to have a peculiar structure, making content retrieval and ranking particularly challenging [7]. Lastly, the contained text is often verbose, as it usually contains a lot of preliminaries or repetitions [8].
   Benchmark datasets for law classification such as EURLex [5, 9] contain acts and proposals of the European legislation. To support their retrieval and exploration, law documents are often annotated by Publication Offices with a very large number of labels (e.g., 4,271 labels in EURLex), which encompass frequent labels as well as few- and zero-shot ones. Therefore, automating the process of law document classification requires the use of accurate XMC models.
   In this paper we aim at overcoming the main limitations of existing transformer-based approaches to law document classification (e.g., [10]), which leverage the attention mechanism to attend to the most salient textual units. Since attention scores do not differentiate between legal and general-purpose textual units, the capabilities of transformers to correctly assign law document categories can be limited, particularly in zero-shot learning contexts. To overcome this issue, we propose to adopt an entity-aware attention mechanism based on the LUKE transformer [11], which exploits the semantic characteristics of the domain by means of entity embeddings, to enhance zero-shot classification. The key idea is to mainly consider the textual dependencies with the tokens associated with entities, as they are most likely to be discriminating in law document classification.
   The experiments carried out on the EURLex benchmark dataset [9] confirm the effectiveness of entity embeddings in enhancing zero-shot XMC performance. Notably, the proposed approach not only performs better than existing transformer-based methods but also turns out to be more effective than an open-source Large Language Model with a larger number of parameters, i.e., Llama 2 7B [12].
   The remainder of this work is organized as follows. Section 2 reviews the existing literature, Section 3 describes the methodology, Section 4 presents the main experimental results, whereas Section 5 draws the conclusions of this work.

2. Related work

Legal document classification. The most common case of document classification in the legal domain is the automatic categorization of court cases, where the goal is to predict the law area of the given case. Existing related works mainly focused on employing machine learning and deep learning solutions [13, 14, 15, 16]. Parallel studies have delved into the automatic text classification of legislation to discern the law topic, with a particular emphasis on monolingual datasets [10, 17, 18, 19, 20, 21, 22]. A more limited body of work has explored multi-lingual datasets of legislations [9]. Specifically, the work presented in [21] investigates the semantic relationship between each document and its labels. However, their performance on English documents is limited.



Conversely, the transformer-based approaches proposed in [9, 10, 18] are, to the best of our knowledge, state-of-the-art on English-written law documents. Unlike [9, 10, 18], our work focuses on leveraging entity information in law classification. To the best of our knowledge, the idea to boost the performance of transformer-based approaches to law document classification using entity embeddings has not been addressed in the literature so far.

Transformers in Legal Artificial Intelligence. Transformer-based models have demonstrated promising results in several areas of legal AI. In particular, pre-trained language models have proved to be effective in tackling various downstream tasks [18, 23, 24], which encompass legal entity recognition [6], legal question answering [7], and legal document summarization [8].
   Language Models have been designed and fine-tuned for the legal domain as well, mainly on Chinese documents. For example, LaWGPT [25] is pre-trained on a large-scale Chinese legal text database. Lawyer LLaMA [26] is a Chinese Legal Large Language Model (LLM) that undergoes training on a substantial legal dataset. This model is capable of offering legal advice, analyzing legal cases, and generating legal articles. ChatLaw [27] comprises a collection of open-source legal LLMs in Chinese, including models like ChatLaw-13B and ChatLaw-33B. These models are trained on a vast dataset encompassing legal news, forums, and judicial interpretations. Existing legal LLMs are suited to Chinese documents only and are not specifically designed to tackle the eXtreme Multi-label Classification task.
3. Methodology

In this section, we describe the proposed methodology for eXtreme Multi-label Classification (XMC) of law documents. Our purpose is to tackle XMC in a zero-shot setting, i.e., in the absence of ad hoc training examples. To address this issue, we propose to recognize and exploit entity embeddings in the document text. Specifically, we leverage the pre-trained LUKE model [11] for the classification task, replacing the original classification layer with one trained from scratch on the benchmark dataset. LUKE is a pre-trained contextualized representation of words and entities based on the transformer architecture. It produces contextualized representations of both words and entities thanks to its entity-aware self-attention mechanism, an extension of the self-attention mechanism used when computing attention scores.
   Given a sequence of input vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$, where $\mathbf{x}_i \in \mathbb{R}^D$, the attention score $e_{ij}$ is computed as follows:

$$
e_{ij} =
\begin{cases}
\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}\mathbf{x}_i, & \text{if both } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are words} \\
\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}_{w2e}\mathbf{x}_i, & \text{if } \mathbf{x}_i \text{ is a word and } \mathbf{x}_j \text{ is an entity} \\
\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}_{e2w}\mathbf{x}_i, & \text{if } \mathbf{x}_i \text{ is an entity and } \mathbf{x}_j \text{ is a word} \\
\mathbf{K}\mathbf{x}_j^\top \mathbf{Q}_{e2e}\mathbf{x}_i, & \text{if both } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are entities}
\end{cases}
$$

where $\mathbf{Q}_{w2e}, \mathbf{Q}_{e2w}, \mathbf{Q}_{e2e} \in \mathbb{R}^{L \times D}$ are query matrices and $\mathbf{K} \in \mathbb{R}^{L \times D}$ is the key matrix.
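For illustration, the following is a minimal PyTorch sketch of the score computation above for a single attention head; the boolean is_entity mask encoding token types and all tensor names are our own assumptions, and the actual LUKE implementation [11] additionally handles multiple heads and score scaling.

```python
import torch

def entity_aware_scores(x, is_entity, K, Q, Q_w2e, Q_e2w, Q_e2e):
    """Compute e_ij for all token pairs, picking the query matrix by the
    word/entity type of token i (query) and token j (key).
    x: (n, D) input vectors; is_entity: (n,) bool; matrices: (L, D)."""
    keys = x @ K.T                        # (n, L): K x_j for every j
    # One query projection per (i, j) type combination.
    q_ww = x @ Q.T                        # both i and j are words
    q_we = x @ Q_w2e.T                    # i is a word, j is an entity
    q_ew = x @ Q_e2w.T                    # i is an entity, j is a word
    q_ee = x @ Q_e2e.T                    # both i and j are entities
    e_i = is_entity.unsqueeze(1)          # (n, 1): type of token i
    e_j = is_entity.unsqueeze(0)          # (1, n): type of token j
    # scores[i, j] = (K x_j)^T (Q_* x_i), with Q_* chosen per pair type.
    return torch.where(
        e_i & e_j, q_ee @ keys.T,
        torch.where(e_i & ~e_j, q_ew @ keys.T,
                    torch.where(~e_i & e_j, q_we @ keys.T, q_ww @ keys.T)))
```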
4. Experiments

Dataset. In our experiments, we consider the English portion of the EURLEX dataset [9], a multi-label legal document classification dataset. It consists of 65k European Union (EU) laws annotated with EUROVOC taxonomy labels. The EUROVOC taxonomy is a multilingual classification and thesaurus system used by the European Union. This tool is designed to organize and categorize the concepts and terms used in official EU documents, facilitating research and access to information. Each European act in the EURLEX dataset is associated with one or more EUROVOC concepts. Similarly to [9], we focus on third-level labels. For training and testing our models, we follow the dataset split provided by the respective authors.
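For reference, a minimal sketch of loading the English portion and binarizing its label sets for multi-label training is shown below; the multi_eurlex Hub identifier and the text/labels field names are assumptions based on the dataset released with [9], not details stated in this paper.

```python
from datasets import load_dataset
from sklearn.preprocessing import MultiLabelBinarizer

# Assumed Hugging Face Hub identifier for the dataset of [9].
data = load_dataset("multi_eurlex", "en")

# Turn each document's list of EUROVOC labels into a binary indicator
# vector (one column per label), as required by multi-label losses.
mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(data["train"]["labels"])
y_test = mlb.transform(data["test"]["labels"])
train_texts = data["train"]["text"]
test_texts = data["test"]["text"]
print(len(mlb.classes_), "distinct labels")
```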
Competitors. We compare our methodology with the following (a minimal sketch of the first baseline is shown after the list):

    • Logistic Regression: a baseline consisting of a Term Frequency-Inverse Document Frequency (TF-IDF) encoder, counting both local and global frequencies of occurrence of the input tokens, and a logistic regression model trained on top of the encoded text.
    • RoBERTa [9]: builds on BERT [28] by removing the next-sentence pre-training objective and training with much larger mini-batches and learning rates.
    • LLama 2 7B [12]: a pre-trained Large Language Model with approximately 7 billion parameters that showcases remarkable performance in both few-shot and zero-shot scenarios. Analogously to [29], to compare with LLMs we treated the XMC task as a generative problem.
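A minimal sketch of the first baseline, reusing the hypothetical train_texts/y_train variables from the loading sketch above; the vocabulary size and one-vs-rest decomposition are our own choices, not settings reported in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

# TF-IDF weighs tokens by local (term) and global (inverse document)
# frequency; one independent logistic regression per label handles the
# multi-label setting.
baseline = make_pipeline(
    TfidfVectorizer(max_features=50_000),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
baseline.fit(train_texts, y_train)            # y_train: binary label matrix
scores = baseline.predict_proba(test_texts)   # confidences used to rank labels
```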
Experimental setting. We fine-tuned the base version of the LUKE model (studio-ousia/luke-base) for 10 epochs. The model was trained with the AdamW optimizer [30], a weight decay of 0.01, and a learning rate of 1e-5. During training, we applied dropout with probability 0.1 on the classification layer.
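A minimal sketch of this setup is reported below; the plain linear head on LUKE's pooled output and the binary cross-entropy objective are our assumptions, since the paper only specifies the checkpoint, optimizer, learning rate, and dropout.

```python
import torch
from transformers import LukeModel

class LukeForXMC(torch.nn.Module):
    """LUKE encoder with a freshly initialized multi-label head."""
    def __init__(self, num_labels, dropout=0.1):
        super().__init__()
        self.encoder = LukeModel.from_pretrained("studio-ousia/luke-base")
        self.dropout = torch.nn.Dropout(dropout)       # 0.1, as in the paper
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, **inputs):
        pooled = self.encoder(**inputs).pooler_output  # sequence-level summary
        return self.head(self.dropout(pooled))         # one logit per label

model = LukeForXMC(num_labels=4271)                    # EURLex label count
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
loss_fn = torch.nn.BCEWithLogitsLoss()                 # multi-label objective
```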
For the sake of fairness, LLama 2 7B has been trained with Parameter-Efficient Fine-Tuning (PEFT) [31], specifically LoRA [32], which freezes the pre-trained model weights and introduces trainable rank decomposition matrices into each layer of the model architecture. We trained the 8-bit quantized version of this model for a maximum of 3 epochs, with a learning rate of 1.4e-5, LoRA α = 16, and r = 64.
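A sketch of the corresponding configuration, assuming the Hugging Face peft and bitsandbytes libraries and the meta-llama/Llama-2-7b-hf checkpoint (the exact checkpoint and target modules are not stated in the paper):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# 8-bit quantized base model, as described above.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", load_in_8bit=True)

# LoRA freezes the base weights and learns rank-r update matrices.
config = LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(base, config)
model.print_trainable_parameters()   # only the LoRA matrices are trainable
```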
Metrics. Here we describe the metrics used to evaluate the performance of the models in our study (a sketch of their computation follows the list).

    • R@5 and P@5: precision and recall at $k$ predictions, where $k$ is equal to 5 in our dataset; this corresponds to the mean number of labels in the training set:

      $$\mathrm{Precision@}k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k}, \qquad \mathrm{Recall@}k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FN}_k}$$

    • mRP: for each document, the metric ranks the labels selected by the model by decreasing confidence, computes Precision@$k$, where $k$ is the document's number of gold labels, and then averages the results over documents.
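A small NumPy sketch of these metrics, assuming per-document confidence scores; variable names are illustrative.

```python
import numpy as np

def precision_recall_at_k(y_true, scores, k=5):
    """Average P@k and R@k over documents. y_true: (n_docs, n_labels)
    binary matrix; scores: (n_docs, n_labels) model confidences."""
    topk = np.argsort(-scores, axis=1)[:, :k]           # k best labels per doc
    hits = np.take_along_axis(y_true, topk, axis=1).sum(axis=1)  # TP_k per doc
    precision = (hits / k).mean()
    recall = (hits / np.maximum(y_true.sum(axis=1), 1)).mean()
    return precision, recall

def mean_r_precision(y_true, scores):
    """mRP: Precision@k with k set to each document's gold-label count."""
    values = []
    for gold, conf in zip(y_true, scores):
        k = int(gold.sum())
        if k == 0:
            continue                                    # skip unlabeled docs
        values.append(gold[np.argsort(-conf)[:k]].sum() / k)
    return float(np.mean(values))
```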
Hardware. We conducted all the experiments on a single NVidia® Tesla® V100 GPU with 16 GB of memory, running on Ubuntu 22.04 LTS.

4.1. Results

Performance comparison with different training strategies. We conducted experiments with different training procedures in order to test the performance of the proposed methodology and to compare it with that of different architectures. To this end, we first froze the first 9 attention blocks and fine-tuned the classification layer to test the goodness of the hidden representations of our model (a sketch of this strategy is shown below). Secondly, we performed an end-to-end evaluation of the proposed model to fully assess its potential.
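A sketch of the freezing strategy, applied to the LukeForXMC wrapper defined earlier; the attribute paths follow the Hugging Face LUKE implementation and should be treated as assumptions.

```python
# Freeze the embeddings and the first 9 encoder blocks of the wrapped
# LUKE model, leaving the remaining blocks and the new head trainable.
for param in model.encoder.embeddings.parameters():
    param.requires_grad = False
for block in model.encoder.encoder.layer[:9]:
    for param in block.parameters():
        param.requires_grad = False
```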
Table 1
Models comparison

Models                                          mRP
Logistic Regression                             0.21
State-of-the-art [9] (first 9 blocks frozen)    0.27
Our approach (first 9 blocks frozen)            0.33
State-of-the-art [9] (end-to-end training)      0.67
Our approach (end-to-end training)              0.68
LLama 2 7B [12]                                 0.65
   Table 1 reports the overall performance of our model under different training strategies. Our results show that the proposed method performs better than both the state-of-the-art model and the Large Language Model Llama 2. Notably, the model attains superior performance compared to the state-of-the-art competitor even when the first 9 attention blocks are kept fixed. This suggests the efficacy of our model in generating highly informative hidden representations that enhance the classification task.
                                                                              with the state-of-the-art model ranking, for class
Zero-shot performance comparison. We conducted a comparative analysis of the performance of our model and of its competitors on zero-shot labels (i.e., labels not present in the training set). In this case, we trained all models without freezing any model layers.
   We report the results in Table 2 in terms of Precision@5 and Recall@5. Our evaluation focuses on the model's ability to retrieve all relevant results without any knowledge about the labels. The number of predictions considered is always five, in compliance with [20].

Table 2
Comparison in zero-shot learning context

                        R@5      P@5
Logistic Regression     0.001    0.001
State-of-the-art [9]    0.028    0.006
Our approach            0.087    0.164
LLama 2 7B [12]         0.253    0.056
   Our results indicate that the baseline model performs poorly in this zero-shot learning context, with very low scores for both Precision@5 and Recall@5. The state-of-the-art model exhibits slightly higher scores, but still performs worse than the model proposed in this work. The proposed method achieves significantly higher Precision@5 and Recall@5 scores, indicating its superiority over the other two models in this zero-shot learning context. These results demonstrate the accuracy of our proposed model and the completeness of the model's predictions. Interestingly, LLMs demonstrate superior Recall@5 performance, even though their overall results are worse.
Comparison between models' attention. To further support the efficacy of the entity-aware self-attention mechanism for the given task, we examine the attention scores obtained by the best overall models according to the results in Table 1. For each class, we compute the mean token attention scores assigned by the state-of-the-art and LUKE models, considering the last attention layer¹. We sorted the results in decreasing order, ranking the tokens according to the attention given by the model. Then, separately for each class $c \in C$, the Mean Reciprocal Rank (MRR) of model $m_i$ with the $k$ most frequent tokens of class $c$ was computed, i.e.:

$$\mathrm{MRR}_{m_i,c,k} = \mathrm{MRR}(R_{a(m_i)}, k_c) \qquad (1)$$

where $R_{a(m_i)}$ is the attention ranking position, according to model $m_i$, of the $k$ most frequent tokens of class $c \in C$.
   We then compute the MRR difference between our model and the state-of-the-art model for different values of $k$:

$$\mathrm{MRR}_k = \frac{1}{|C|} \sum_{c \in C} \left( \mathrm{MRR}_{\mathrm{LUKE},c,k} - \mathrm{MRR}_{\mathrm{SOTA},c,k} \right) \qquad (2)$$

where:

    • $\mathrm{MRR}_{\mathrm{LUKE},c,k}$ is the Mean Reciprocal Rank computed with the LUKE model ranking, for class $c \in C$, considering the $k$ most frequent terms;
    • $\mathrm{MRR}_{\mathrm{SOTA},c,k}$ is the Mean Reciprocal Rank computed with the state-of-the-art model ranking, for class $c \in C$, considering the $k$ most frequent terms.

These values are reported in Figures 1 and 2, which consider the frequent and zero-shot labels, respectively.
   Scores above zero indicate that, on average, our model gives more attention to the most frequent terms of the classes. These results reveal that our model pays more attention to the terms that appear most frequently in each class, especially for zero-shot labels, although the differences decrease as $k$ increases (a sketch of this computation is shown below).
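A sketch of the computation of Equations (1) and (2); the attention-based token rankings and the per-class frequent-token lists are assumed to be precomputed.

```python
import numpy as np

def mrr(ranking, frequent_tokens):
    """Mean reciprocal rank of a class's frequent tokens within a model's
    attention ranking (tokens sorted by decreasing attention score)."""
    position = {tok: i + 1 for i, tok in enumerate(ranking)}
    return np.mean([1.0 / position[t] for t in frequent_tokens if t in position])

def mrr_difference(luke_rank, sota_rank, top_tokens, k):
    """Equation (2): MRR difference averaged over all classes, using the
    k most frequent tokens of each class c."""
    diffs = [mrr(luke_rank[c], toks[:k]) - mrr(sota_rank[c], toks[:k])
             for c, toks in top_tokens.items()]
    return float(np.mean(diffs))
```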
Figure 1: Comparison of Mean Reciprocal Rank (MRR) differences in token attention scores between our proposed model and the state-of-the-art model for various values of $k$, computed considering frequent labels. Positive scores indicate that our model assigns higher attention to the most frequent terms of each class.

¹ We consider the last attention head because it is the closest to the classification layer.
Figure 2: Comparison of Mean Reciprocal Rank (MRR) differences in token attention scores between our proposed model and the state-of-the-art model for various values of $k$, computed considering only zero-shot labels. Positive scores indicate that our model assigns higher attention to the most frequent terms of each class.
5. Conclusion and future work

In this paper we explored the use of an entity-aware attention-based method for the eXtreme Multi-label Classification of law documents. We showed that attending to entity-related tokens enhances the capability of the transformer to attend to class-related pieces of text. The proposed method shows performance superior to both state-of-the-art transformers and Large Language Models, achieving higher precision and recall scores, especially in the most challenging zero-shot learning context. The experiments also highlight the impact of different training strategies and the effectiveness of the proposed model in generating informative hidden representations.
   Based on these preliminary results, we envision the following future research directions:

    • Cross-lingual Transfer: we plan to study the models' performance in the zero-shot cross-lingual transfer scenario for legal text classification in languages other than English.
    • LLM Fine-tuning Strategies: another line of research will be the exploration of additional LLM fine-tuning strategies that incorporate hierarchical clustering [29].
Acknowledgments

The research leading to these results has been partially supported by the SmartData@PoliTO Center for Big Data Technologies. This study was partially carried out within the MICS (Made in Italy – Circular and Sustainable) Extended Partnership and received funding from Next-GenerationEU (Italian PNRR – M4 C2, Invest 1.3 – D.D. 1551.11-10-2022, PE00000004) and within the FAIR - Future Artificial Intelligence Research - and received funding from the European Union Next-GenerationEU (PNRR MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.3 D.D. 1555 11/10/2022, PE00000013). This paper reflects only the authors' views and opinions; neither the European Union nor the European Commission can be considered responsible for them.
References

[1] H.-F. Yu, K. Zhong, I. S. Dhillon, W.-C. Wang, Y. Yang, X-bert: extreme multi-label text classification using bidirectional encoder representations from transformers, in: NeurIPS 2019 Workshop on Science Meets Engineering of Deep Learning, 2019.
[2] W. Chang, H. Yu, K. Zhong, Y. Yang, I. S. Dhillon, A modular deep learning approach for extreme multi-label text classification, CoRR abs/1905.02331 (2019). URL: http://arxiv.org/abs/1905.02331. arXiv:1905.02331.
[3] R. Agrawal, A. Gupta, Y. Prabhu, M. Varma, Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages, in: Proceedings of the 22nd International Conference on World Wide Web, WWW '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 13–24. URL: https://doi.org/10.1145/2488388.2488391. doi:10.1145/2488388.2488391.
[4] A. Johnson, T. Pollard, L. Shen, L.-w. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Celi, R. Mark, Mimic-iii, a freely accessible critical care database, Scientific Data 3 (2016) 160035. doi:10.1038/sdata.2016.35.
[5] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-scale multi-label text classification on EU legislation, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 6314–6322. URL: https://aclanthology.org/P19-1636. doi:10.18653/v1/P19-1636.
[6] I. Angelidis, I. Chalkidis, M. Koubarakis, Named entity recognition, linking and generation for greek legislation, in: JURIX, 2018.
[7] D. Hendrycks, C. Burns, A. Chen, S. Ball, CUAD: an expert-annotated NLP dataset for legal contract review, CoRR abs/2103.06268 (2021). URL: https://arxiv.org/abs/2103.06268. arXiv:2103.06268.
[8] D. Jain, M. D. Borah, A. Biswas, Summarization of legal documents: Where are we now and the way forward, Computer Science Review 40 (2021) 100388. URL: https://www.sciencedirect.com/science/article/pii/S1574013721000289. doi:10.1016/j.cosrev.2021.100388.
[9] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, Multieurlex – a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, 2021. URL: https://arxiv.org/abs/2109.00904. doi:10.48550/ARXIV.2109.00904.
[10] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, in: Proceedings of the Natural Legal Language Processing Workshop 2019, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 78–87. URL: https://aclanthology.org/W19-2209. doi:10.18653/v1/W19-2209.
[11] I. Yamada, A. Asai, H. Shindo, H. Takeda, Y. Matsumoto, Luke: Deep contextualized entity representations with entity-aware self-attention, in: EMNLP, 2020.
[12] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[13] O. Sulea, M. Zampieri, S. Malmasi, M. Vela, L. P. Dinu, J. van Genabith, Exploring the use of text classification in the legal domain, CoRR abs/1710.09306 (2017). URL: http://arxiv.org/abs/1710.09306. arXiv:1710.09306.
[14] J. Gao, H. Ning, Z. Han, L. Kong, H. Qi, Legal text classification model based on text statistical features and deep semantic features, in: P. M. 0001, T. M. 0001, P. Majumder, M. Mitra (Eds.), Working Notes of FIRE 2020 - Forum for Information Retrieval Evaluation, Hyderabad, India, December 16-20, 2020, volume 2826 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 35–41. URL: http://ceur-ws.org/Vol-2826/T1-7.pdf.
[15] H. Chen, L. Wu, J. Chen, W. Lu, J. Ding, A comparative study of automated legal text classification using random forests and deep learning, Information Processing & Management 59 (2022) 102798. URL: https://www.sciencedirect.com/science/article/pii/S0306457321002764. doi:10.1016/j.ipm.2021.102798.
[16] A. Aguiar, R. Silveira, V. Pinheiro, V. Furtado, J. A. Neto, Text classification in legal documents extracted from lawsuits in brazilian courts, in: A. Britto, K. Valdivia Delgado (Eds.), Intelligent Systems, Springer International Publishing, Cham, 2021, pp. 586–600.
[17] E. Loza Mencía, J. Fürnkranz, Efficient Multilabel Classification Algorithms for Large-Scale Problems in the Legal Domain, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 192–215.
[18] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. URL: https://aclanthology.org/2020.findings-emnlp.261. doi:10.18653/v1/2020.findings-emnlp.261.
[19] C. Papaloukas, I. Chalkidis, K. Athinaios, D. Pantazi, M. Koubarakis, Multi-granular legal topic classification on greek legislation, CoRR abs/2109.15298 (2021). URL: https://arxiv.org/abs/2109.15298. arXiv:2109.15298.
[20] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, Multieurlex - A multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, CoRR abs/2109.00904 (2021). URL: https://arxiv.org/abs/2109.00904. arXiv:2109.00904.
[21] X. Huang, B. Chen, L. Xiao, L. Jing, Label-aware document representation via hybrid attention for extreme multi-label text classification, CoRR abs/1905.10070 (2019). URL: http://arxiv.org/abs/1905.10070. arXiv:1905.10070.
[22] W. Zhao, H. Peng, S. Eger, E. Cambria, M. Yang, Towards scalable and reliable capsule networks for challenging NLP applications, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1549–1559. URL: https://aclanthology.org/P19-1150. doi:10.18653/v1/P19-1150.
[23] P. Henderson, M. S. Krass, L. Zheng, N. Guha, C. D. Manning, D. Jurafsky, D. E. Ho, Pile of law: Learning responsible data filtering from the law and a 256GB open-source legal dataset, in: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022. URL: https://openreview.net/forum?id=3HCT3xfNm9r.
[24] S. Paul, A. Mandal, P. Goyal, S. Ghosh, Pre-training transformers on indian legal text, arXiv preprint arXiv:2209.06049 (2022). URL: https://arxiv.org/abs/2209.06049.
[25] H. Nguyen, A brief report on lawgpt 1.0: A virtual legal assistant based on gpt-3, arXiv preprint arXiv:2302.05729 (2023).
[26] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen, Z. Wu, Y. Feng, Lawyer llama technical report, arXiv preprint arXiv:2305.15062 (2023).
[27] J. Cui, Z. Li, Y. Yan, B. Chen, L. Yuan, Chatlaw: Open-source legal large language model with integrated external knowledge bases, arXiv preprint arXiv:2306.16092 (2023).
[28] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. URL: https://doi.org/10.18653/v1/n19-1423. doi:10.18653/v1/n19-1423.
[29] T. Jung, J.-K. Kim, S. Lee, D. Kang, Cluster-guided label generation in extreme multi-label classification, in: EACL 2023, 2023.
[30] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations, 2017.
[31] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, C. Raffel, Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022. arXiv:2205.05638.
[32] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, Lora: Low-rank adaptation of large language models, CoRR abs/2106.09685 (2021). URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.