Multi-Label Classification of Bills from the Italian Senate

Andrea De Angelis¹, Vincenzo di Cicco¹, Giovanni Lalle², Carlo Marchetti² and Paolo Merialdo¹
¹ Roma Tre University, Italy
² Senato della Repubblica, Italy

Abstract
The classification of legal texts is usually carried out by domain experts employed by the institutions. The classification process is very complex because the reference thesauri are very rich, both in the variety of concepts they cover and in the number of labels they contain. In addition, they often contain labels that are very rarely used. In this paper we show how to implement a Machine Learning system that can support the domain experts of the Italian Senate, handling infrequently used labels (Zero/Few-shot classification) and making the output of the model explainable to humans.

Keywords
Legal texts classification, Zero/Few-shot classification, Explainability

1. Introduction

A relevant problem faced by legislative Institutions and by Parliaments is the organization of legal texts in ways that foster their accessibility and prompt consultation by MPs, domain experts, journalists, researchers, and citizens throughout all legislative and scrutiny activities. One of the key strategies adopted in this context is the classification of acts, and sometimes of their parts, according to some pre-defined Thesaurus. Classification, when available, enables users to access useful functionality such as topic search and topic-based browsing.

Classification is generally performed by legal domain experts, who read each text and associate Thesaurus labels with it; however, solutions have been developed that enable the task to be semi-automated using Machine Learning (ML) techniques. Known state-of-the-art approaches are generally specific to one language [5] [10], although there is one study that can support several languages with the same architecture [1]. Also, not all nations or institutions necessarily use the same reference thesaurus.
Despite these differences, the thesauri generally share some common features:
• they contain a large number of labels (generally thousands);
• labels represent a wide variety of topics (e.g., politics, transportation, medicine, etc.);
• most labels are used rarely, or even never.

AIxPA 2022: 1st Workshop on AI for Public Administration, December 2nd, 2022, Udine, IT
and.deangelis@hotmail.com (A. De Angelis); vdicicco@os.uniroma3.it (V. di Cicco); giovanni.lalle@senato.it (G. Lalle); carlo.marchetti@senato.it (C. Marchetti); paolo.merialdo@uniroma3.it (P. Merialdo)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

This paper summarizes the activities and results of a successful project aimed at developing an ML-based system that supports experts of the Italian Senate during the classification of legislative proposals (bills) using the labels contained in TESEO,¹ i.e., the official thesaurus designed and maintained by the Italian Senate since 1992. The system developed to support this time-consuming and skill-demanding process has been designed to satisfy the following requirements and constraints:
• Human-in-the-loop. Experts must always be in full control of the labels selected for classifying a text, in order to avoid misclassifications. The system must limit itself to proposing labels, which must then be selected (or discarded) by users. The ultimate action of determining the set of labels to be associated with a given text must always be performed by humans.
• Explainable. The system must provide easy explanations of proposed labels to domain experts, in order to allow them to quickly assess and evaluate every suggestion.
• Zero/Few Shot Classification.
TESEO is a huge thesaurus comprising thousands of labels whose rate of use varies enormously: some have been associated with thousands of texts, while others have never been used. As a consequence, the system must be able to handle labels for which few, and sometimes even zero, training examples are available.
• Upgradable. Bills and laws deal with a highly dynamic range of topics, continuously absorbing new concepts and themes (e.g., the COVID-19 pandemic). The system must therefore embed methods to reliably update its knowledge base from external resources, in addition to its annotated data.

The contribution of our work can be summarized as follows:
• we report our experience in building a bill classification system designed to satisfy the above requirements;
• we publicly release a dataset of Italian bills across Italian Parliament terms XIII-XVIII spanning years 1999-2022, each one annotated with relevant labels coming from TESEO.

The remainder of this paper is organised as follows: Section 2 summarizes related work; Section 3 describes the dataset we release and the TESEO thesaurus adopted for the classification; Section 4 deals with the overall architecture deployed; Section 5 presents experimental results, demonstrating the effectiveness of the proposed solution.

2. Related work

Applications in LegalAI have been studied intensively, with major contributions in recent years. Zhong et al. provide a survey, categorizing techniques and tasks [16]. They also report some of the main challenges in the implementation of a LegalAI system, among which they mention the importance of having an interpretable model to prevent fairness from being compromised.

Several studies address the problem of classifying law texts. Chalkidis et al.
study the Large-Scale Multi-Label Text Classification (LMTC) problem in the legal field [5], with the release of a dataset, EURLEX57k, consisting of English legal documents from different European legislatures. They also study the problem of classifying such documents using labels from EUROVOC, the European Union's multilingual thesaurus. The main problem they face concerns the size of the thesaurus, which contains thousands of labels, and the distribution of those labels over legal documents: out of 7k labels, only about 4k have at least one annotated example available; of these, only 52% are used at least 10 times, making the problem challenging from an ML perspective. They test several Deep Learning solutions, including one that uses Zero/Few-shot classification techniques to attempt to address the label distribution problem. Later, the same authors delved into the problem of law text classification with a focus on rare label classification, proposing some ad-hoc models for this setup [4]. These approaches meet our requirements about Zero/Few-shot classification and explainable models, but unfortunately they are designed specifically, and only, for the English language.

¹ https://www.senato.it/tesauro/teseo.html

Avram et al. have proposed PyEurovoc, a tool capable of handling EUROVOC classification in 22 different languages, including Italian [1]. Their solution is based on the use of BERT [8]. This work shows that BERT achieves good performance in several classification tasks, but does not perform well on Zero-shot classification. Also, BERT models are difficult to interpret and costly to update, making BERT-based approaches unsuitable for our requirements.

Papaloukas et al. address the problem of classifying legal texts written in Greek [10], a language for which there were few ready-to-use resources available.
They release a dataset containing legislative documents annotated with thematic topics and experiment with different Deep Learning solutions for the task of multi-class legal topic classification.

Other authors study the classification of law texts, delving into a particular language. In 2005 Bartolini et al. presented SALEM, a tool using NLP techniques to assign a type to each law article and to tag parts of the article with entities in the legal world [2]. They use a very small thesaurus consisting of 8 classes representing the types of provisions in the legal text (e.g., whether it is an amendment to a previous text, whether it introduces an obligation that someone must comply with, etc.).

Finally, with a focus on the Italian language but on a different task from law text classification, in 2021 Tagarelli & Simeri presented LamBERTa, a novel BERT-based language understanding framework for finding articles of interest in a legal corpus (the Italian Civil Code) in response to a query expressed in natural language [11].

3. Thesaurus and Bills Corpus

In this section, we first describe the characteristics of TESEO, the thesaurus of the Italian Senate. Then, we illustrate the bills corpus and how we have organized it into a suitable dataset.

3.1. TESEO: the Thesaurus of the Italian Senate

TESEO is a thesaurus created by the Italian Senate to classify bills. The number of labels may vary over time, as some labels may be added while others may be deprecated and therefore no longer applicable. At the time of writing, TESEO contains 3,398 labels, of which 100 have been deprecated over time.

Labels are ordered according to the logical structure of the Universal Decimal Classification² and are organized in a hierarchy, which aims at modeling a wide variety of concepts, such as sports, medicine, art, penal and civil law, and so on.

Table 1: Dataset characteristics.
articles    avg tokens per article    labels    avg labels per article
28,616      232.38                    2,556     2.73

3.2. The Bills Corpus

Bills of the Italian Legislature are publicly available on the Web.³ However, they are not released in a standard, structured format: depending on the legislature, they are either in HTML or XML, with a loose structure that makes it tricky to extract portions of interest such as the title, the articles and the associated TESEO labels. For this reason, the first effort of our work was to organize the corpus of bills into a CSV dataset, which is publicly available online.⁴ The dataset contains the texts of the articles of the bills from legislatures XIII-XVIII, which were obtained by extraction from XML files, if available, or by web scraping from HTML pages.

Figure 1: Characteristics of the Corpus of Bills. (a) Tokens distribution. (b) Labels distribution. (c) Labels distribution per category.

² https://udcc.org/index.php/site/page?view=about
³ For example, from the websites of the Italian Senate (https://www.senato.it/ric/sddl/nuovaricerca.do?params.legislatura=18), of the Chamber of Deputies (https://www.camera.it/leg18/76), and of the Official Gazette (https://www.gazzettaufficiale.it/).
⁴ https://github.com/SenatoDellaRepubblica/MultiLabelBillClassification

Table 1 reports the main features of the dataset: the number of available articles, the average number of tokens per article, the total number of unique labels associated with the articles, and the average number of labels per article. Figure 1a shows the distribution of tokens; it is worth observing that the range is very wide, varying between a minimum of 2 tokens and a maximum of ∼75k; the latter are outliers that were excluded from the experimental phase. Figure 1b shows the distribution of labels over articles. Observe that most of the labels are used only a few times: there are more than 600 labels that were used only once, while fewer than 10 labels were used more than 1000 times. In addition, there are about 800 labels from TESEO (25% of the total) that have never been used to classify an article.
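Because label usage is so skewed, it is convenient to group labels by how often they occur in the training split. The following is a minimal sketch of such a grouping, using the thresholds adopted in this work (more than 50 training occurrences for frequent, between 1 and 50 for few, none for zero). The function name `group_labels`, its input layout, and the toy labels are illustrative assumptions, not part of the released dataset or code.

```python
from collections import Counter


def group_labels(train_labels, heldout_labels, frequent_over=50):
    """Split labels into frequent / few / zero groups.

    train_labels, heldout_labels: lists of label lists, one per article
    (hypothetical layout; heldout = validation + test articles).
    """
    train_counts = Counter(label for labels in train_labels for label in labels)
    heldout = {label for labels in heldout_labels for label in labels}

    groups = {"frequent": set(), "few": set(), "zero": set()}
    for label, count in train_counts.items():
        # Every counted label occurs at least once in the training split.
        groups["frequent" if count > frequent_over else "few"].add(label)
    # Zero labels never occur in training but do occur in the held-out data.
    groups["zero"] = heldout - set(train_counts)
    return groups


# Toy example: "A" is used 51 times, "B" twice, "C" once, "D" only in held-out data.
train = [["A"]] * 51 + [["B"], ["B", "C"]]
heldout = [["A", "D"]]
groups = group_labels(train, heldout)
```

Labels of the thesaurus that never occur in any split are left out of all three groups, since they cannot contribute to training or evaluation.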
Following [5], we used three categories to define labels: (i) frequent, if they occur more than 50 times in the training set; (ii) few, if they occur at least once and at most 50 times in the training set; (iii) zero, if they never occur in the training set but occur at least once in the validation or test set. Figure 1c shows the distribution of labels over categories.

4. System Architecture

The goal of the system is to help domain experts in the classification of bill texts. Given the size of TESEO and the challenges of manually navigating its hierarchy, the proposed solution aims to suggest a shortlist of the most relevant labels for a selected article of a bill. Domain experts can interact with the shortlist in a human-in-the-loop fashion, through three different actions: (i) confirming a label, that is, moving a suggested label from the shortlist to the final set of labels; (ii) requiring an explanation, that is, asking for evidence in the text for the suggested label; (iii) adding a label, that is, manually integrating missing labels by searching through TESEO. A screenshot of the overall system can be seen in Figure 2.

Figure 2: Screenshot of the system processing the Senate bill n.2495 (https://www.senato.it/leg/18/BGT/Schede/Ddliter/54699.htm). In this example, the two suggested labels ALBI ELENCHI E REGISTRI and PUBBLICITA' DI ATTI E DOCUMENTI are confirmed, while the remaining shortlist is shown at the bottom. The domain expert can manually add new labels using the TESEO button, or require an explanation of any label by clicking on it. The highlighted keywords show the explanation for the label TELEMATICA.

We tackle the problem of building the shortlist in two steps: first, the system estimates the relevance of each TESEO label through a trained multi-label classifier; then, the labels are ranked and selected to create the final shortlist.

This section describes the components of the system. Section 4.1 describes the model used, explaining why it is suitable for Zero/Few-shot classification and how it enables explainable predictions. Section 4.2 describes two different word embedding systems with which we experimented, and how they can enable regular updates of the system. Finally, Section 4.3 describes how the labels are selected and ranked to create the final shortlist shown to the domain expert.

4.1. Zero/Few-Shot Multi-Label Classifier

We frame the task of estimating the relevance of each TESEO label to a given article as a multi-label classification problem. The main challenges in building such a component lie in handling the large number of TESEO labels despite having few, or even zero, training samples for the majority of them, as described in Section 3.

We build our work on top of ZERO-BIGRU-LWAN, a neural classifier proposed by Chalkidis et al. to handle similar challenges [5]. The main feature of the model is that it learns to classify a main text (e.g., an article text) against the tokens of a short text (e.g., a label descriptor), thus exploiting the semantic meaning of each label. In addition, since a label is a generic short text, a trained model can be used on new, unseen labels, which makes it possible to handle additions to TESEO.

First, the model represents each label by encoding its descriptor (i.e., the short textual content of the label), computing the centroid of the Word Embeddings of its tokens:

$$u_l = \frac{1}{E} \sum_{e=1}^{E} w_{le} \tag{1}$$

where $w_{le}$ is the Word Embedding associated with the $e$-th token of the $l$-th descriptor, and $E$ is the total number of tokens in the descriptor. Then, it processes the text by applying a Bidirectional Gated Recurrent Unit (Bi-GRU) [13, 12] to the text Word Embeddings, producing context-aware representations $h_t$ for each token.
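The label encoding of Equation (1) can be illustrated with a few lines of plain Python. This is only a sketch under stated assumptions: the two-dimensional vectors are made up for readability (the system uses 300-dimensional pre-trained vectors), the helper name `label_centroid` is hypothetical, and tokens without a vector are simply skipped, which is a simplification.

```python
def label_centroid(descriptor_tokens, embeddings):
    """Centroid u_l of the word vectors of a label descriptor, as in Eq. (1).

    embeddings: dict mapping token -> list of floats (toy stand-in for
    pre-trained word vectors). Tokens without a vector are skipped here,
    which is an assumption of this sketch.
    """
    vectors = [embeddings[tok] for tok in descriptor_tokens if tok in embeddings]
    if not vectors:
        raise ValueError("no descriptor token has an embedding")
    dim = len(vectors[0])
    # Average each dimension over the E descriptor tokens.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy 2-dimensional embeddings for a descriptor such as "albi elenchi registri".
toy_embeddings = {
    "albi": [1.0, 0.0],
    "elenchi": [0.0, 1.0],
    "registri": [1.0, 1.0],
}
u_l = label_centroid(["albi", "elenchi"], toy_embeddings)
```

Because the centroid depends only on the descriptor text and on the pre-trained vectors, it can be computed in the same way for a label that was never seen during training.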
Finally, it compares the text and the descriptor content through an attention layer:

$$v_t = \tanh(W h_t + b) \tag{2}$$

$$a_{lt} = \frac{\exp(v_t^\top u_l)}{\sum_{t'} \exp(v_{t'}^\top u_l)} \tag{3}$$

$$d_l = \sum_{t=1}^{T} a_{lt} v_t \tag{4}$$

where $v_t$ represents the context-aware embeddings processed by a feed-forward layer, $a_{lt}$ is the attention score of the $l$-th descriptor and the $t$-th text token, $T$ is the text length, and $d_l$ is the final representation of the text for the $l$-th descriptor. The final descriptor probability, used as relevance score, is computed as:

$$p_l = \mathrm{sigmoid}(u_l^\top d_l) \tag{5}$$

Training ZERO-BIGRU-LWAN requires that Word Embeddings are kept frozen, so that the representation $u_l$ is computed in the same way for seen and unseen (zero-shot) label descriptors. This gives the model a mechanism to handle Zero/Few-shot classification by exploiting the textual content of labels. Moreover, by using only a single attention layer, the model enables a visualization of the most important keywords in the text linked to a label, providing an explanation to assist domain experts. Additionally, Word Embeddings provide a simple and cheap mechanism to regularly upgrade the system by adding new prior knowledge from unlabeled data sources, without expensive pre-training tasks such as those used by BERT.

4.2. Word Embeddings

ZERO-BIGRU-LWAN requires pre-trained Word Embeddings to provide semantic representations for each token of the article and for those inside the label descriptor. In addition, given the wide range of concepts in TESEO and the continuous evolution of legislative texts (for example, consider the word "COVID-19", which appeared from February 2020 onward), choosing which Word Embeddings to use is an opportunity to add useful prior knowledge to the system. Specialized Word Embeddings for the legal domain have been proposed (e.g., Law2Vec [7]), but they require extensive data collection from many legal sources in the same language; also, specialized knowledge does not perform better than general prior knowledge from sources such as Wikipedia [1].
For this reason, our system uses word vectors learned from plain Wikipedia, taking advantage of its extensive repository of knowledge, its availability in many languages, and its monthly dump release, which enables recurring updates.

We experimented with two Word Embeddings models that operate at different levels of word representation: fastText [3] and Wikipedia2Vec [15]. The former represents words by learning sub-word embeddings (i.e., character n-grams) through an extension of the skip-gram model [9], and is thus able to model morphological information. The latter, on the other hand, emphasizes semantic information by learning both word and entity embeddings, using a loss function with three components: (1) a word-based skip-gram model, (2) a knowledge base graph model that learns entity vectors from Wikipedia's hyperlink graph, and (3) an anchor context model that learns to predict words related to a given entity using anchors and their words, encouraging both types of embeddings to lie in the same $d$-dimensional vector space. Our final system uses fastText word vectors, learned by means of the efficient open-source library.⁵

⁵ https://fasttext.cc/

4.3. Shortlist Creation

Each label, with its estimated probability, is further processed by a selection component that outputs the final shortlist shown to the domain expert. We experimented with two selection strategies. First, with a simple top-$k$, which selects the best $k$ labels according to the model; with this strategy we observed low performance for the group of zero labels. Then, with a custom strategy, named Ratio Threshold Strategy (RTS), which attempts to replace unlikely frequent/few labels with promising zero ones, obtaining more balanced results, as we report in Section 5.1.

With RTS we first select the top-$k$ frequent/few labels, creating an initial shortlist $S$. Then, we create a second shortlist $Z$ with the top-$m$ zero labels. Finally, we attempt to replace the tail of $S$ (e.g., $S_k$) with the head of $Z$ (e.g., $Z_0$). We accept a replacement if

$$\frac{p_{S_k}}{p_{Z_0}} \le t,$$

effectively boosting likely zero labels over unlikely few/frequent ones. The replacement continues as long as the condition holds (up to $S_{k-m}$ and $Z_m$). The resulting shortlist $S$, containing $k$ labels of which up to $m$ are zero labels, is then displayed to the domain expert. It is worth observing that we hide label probabilities to eliminate potential biases in the experts' decisions. Through a preliminary user study, we found that $k = 15$, $m = 3$ and $t = 40$ provide a good balance between the effort of the users and the overall performance.

5. Experiments

This section reports the main results of the experimental activity that we have conducted to evaluate the system.

Training and evaluation setup. To train and test ZERO-BIGRU-LWAN, we partition both datasets into three training-validation-test sets following an 80%-10%-10% partitioning. We consider all the labels without training examples as zero labels, following the same approach as [5], obtaining the zero/few/frequent distribution shown in Figure 1c. We process both text and label descriptors, making them lowercase while keeping only alphanumeric characters. Additionally, to avoid affecting the centroid representation of each label with non-qualifying words, we remove the most common Italian stopwords from label descriptors. Also, both Wikipedia2Vec and fastText have been trained on a recent dump of the Italian Wikipedia downloaded⁶ in July 2022. For training we use word embeddings with 300 dimensions, the Adam [18] optimizer with a learning rate of 0.001 and a batch size of 16, and employ early stopping on the validation loss to reduce overfitting.

We evaluate the system based on the number of true labels retrieved for a given text, regardless of their rank. Following the same argument as [5], we avoided using Precision@K and Recall@K, which can under- or over-estimate performance when $K$ differs from the actual number of true labels (∼3 in our case, as shown in Table 1).
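The effect of a mismatch between $K$ and the number of true labels can be made concrete with a small worked example. The sketch below compares Precision@K and Recall@K with a variant that normalizes the number of hits by $\min(K, R)$, where $R$ is the number of true labels; the function names and the toy label identifiers are illustrative assumptions.

```python
def hits_at_k(ranked, true_labels, k):
    """Number of true labels among the first k ranked labels."""
    return sum(1 for label in ranked[:k] if label in true_labels)


def precision_at_k(ranked, true_labels, k):
    return hits_at_k(ranked, true_labels, k) / k


def recall_at_k(ranked, true_labels, k):
    return hits_at_k(ranked, true_labels, k) / len(true_labels)


def r_precision_at_k(ranked, true_labels, k):
    # Normalize by the best achievable number of hits for this article.
    return hits_at_k(ranked, true_labels, k) / min(k, len(true_labels))


K = 15

# Case 1: 3 true labels, all of them retrieved in a shortlist of 15.
true_few = {"t1", "t2", "t3"}
ranked_few = ["t1", "t2", "t3"] + [f"other{i}" for i in range(12)]
p_few = precision_at_k(ranked_few, true_few, K)    # 3 / 15 = 0.2
r_few = recall_at_k(ranked_few, true_few, K)       # 3 / 3 = 1.0
rp_few = r_precision_at_k(ranked_few, true_few, K)  # 3 / 3 = 1.0

# Case 2: 20 true labels, and every one of the 15 shortlist slots is correct.
true_many = {f"t{i}" for i in range(20)}
ranked_many = [f"t{i}" for i in range(15)]
p_many = precision_at_k(ranked_many, true_many, K)    # 15 / 15 = 1.0
r_many = recall_at_k(ranked_many, true_many, K)       # 15 / 20 = 0.75
rp_many = r_precision_at_k(ranked_many, true_many, K)  # 15 / 15 = 1.0
```

In both cases the shortlist is as good as it can possibly be, yet Precision@15 penalizes the first case and Recall@15 penalizes the second, while the normalized variant scores 1.0 in both.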
For this reason, we report the R-Precision@K (RP@K) of the model that achieved the best loss on the validation set:

$$\mathrm{RP@K} = \frac{1}{N} \sum_{n=1}^{N} \frac{S_n(K)}{\min(K, R_n)}$$

Here, $N$ is the total number of articles in the test set, $R_n$ is the number of true labels for the $n$-th article, and $S_n(K)$ is the number of true labels retrieved in the shortlist $S$ of size $K$ for the current article.

It is worth observing that previous work reports the RP@K of each label group, evaluated against the top-$k$ labels in each group in isolation (i.e., zero labels are evaluated by considering only the top-$k$ zero labels, ignoring the frequent and few ones), as well as an overall RP@K that considers all the top-$k$ labels by the model. Since the overall RP@K is strongly influenced by the frequency of each label (i.e., the performance of frequent labels is weighted more), it does not provide a clear picture of how often the system is able to place non-frequent labels in the final shortlist. Therefore, we report the RP@K of each label group always evaluating against the final shortlist produced by the system and shown to the domain expert.

⁶ https://dumps.wikimedia.org/itwiki/20220701/

Table 2: Evaluation results of the baseline and ZERO-BIGRU-LWAN, with fastText (ft) and Wikipedia2Vec (w2k), testing with top-15 and Ratio Threshold Strategy. All results are expressed as RP@15 on the test set.
Method                   Word Embeddings   Overall   Frequent   Few     Zero
Baseline-top15           /                 0.444     0.413      0.456   0.450
ZERO-BIGRU-LWAN-top15    ft                0.751     0.833      0.589   0.050
ZERO-BIGRU-LWAN-top15    w2k               0.742     0.828      0.566   0.075
ZERO-BIGRU-LWAN-RTS      w2k               0.734     0.821      0.552   0.530
ZERO-BIGRU-LWAN-RTS      ft                0.738     0.823      0.565   0.525

Baseline. We compare the results obtained by ZERO-BIGRU-LWAN with a baseline that first encodes both article text and label descriptors using TF-IDF. Then, it ranks each label based on the cosine similarity between the article and the label representation. The final rank is used to select the labels for the shortlist.
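A non-learning baseline of this kind can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the smoothed IDF formula, the choice of fitting the IDF weights on the article together with the label descriptors, the function names, and the toy inputs are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn token lists into sparse TF-IDF vectors (dicts token -> weight)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF; the exact weighting scheme is an assumption of this sketch.
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_labels(article_tokens, descriptors, k=15):
    """Return the k labels whose descriptors are most similar to the article."""
    names = list(descriptors)
    vectors = tfidf_vectors([article_tokens] + [descriptors[name] for name in names])
    article_vec, label_vecs = vectors[0], vectors[1:]
    scores = {name: cosine(article_vec, vec) for name, vec in zip(names, label_vecs)}
    return sorted(names, key=scores.get, reverse=True)[:k]


# Toy descriptors (already lowercased and tokenized) and a toy article.
descriptors = {
    "TELEMATICA": ["telematica"],
    "VACCINAZIONI OBBLIGATORIE": ["vaccinazioni", "obbligatorie"],
    "ALBI ELENCHI E REGISTRI": ["albi", "elenchi", "registri"],
}
article = ["obbligo", "vaccinazioni", "obbligatorie", "personale", "sanitario"]
shortlist = rank_labels(article, descriptors, k=2)
```

Since no parameter is learned from annotated articles, such a ranking treats frequent, few and zero labels in exactly the same way.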
We chose this baseline since it is similar, in spirit, to what the attention layer does in ZERO-BIGRU-LWAN, namely "comparing" a label and an article by their semantic representation. Furthermore, it provides information about the difficulty of the task by showing how many classifications can be made by just retrieving labels with similar content.

5.1. System evaluation

Table 2 summarizes the results of our experiment of training the ZERO-BIGRU-LWAN classifier with Wikipedia2Vec and fastText, creating the shortlist with both a top-$k$ strategy and our proposed Ratio Threshold Strategy (RTS). We report the RP@K obtained on the test set, using a shortlist of size $K = 15$, as this number has been agreed with the domain experts to be comfortable for a quick overview. As explained in Section 5, we report the RP@15 of each label group always considering the full final shortlist shown to the domain expert.

From the table, we can see how the baseline behaves using a top-15 strategy. Although the results are not high, they clearly show that matching label descriptors with the text content is a good signal for this type of classification. Furthermore, we expected balanced results across label groups, since the distinction between frequent/few/zero relates to training data and this baseline is non-learning.

ZERO-BIGRU-LWAN, both with Wikipedia2Vec and fastText, achieved good RP@15 performance, as expected. Even on the group of labels with few training examples it beats the baseline, by exploiting the training data in addition to the label descriptor content. On the other hand, we found that it struggles to put extremely rare labels (zero) in the final shortlist when used with a naive top-15 strategy. Driven by the results of [5], which reports good RP@5 when considering just zero labels on a similar dataset, we hypothesized that there may be a useful signal in the rank of just zero labels to create a better shortlist.
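The Ratio Threshold Strategy of Section 4.3 is one way to exploit the rank of zero labels. A minimal sketch of it is given here; the function name, the input layout (a dict of label probabilities plus the set of zero labels), and the toy probabilities are illustrative assumptions, and the ratio test is written as a product to avoid dividing by a zero probability.

```python
def ratio_threshold_shortlist(probs, zero_labels, k=15, m=3, t=40.0):
    """Build a shortlist of k labels containing up to m zero labels.

    probs: dict label -> model probability (hypothetical input layout).
    zero_labels: set of labels without training examples.
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    shortlist = [label for label in ranked if label not in zero_labels][:k]  # S
    zero_top = [label for label in ranked if label in zero_labels][:m]       # Z

    for i, zero_label in enumerate(zero_top):
        pos = len(shortlist) - 1 - i  # walk backwards from the tail of S
        if pos < 0:
            break
        # Accept if p_S / p_Z <= t, written as a product to avoid division by zero.
        if probs[shortlist[pos]] <= t * probs[zero_label]:
            shortlist[pos] = zero_label
        else:
            break  # stop at the first rejected replacement
    return shortlist


# Toy probabilities: three frequent/few labels and two zero labels.
probs = {"f1": 0.9, "f2": 0.5, "f3": 0.2, "z1": 0.01, "z2": 0.001}
zero = {"z1", "z2"}
boosted = ratio_threshold_shortlist(probs, zero, k=3, m=2, t=40.0)
plain = ratio_threshold_shortlist(probs, zero, k=3, m=2, t=1.0)
```

With `t=40.0` the zero label `z1` replaces `f3` (0.2 is below 40 × 0.01), while `z2` is rejected against `f2`; with `t=1.0` no replacement is accepted and the result coincides with the plain top-3 of frequent/few labels. The position of the accepted zero labels inside the shortlist is irrelevant for a rank-independent evaluation.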
For this reason, our custom Ratio Threshold Strategy, described in Section 4.3, attempts to replace unlikely frequent/few labels with the top-$m$ zero ones, by boosting the probability of zero labels by a factor $t$. We chose $m = 3$ to limit the maximum number of zero labels in the shortlist to the average number of labels per article (see Table 1) and found $t = 40$ to be the first point that maximizes zero performance on the validation set (see Figure 3).

Figure 3: RP@15 of ZERO-BIGRU-LWAN with RTS, fixing $m = 3$ while varying $t$ (Ratio Threshold). We selected the point $t = 40$ as it maximizes the performance on the zero labels on the validation set, with minimal loss on other groups.

The last two rows of Table 2 show ZERO-BIGRU-LWAN using RTS with $k = 15$, $m = 3$, and $t = 40$. As can be seen, the proposed strategy is able to greatly increase the performance on the group of zero labels with minimal loss for the frequent and few groups. Although the performance of Wikipedia2Vec and fastText is similar, we chose fastText for our final system since it is slightly better at handling the frequent and few labels that represent the majority of TESEO.

5.2. Reliability of attention scores as explanation

As described in Section 4.1, the attention scores calculated by the model can be used to provide an explanation to the domain experts. The underlying assumption is that there is a causal relationship between the predicted label and the $n$ tokens with the highest attention scores.

Table 3: Attention table.
n     % drop in probability   % drop in probability   avg positions in rank   avg positions in rank   % stable labels
      (best n)                (random n)              (best n)                (random n)
1     53%                     2%                      -91                     -11                     89%
3     58%                     0.5%                    -132                    -9                      81%
5     57%                     0.4%                    -146                    -12                     76%
10    56%                     0.6%                    -166                    -12                     69%

To test such an assumption, we performed the following experiment. First, we randomly selected 6000 law articles from our dataset (about 30% of the total). For each of these we made predictions using the model (case A).
Then, for each law article, we randomly selected a label from the top-5 predictions. We repeated the model's predictions after changing the text of the law article: by removing the best $n$ tokens according to the attention mechanism (case B), and by removing $n$ random tokens (case C). Finally, we compared case A to cases B and C, by computing, respectively:
• how much the model-estimated probability changed for the selected label;
• the difference in the shortlist position of the selected label.

The results of this experiment are shown in Table 3. Removing the best $n$ tokens according to attention has a significantly greater impact on the metrics described above than removing $n$ random tokens. For example, with $n = 5$ the model-estimated probability for the selected labels drops by about 57% on average, whereas by removing 5 random tokens the drop is 0.4%. For the ranked list, by removing the best 5 tokens according to attention, labels lose on average 146 positions, while by removing 5 random tokens the label loses 12.

It is also interesting to note that the removal of $n$ tokens from the text has a rather circumscribed impact on the label being analyzed. In fact, the last column of Table 3 shows the percentage of labels in common between the shortlist obtained in case A and that obtained in case B: for example, by removing the top 5 tokens according to attention, the newly obtained rank contains on average 76% of the labels it contained previously.

In conclusion, label prediction by the model appears to be strongly correlated with the tokens that receive the highest attention scores. Therefore, we can conclude that highlighting the best $n$ tokens in the text of the article according to attention, in relation to a label, is equivalent to providing an explanation.

5.3. The importance of upgrading Word Embeddings

Although most bills aim to regulate "classic" aspects of our lives, such as criminal law, religion, etc., there are special cases in which a new law may refer to a concept or aspect of life that had never been observed in the past; this new law, therefore, may be written using new words that have never been used previously. For example, think of a technological advancement that needs to be regulated, such as the blockchain, or the outbreak of a pandemic caused by a new disease, such as COVID-19: the word "blockchain" first appears in an Italian bill in early 2020, while the Wikipedia page for it, in the Italian version, dates back to March 2016; similarly, before 2020 the word "covid-19" did not exist.

For this reason, it is crucial to use a frequently updated external knowledge base for such a classification system. The ability to periodically retrain the word embeddings is essential to keep track of changes that, while rare, may occur in our society.

An interesting example that shows the importance of using word embeddings trained on an up-to-date knowledge base is given in Figure 4, in which the text of a law article discussing the Long Covid syndrome is shown. The 5 highlighted words are those with the highest attention scores (in round brackets) for predicting the label VACCINAZIONI OBBLIGATORIE (i.e., "mandatory vaccination") using the ZERO-BIGRU-LWAN model, respectively, with fastText word embeddings trained on a 2022 Wikipedia dump (on the left) and a 2017 dump (on the right).

Left (fastText 2022): Art. 1. 1. Al fine di garantire la(0.04) presa in carico delle persone affette(0.06) da sindrome(0.03) Long COVID(0.44), condizione clinica caratterizzata dal mancato ritorno da parte del paziente affetto da COVID(0.18)-19 allo stato di salute precedente l'infezione acuta, le regioni e le province autonome di Trento e di Bolzano istituiscono, presso le aziende sanitarie, appositi centri.

Right (fastText 2017): Art. 1. 1. Al fine di garantire la presa in carico delle persone affette(0.14) da sindrome Long(0.13) COVID, condizione clinica caratterizzata dal mancato ritorno da parte del paziente affetto da COVID-19 allo stato di salute precedente l'infezione(0.14) acuta(0.05), le regioni e le province autonome di Trento e di Bolzano istituiscono, presso le aziende sanitarie(0.10), appositi centri.

Figure 4: Comparison between top-5 attention tokens for the label VACCINAZIONI OBBLIGATORIE using ZERO-BIGRU-LWAN trained with fastText 2022 (left) and fastText 2017 (right).

As can be seen, the model with the most recent knowledge base "knows" the concept "covid" and manages to tie it to the label VACCINAZIONI OBBLIGATORIE, while the other model does not. In fact, the first model succeeds in suggesting the label VACCINAZIONI OBBLIGATORIE in the shortlist, while the other model fails to do so.

6. Future Work

In the future we want to make the tool more accessible to non-practitioners, so that they can also use it to better understand the content of a law text. On the technical side, we would like to:
• include further baseline approaches in the experiments;
• exploit incremental training of word embeddings [17] as an alternative to retraining on all the data, as a possible way to optimize this phase;
• take into account the label hierarchy to improve the classification, as suggested by [4].

Finally, we want to extend the work so that it can also classify a bill from the title alone, as this is another interesting use case for the Italian Senate. In addition, we want to integrate EUROVOC in our solution, meeting the standards dictated by the European Union.
Acknowledgement

We would like to thank Manuela Ruisi and Patrizia Toti, the domain experts of the Italian Senate who helped us carry out this project, as well as all the annotators who have been involved over the years.

References

[1] Andrei-Marius Avram, Vasile Pais, and Dan Ioan Tufis. 2021. PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021).
[2] R. Bartolini, A. Lenci, S. Montemagni, V. Pirrelli, and C. Soria. 2004. Automatic classification and analysis of provisions in Italian legal texts: a case study. In OTM Confederated International Conferences "On the Move to Meaningful Internet Systems", pages 593–604.
[3] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, pages 135–146.
[4] I. Chalkidis, M. Fergadiotis, S. Kotitsas, P. Malakasiotis, N. Aletras, and I. Androutsopoulos. 2020. An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7503–7515.
[5] Ilias Chalkidis, Emmanouil Fergadiotis, Prodromos Malakasiotis, and Ion Androutsopoulos. 2019. Large-Scale Multi-Label Text Classification on EU Legislation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6314–6322.
[6] Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904.
[7] Ilias Chalkidis and Dimitrios Kampas. 2019. Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law, 27(2), pages 171–198.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.
[9] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.
[10] Christos Papaloukas, Ilias Chalkidis, Konstantinos Athinaios, Despina-Athanasia Pantazi, and Manolis Koubarakis. 2021. Multi-granular Legal Topic Classification on Greek Legislation. arXiv preprint arXiv:2109.15298.
[11] Andrea Tagarelli and Andrea Simeri. 2021. Unsupervised law article mining based on deep pre-trained language representation models with application to the Italian civil code. Artificial Intelligence and Law, pages 1–57.
[12] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.
[13] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), pages 2673–2681.
[14] Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, and Yuji Matsumoto. 2020. Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 23–30.
[15] Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2016. Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 250–259.
[16] Haoxi Zhong, Chaojun Xiao, Cunchao Tu, Tianyang Zhang, Zhiyuan Liu, and Maosong Sun. 2020. How Does NLP Benefit Legal System: A Summary of Legal Artificial Intelligence. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5218–5230, Online. Association for Computational Linguistics.
[17] Nobuhiro Kaji and Hayato Kobayashi. 2017. Incremental Skip-gram Model with Negative Sampling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 363–371.
[18] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.