=Paper=
{{Paper
|id=Vol-3878/10_main_long
|storemode=property
|title=Title Is (Not) All You Need for EuroVoc Multi-Label Classification of European Laws
|pdfUrl=https://ceur-ws.org/Vol-3878/10_main_long.pdf
|volume=Vol-3878
|authors=Lorenzo Bocchi,Alessio Palmero Aprosio
|dblpUrl=https://dblp.org/rec/conf/clic-it/BocchiA24
}}
==Title Is (Not) All You Need for EuroVoc Multi-Label Classification of European Laws==
Lorenzo Bocchi¹,†, Alessio Palmero Aprosio¹,*,†
¹ University of Trento, Italy
Abstract
Machine Learning and Artificial Intelligence approaches within Public Administration (PA) have grown significantly in recent years. Specifically, new guidelines from various governments recommend employing the EuroVoc thesaurus for the classification of documents issued by the PA. In this paper, we explore methods to perform document classification in the legal domain, in order to mitigate the input-length limitation of BERT models. We first collect data from the European Union, already tagged with the aforementioned taxonomy. Then we reorder the sentences of each text, with the aim of bringing the most informative part of the document to the beginning. Results show that the title and the context are both important, although the order of the text may not be. Finally, we release on GitHub both the dataset and the source code used for the experiments.
Keywords
EuroVoc taxonomy, Sentence reordering, Text classification
1. Introduction

The presence of Machine Learning and Artificial Intelligence techniques has become almost ubiquitous in many fields, from hobbyist projects to industrial and government usage. Inside the Italian Public Administration, too, there have been efforts to digitize and modernize processes for more than a decade. In particular, some documents released by the Italian PA suggest the use of EuroVoc,¹ a multilingual thesaurus developed and maintained by the Publications Office of the European Union (EU) that covers a wide range of subjects (law, economics, environment, ...) organized hierarchically. Outside Italy, the Portuguese [1] and Croatian [2] communities are making efforts to automatically tag official regulations with EuroVoc. In addition, in 2010 the EU organized the Eurovoc Conference² in Luxembourg, in order to facilitate the comprehension and use of the taxonomy.

The classification of a document with respect to the EuroVoc taxonomy has previously been addressed by several studies (see Section 2), since at present the classification of PA documentation is carried out manually, a task that can be very expensive in the long run.

In this context, we concentrate our work on automatically assigning EuroVoc labels to a document, starting from the existing approaches in document and text classification, which use pretrained large language models followed by a fine-tuning phase on a specific task. Unfortunately, these families of language models have an intrinsic limit on the maximum number of tokens in an input text (usually 512). For documents that can be quite long, such as legal ones, it is important to make sure that the key information about a text is included in the chosen set of tokens. Previous research deals with this limit by concatenating the title with the raw text and then clipping the result to the limit.

In some countries (such as Italy, see [3]) the title is usually very well formulated and is very important for correctly classifying a document. On the contrary, the text of a law is usually very redundant, and its most representative part often comes after a long sequence of preambles.

Given these premises, we investigate how the previous approaches work on European laws and apply different strategies to create a summarized version of a text by reordering its sentences. The results show that in this specific case both the title and the context are important, and that for regulations enacted by the European Parliament the best approach is to fill the 512-token limit with as much information as possible.

The paper is structured as follows: Section 2 presents the related work; Section 3 describes the data; the approach and the experiments are described in Section 4; the results are then discussed in Section 5. Finally, both the software and the dataset are available for download, as described in Section 6.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
lorenzo.bocchi@unitn.it (L. Bocchi); a.palmeroaprosio@unitn.it (A. Palmero Aprosio)
ORCID: 0000-0002-1484-0882 (A. Palmero Aprosio)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
¹ https://bit.ly/eurovoc-ds
² https://bit.ly/eurovoc-conference
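The title-plus-text clipping baseline described in the Introduction can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name is hypothetical, and a whitespace split stands in for the BERT subword tokenizer whose 512-token limit the paper refers to.

```python
def build_input(title, text, limit=512):
    """Concatenate title and body, then clip to the model limit.

    Sketch of the clipping baseline from the Introduction; real
    systems count BERT subword tokens, here whitespace tokens
    stand in for them.
    """
    tokens = (title + " " + text).split()
    return tokens[:limit]

# A long law gets truncated; a short one passes through unchanged.
clipped = build_input("Council Regulation on fisheries", "word " * 1000)
```

The point of the baseline is precisely what later sections question: whatever falls beyond the limit is silently discarded, regardless of how informative it is.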
2. Related work

There have been a number of studies exploring the classification of European legislation with EuroVoc labels.

JRC EuroVoc Indexer [4] is a tool that allows the categorization of documents with EuroVoc classifiers in 22 languages. The data used is contained in an old dataset [5] with documents up to 2006. The algorithm involves generating a collection of lemma frequencies and weights. These frequencies are associated with specific descriptors, referred to as associates or topic signatures in the paper. When classifying a new document, the algorithm selects the descriptors from the topic signatures that exhibit the highest similarity to the lemma frequency list of the new document.

The research described in [6] explored the usage of Recurrent Neural Networks on extreme multi-label classification datasets, including RCV1 [7], Amazon-13K [8], Wiki-30K and Wiki-500K [9], and an older EUR-Lex dataset from 2007 [10].

In [11] the authors explore the usage of different deep-learning architectures. Furthermore, they also released a dataset of 57,000 tagged documents from EUR-Lex.

There are also other monolingual studies on the topic, mainly concentrating on Italian [12], Croatian [13], and Portuguese [1].

More recent works on multilingual classification with EuroVoc are described in Chalkidis et al. [14], Shaheen et al. [15], and Wang et al. [16].

3. Dataset

3.1. EUR-Lex

The primary source for European legislation is EUR-Lex,³ a web portal offering comprehensive access to EU legal documents. It is available in all 24 official languages of the European Union and is updated daily by its Publications Office. Most documents on EUR-Lex are manually categorized using EuroVoc concepts.

3.2. EuroVoc

EuroVoc's hierarchical structure is divided into three layers: Thesaurus Concept (TC), Micro Thesaurus (MT, previously known as the "sub-sector" level), and Domain (DO, previously known as the "main sector" level). Each layer contains descriptors for documents, covering a broad range of EU-related subjects such as law, economics, social affairs, and the environment, each at varying levels of detail. The TC level is the foundational layer where all key concepts reside, and documents on EUR-Lex are tagged with labels from this level. Each TC is linked to an MT, which is in turn part of a specific DO.

The version of EuroVoc used for our studies is 4.17, released on 31st January 2023, containing 7,382 TCs, 127 MTs, and 21 DOs.

3.3. Dataset collection

To collect the documents for our task, we built a set of tools written in Python that can be customized to obtain different subsets of the data (year, language, etc.). In total, after filtering out the documents not tagged with EuroVoc or not containing an easily accessible text (for instance, old documents only available as scanned PDFs), we collected around 1.1 million documents in four languages (English, Italian, Spanish, French).

As a subsequent step, we also removed labels that have been deprecated by the EuroVoc developers throughout the years.⁴ Following previous work [11], we also removed labels having fewer than 10 examples.

Finally, by looking at the data, we noticed that the labelling became consistent starting from 2004, while many deprecated labels are still present in documents, especially those prior to 2010. We therefore consider only documents published in the interval 2010-2022.

The final dataset consists of 471,801 documents. On average, each law is labelled with 6 EuroVoc concepts. Table 1 shows some statistics about the dataset used.

4. Experiments

In this Section, we describe the experiments performed on the above-described data.

4.1. Data split

To keep our experiments consistent with previous similar approaches [17], we split the data into train, dev, and test sets with an approximate 80/10/10 ratio. In order to make the training reproducible, and to avoid that a single random extraction could be too (un)lucky, we repeat the split with a pseudo-random number generator using three different seeds.

Each partition into train/dev/test is done using Iterative Stratification [18, 19], in order to preserve the concept balance.

Unless otherwise specified, all the results in the rest of the paper refer to the average of the values obtained by our experiments on the three splits.

³ https://eur-lex.europa.eu/
⁴ https://bit.ly/eurovoc-handbook
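The seeded 80/10/10 protocol of Section 4.1 can be sketched as below. The function name is illustrative, and a plain seeded shuffle replaces the Iterative Stratification of [18, 19] that the paper actually uses to preserve label balance; only the ratio and the seed-based reproducibility are modeled here.

```python
import random

def split_80_10_10(doc_ids, seed):
    """Seeded 80/10/10 split of document ids.

    Simplified sketch: the paper uses Iterative Stratification
    [18, 19] so that label proportions are preserved; the plain
    shuffle used here for brevity ignores the labels.
    """
    rng = random.Random(seed)  # seeded PRNG makes the split reproducible
    ids = list(doc_ids)
    rng.shuffle(ids)
    n_train = int(len(ids) * 0.8)
    n_dev = int(len(ids) * 0.1)
    return (ids[:n_train],
            ids[n_train:n_train + n_dev],
            ids[n_train + n_dev:])

# Three repetitions with different seeds, as in Section 4.1;
# reported scores are averaged over the resulting splits.
splits = [split_80_10_10(range(1000), seed) for seed in (1, 2, 3)]
```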
                                                          English   Italian   Spanish    French
Total documents                                           195,236   177,952   178,444   183,068
Documents with text and EuroVoc labels                    118,296   117,711   117,882   117,912
Number of EuroVoc labels used before filtering              6,098     6,088     6,098     6,088
Number of EuroVoc labels having less than 10 documents      2,070     2,077     2,070     2,070
Final number of labels                                      4,028     4,011     4,028     4,018
Removed documents                                               3         3         3         3
Table 1
Number of documents in English, Italian, Spanish, and French relative to the time interval 2010-2022.
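The label filtering of Section 3.3, whose per-language effect Table 1 quantifies (deprecated labels dropped, then labels with fewer than 10 documents removed), can be sketched as follows. The function and its data structures are illustrative assumptions, not the released tooling.

```python
from collections import Counter

def filter_labels(doc_labels, deprecated):
    """doc_labels: mapping doc_id -> set of EuroVoc label ids.

    Illustrative sketch of Section 3.3: drop deprecated labels,
    then labels with fewer than 10 remaining examples; documents
    left without any label are removed from the dataset.
    """
    # 1. drop labels deprecated by the EuroVoc developers
    cleaned = {d: labels - deprecated for d, labels in doc_labels.items()}
    # 2. count how many documents carry each surviving label
    counts = Counter(l for labels in cleaned.values() for l in labels)
    keep = {l for l, c in counts.items() if c >= 10}
    # 3. keep only retained labels; drop now-unlabelled documents
    out = {}
    for d, labels in cleaned.items():
        labels &= keep
        if labels:
            out[d] = labels
    return out
```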
4.2. Methodology

Our models are trained using BERT [20] and its derivatives.

The choice of the pre-trained model is very important for the accuracy of the classification performed by the model obtained after fine-tuning. In particular, [21] shows that classification tasks in the legal domain obtain better performance when the model is pre-trained on legal corpora. Nevertheless, in some preliminary experiments we tried BERT models pre-trained on various datasets (among them, legal ones of course), and the results did not always favour the models built from legal texts. Although the differences were not statistically significant, we decided to use the following models anyway (from HuggingFace⁵):

• legal-bert-base-uncased [22], consisting of 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources;
• bert-base-italian-xxl-cased [23], the main Italian BERT model, trained on a recent Wikipedia dump, various texts from the OPUS corpora collection,⁶ and data from the Italian part of the OSCAR corpus;⁷
• bert-base-spanish-wwm-cased [24], also called BETO, a BERT model trained on a big Spanish corpus⁸ of 3 billion words;
• camembert-base [25], a state-of-the-art language model for French based on the RoBERTa model [26].

4.3. Basic configurations

The basic configurations consist of using the sole title, the sole text, and the concatenation of the title and the text. Note that, apart from some rare outliers, title length is consistently less than 50 tokens.

4.4. Pre-processing

The text of the laws is preprocessed using spaCy,⁹ a Natural Language Processing pipeline that can extract information from texts in 24 languages. In particular, we used it to perform sentence splitting, part-of-speech tagging, and named-entity recognition, which are used to extract content words from the text and to select the sentences used in the task.

4.5. Summarization

Given that the input length for these BERT models is 512 tokens, while legislative texts are usually longer, summarizing the text by keeping its most important parts, so that they fit in the input, is an important step.

As underlined in the Introduction, the text of a law is usually very redundant, and its most representative part often comes after a long sequence of preambles. Since the limit of 512 tokens is very restrictive compared to the usual length of a legal document, we concentrate our summarization effort on reordering the sentences inside a single document, so that the most informative part of the text is brought to the beginning and therefore included in the first 512 tokens.

We use two different approaches to reach this goal: TF-IDF and centroid-based. In both cases, we train both on the reordered text alone and on the concatenation of the title and the reordered text.

4.5.1. TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used technique in information retrieval and text mining to quantify the importance of terms in a document within a larger collection of documents. It aims to highlight terms that are both frequent within a document and relatively rare in the overall collection, thus capturing their discriminative power.

The TF-IDF score of a term in a document is calculated by multiplying two factors: the term frequency (TF) and the inverse document frequency (IDF). Let $t$ be the term and $d$ the document:

$\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}$

$\mathrm{idf}(t, D) = \log \frac{N}{1 + |\{d \in D : t \in d\}|}$

where $f_{t,d}$ is the frequency of term $t$ in document $d$, and $N = |D|$ is the number of documents in the set $D$.

Beyond the usual TF-IDF, we also follow a label-based approach that considers one document for each label, obtained by concatenating all the texts of the laws carrying that label.

Once all the documents have gone through this process, the TF-IDF matrix is calculated using TfidfVectorizer from the Python package scikit-learn¹⁰ over the content words (see Section 4.4) of the texts.

After obtaining the TF-IDF matrix, the final step is to assign a score to each sentence. For each valid base form, its score is determined from the TF-IDF matrix by selecting the highest value within the corresponding column (which represents a word). These scores are collected in a list for each sentence. Once a sentence is processed, either the maximum or the average of its scores ("max" and "mean" in the results) becomes the sentence's score. The process is repeated for all sentences in every document.

4.5.2. Centroid

In this approach, described in [27], the centroid of the word vectors in the text is calculated, and a score is assigned to each sentence based on its cosine distance from the centroid: the closer a sentence is to the centroid, the higher its score. In our approach, we use fastText [28] for the word embeddings.

The words used to compute the centroid are those that have been extracted as content words (see Section 4.4) and have a TF-IDF higher than a certain threshold $t$, in our case 0.3. The centroid is computed as the mean of the word embeddings of the selected words:

$C = \frac{\sum_{w \in D_t} E[\mathrm{idx}(w)]}{|D_t|}$

where $D_t$ is the set of words with $\mathrm{tfidf}(w) > t$.

Each sentence in the document is then turned into a single embedding representation by averaging the embedding vectors of its words:

$S_j = \frac{\sum_{w \in S_j} E[\mathrm{idx}(w)]}{|S_j|}$

where $S_j$ is the $j$-th sentence in document $D$.

After obtaining the embedding for a sentence, its score is computed as the cosine similarity between the centroid and the embedding:

$\mathrm{sim}(C, S_j) = \frac{C^T \cdot S_j}{\|C\| \times \|S_j\|}$

By using this approach, every text is converted into a list of ranked sentences, each with its own score.

4.6. Random

Because of the results obtained (see Section 4.7), we also added two configurations that use a random ordering of the sentences (one concatenated with the title, the other containing only the randomly ordered text).

4.7. Evaluation

The evaluation of our experiments is performed using the F1 score, macro-averaged so that each label has the same weight (this metric rewards models that perform better on less-represented labels). Since we are dealing with a multi-label classification task, we have to choose between always considering the same number $K$ of results ($P@K$, $R@K$, $F1@K$) or keeping only the labels whose confidence is higher than a particular threshold (usually between 0 and 1). In our experiments, we chose the second approach, since the number of concepts per document in the dataset is not constant. Based on the evaluation performed on the development set, we set that threshold to 0.5.

4.8. Results

Table 2 shows the results of the different configurations in the four languages. The first column contains the description of the experiment, while columns TC, MT, and DO show the results in terms of Thesaurus Concept (TC), Micro Thesaurus (MT), and Domain (DO), as described in Section 3.

5. Discussion

Results show that the best performances are reached when the title is included in the text (see the rows without "not"), with the exception of the simple use of the text without reordering. An interesting outcome is that the title+random experiment obtains very good results when compared to the best configurations.

On the contrary, using random text without the title, or using the sole title, results in a decrease in global performance.

⁵ https://huggingface.co/
⁶ http://opus.nlpl.eu/
⁷ https://traces1.inria.fr/oscar/
⁸ https://bit.ly/big-spanish-corpora
⁹ https://spacy.io/
¹⁰ https://scikit-learn.org
                         English                 Italian                 Spanish                 French
                     TC     MT     DO        TC     MT     DO        TC     MT     DO        TC     MT     DO
basic               0.484  0.729  0.812     0.450  0.709  0.798     0.493  0.732  0.818     0.383  0.666  0.775
basic-not           0.474  0.722  0.808     0.453  0.710  0.799     0.483  0.726  0.811     0.370  0.655  0.765
centroid            0.468  0.720  0.806     0.454  0.710  0.799     0.479  0.719  0.810     0.372  0.658  0.764
centroid-not        0.426  0.692  0.784     0.405  0.673  0.774     0.430  0.687  0.784     0.335  0.627  0.745
title-only          0.432  0.682  0.772     0.407  0.665  0.758     0.444  0.684  0.771     0.320  0.600  0.716
tfidf-max-doc       0.476  0.724  0.811     0.427  0.693  0.788     0.459  0.711  0.804     0.345  0.642  0.754
tfidf-max-lab       0.477  0.728  0.812     0.459  0.711  0.802     0.483  0.724  0.813     0.378  0.660  0.767
tfidf-mean-doc      0.479  0.726  0.812     0.427  0.693  0.786     0.484  0.726  0.812     0.381  0.663  0.774
tfidf-mean-lab      0.481  0.726  0.813     0.428  0.693  0.788     0.485  0.726  0.813     0.338  0.633  0.749
tfidf-max-doc-not   0.427  0.692  0.787     0.379  0.657  0.763     0.422  0.682  0.786     0.301  0.607  0.726
tfidf-max-lab-not   0.433  0.696  0.791     0.411  0.678  0.779     0.425  0.685  0.782     0.298  0.608  0.728
tfidf-mean-doc-not  0.433  0.696  0.790     0.415  0.682  0.781     0.442  0.700  0.796     0.332  0.626  0.742
tfidf-mean-lab-not  0.436  0.697  0.792     0.388  0.667  0.771     0.428  0.684  0.784     0.296  0.598  0.723
random              0.472  0.722  0.808     0.423  0.692  0.787     0.482  0.723  0.807     0.372  0.652  0.767
random-not          0.429  0.693  0.788     0.398  0.671  0.774     0.439  0.693  0.778     0.318  0.611  0.724
Table 2
Results of our experiments (macro $F_1$).
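As a reference for the numbers above, the threshold-based macro-averaged F1 described in Section 4.7 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the function name and data layout are assumptions.

```python
def macro_f1(scores, gold, labels, threshold=0.5):
    """Macro-averaged F1 over a fixed label set.

    scores: list of dicts mapping label -> confidence in [0, 1]
    gold:   list of sets of true labels (same order as scores)
    A label is predicted when its confidence >= threshold; each
    label contributes equally to the average (Section 4.7), so
    rare labels weigh as much as frequent ones.
    """
    f1s = []
    for label in labels:
        tp = fp = fn = 0
        for conf, truth in zip(scores, gold):
            pred = conf.get(label, 0.0) >= threshold
            if pred and label in truth:
                tp += 1
            elif pred:
                fp += 1
            elif label in truth:
                fn += 1
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Unlike $F1@K$, this formulation lets each document receive a different number of labels, which matches the dataset, where the number of concepts per document is not constant.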
By looking at the statistical significance,¹¹ we find that we can split, more or less, the experiments into two big groups: the ones that in the English part of the table have a DO $F_1$ above 0.80, and the remaining ones that are below 0.79. The exception is the "title-only" configuration, which obtains lower accuracy in all languages and contrasts with the results obtained in a similar previous work on Italian laws [3], where the use of the sole title results in an increase in performance with respect to the concatenation of title and text.

By listing the documents where EuroVoc labels are not extracted correctly, we see that in European legislation it is quite common to find very generic titles. For instance, the title of the document with ID "CELEX:32011Q0624(01)" is "Rules of procedure for the appeal committee (Regulation (EU) No 182/2011)", from which it is very hard to extract relevant information about the topic. One can find other similar documents, such as "Action brought on 2 March 2011 — Attey v Council", the title of the law with ID "CELEX:62011TN0118".

In general, our experiments show that the classification of European laws obtains the best performance on BERT when all the available tokens are filled, possibly using the title and some parts of the text. The high accuracy obtained in the experiments performed by randomly reordering the sentences demonstrates that the context is important per se, even when no particular strategies are used to select it.

French results show significantly lower accuracy: this is unexpected and is probably due to the choice of the BERT pre-trained model.

6. Release

The source code for all the experiments (from the retrieval of the documents to the training of the models), the data downloaded from EUR-Lex, and the models are available on the project GitHub page.¹²

7. Conclusions and Future Work

In this paper, we presented some approaches to perform document classification on long documents by reordering their sentences before the fine-tuning phase. The best results are obtained when all the 512 tokens allowed in the BERT paradigm are filled, possibly including the title of the law.

In the future, we want to extend this approach to other languages, trying to understand whether the same reordering algorithm leads to some improvement in the classification task. We will also investigate other summarization approaches, as well as new architectures that rely on Local, Sparse, and Global attention [29], so that longer texts (up to 16K tokens) can be used to train the model.

¹¹ To calculate statistical significance, a one-tailed $t$-test with a significance level of .05 was applied to the scores of the five runs, with the null hypothesis that no difference is observed, and the alternative hypothesis that the score obtained with the summarized text is significantly greater than the one with the normal text.
¹² https://github.com/bocchilorenzo/AutoEuroVoc
References

[1] D. Caled, M. Won, B. Martins, M. J. Silva, A hierarchical label network for multi-label EuroVoc classification of legislative contents, in: Digital Libraries for Open Knowledge: 23rd International Conference on Theory and Practice of Digital Libraries, TPDL 2019, Oslo, Norway, September 9-12, 2019, Proceedings, Springer-Verlag, Berlin, Heidelberg, 2019, pp. 238–252. doi:10.1007/978-3-030-30760-8_21.
[2] T. D. Prekpalaj, The role of key words and the use of the multilingual EuroVoc thesaurus when searching for legal regulations of the Republic of Croatia - research results, in: 2021 44th International Convention on Information, Communication and Electronic Technology (MIPRO), 2021, pp. 1470–1475. doi:10.23919/MIPRO52101.2021.9597043.
[3] M. Rovera, A. P. Aprosio, F. Greco, M. Lucchese, S. Tonelli, A. Antetomaso, Italian legislative text classification for Gazzetta Ufficiale (2022).
[4] R. Steinberger, M. Ebrahim, M. Turchi, JRC EuroVoc indexer JEX - a freely available multi-label categorisation tool, arXiv preprint arXiv:1309.5223 (2013).
[5] R. Steinberger, B. Pouliquen, A. Widiger, C. Ignat, T. Erjavec, D. Tufiş, D. Varga, The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages, in: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), European Language Resources Association (ELRA), Genoa, Italy, 2006. URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/340_pdf.pdf.
[6] R. You, Z. Zhang, Z. Wang, S. Dai, H. Mamitsuka, S. Zhu, AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification, Advances in Neural Information Processing Systems 32 (2019).
[7] D. D. Lewis, Y. Yang, T. G. Rose, F. Li, RCV1: A new benchmark collection for text categorization research, J. Mach. Learn. Res. 5 (2004) 361–397.
[8] J. McAuley, J. Leskovec, Hidden factors and hidden topics: Understanding rating dimensions with review text, in: Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 165–172. doi:10.1145/2507157.2507163.
[9] A. Zubiaga, Enhancing navigation on Wikipedia with social tags, arXiv preprint arXiv:1202.5469 (2012).
[10] E. Loza Mencía, J. Fürnkranz, Efficient multilabel classification algorithms for large-scale problems in the legal domain, 2010. doi:10.1007/978-3-642-12837-0_11.
[11] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, I. Androutsopoulos, Large-scale multi-label text classification on EU legislation, arXiv preprint arXiv:1906.02192 (2019).
[12] G. Boella, L. Di Caro, L. Lesmo, D. Rispoli, Multi-label classification of legislative text into EuroVoc, Legal Knowledge and Information Systems: JURIX 2012: the Twenty-Fifth Annual Conference 250 (2013) 21. doi:10.3233/978-1-61499-167-0-21.
[13] F. Saric, B. D. Basic, M.-F. Moens, J. Šnajder, Multi-label classification of Croatian legal documents using EuroVoc thesaurus, 2014.
[14] I. Chalkidis, M. Fergadiotis, I. Androutsopoulos, MultiEURLEX - a multi-lingual and multi-label legal document classification dataset for zero-shot cross-lingual transfer, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 6974–6996. doi:10.18653/v1/2021.emnlp-main.559.
[15] Z. Shaheen, G. Wohlgenannt, E. Filtz, Large scale legal text classification using transformer models, 2020. arXiv:2010.12871.
[16] L. Wang, Y. W. Teh, M. A. Al-Garadi, Adopting the multi-answer questioning task with an auxiliary metric for extreme multi-label text classification utilizing the label hierarchy, 2023. arXiv:2303.01064.
[17] A. Avram, V. F. Pais, D. Tufis, PyEuroVoc: A tool for multilingual legal document classification with EuroVoc descriptors, CoRR abs/2108.01139 (2021). arXiv:2108.01139.
[18] K. Sechidis, G. Tsoumakas, I. Vlahavas, On the stratification of multi-label data, Machine Learning and Knowledge Discovery in Databases (2011) 145–158.
[19] P. Szymański, T. Kajdanowicz, A network perspective on stratification of multi-label data, in: L. Torgo, B. Krawczyk, P. Branco, N. Moniz (Eds.), Proceedings of the First International Workshop on Learning with Imbalanced Domains: Theory and Applications, volume 74 of Proceedings of Machine Learning Research, PMLR, ECML-PKDD, Skopje, Macedonia, 2017, pp. 22–35.
[20] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). arXiv:1810.04805.
[21] I. Chalkidis, E. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, Extreme multi-label legal text classification: A case study in EU legislation, in: Proceedings of the Natural Legal Language Processing Workshop 2019, Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 78–87. doi:10.18653/v1/W19-2209.
[22] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898–2904. doi:10.18653/v1/2020.findings-emnlp.261.
[23] S. Schweter, Italian BERT and ELECTRA models, 2020. doi:10.5281/zenodo.4263142.
[24] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: PML4DC at ICLR 2020, 2020.
[25] L. Martin, B. Muller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie, D. Seddah, B. Sagot, CamemBERT: a tasty French language model, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). arXiv:1907.11692.
[27] G. Rossiello, P. Basile, G. Semeraro, Centroid-based text summarization through compositionality of word embeddings, in: Proceedings of the MultiLing 2017 Workshop on Summarization and Summary Evaluation Across Source Types and Genres, 2017, pp. 12–21.
[28] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Transactions of the Association for Computational Linguistics 5 (2017) 135–146.
[29] C. Condevaux, S. Harispe, LSG attention: Extrapolation of pretrained transformers to long sequences, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, 2023, pp. 443–454.