1. Introduction

Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain

Elize Herrewijnen

Dennis F W Craandijk

National Police Lab AI, Utrecht University, Utrecht, The Netherlands

Creating meaningful text embeddings using BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain), it might not be feasible to pre-train a whole new language model. In this paper, we examine how extending BERT-based tokenizers and further pre-training BERT-based models can benefit downstream classification tasks. As a proxy for domain-specific data, we use the European Court of Human Rights (ECtHR) dataset. We find that for downstream tasks, further pre-training a language model on a small domain dataset can rival models that are completely retrained on large domain datasets. This indicates that completely retraining a language model may not be necessary to improve downstream task performance. Instead, small adaptations to existing state-of-the-art language models like BERT may suffice.

Keywords: Transformers, BERT, Language Models, Legal Text Classification, ECtHR dataset, Text Embeddings

Proceedings of the Sixth Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2023), June 23, 2023, Braga, Portugal.
* Corresponding author.
e.herrewijnen@uu.nl (E. Herrewijnen); d.f.w.craandijk@uu.nl (D. F. W. Craandijk)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
Footnote 1: https://github.com/UtrechtUniversity/Meaningful-Paragraph-Embeddings-for-Data-Scarce-Domains

1. Introduction

2. Related work

2.3. Further pre-training language models

Since the introduction of BERT, many domain-specific language models have been put on the market, for example in the clinical [12], financial [13], biomedical [14], and legal [6, 15] domains. Using embeddings from domain-specific language models has a positive effect on the performance of various downstream-task NLP models, because the text embeddings contain more domain-specific information.

For the legal domain, Limsopatham [16] compares the newly pre-trained models by Chalkidis et al. [6] and Zheng et al. [15] and finds that both legal domain-specific models outperform generic language models like BERT. However, these models inadequately encode long legal texts, as parts of the inputs are truncated to fit into the language model.

In the clinical domain, Lamproudis et al. [17] show that BERT models further pre-trained on in-domain data outperform generic BERT models after a single training epoch. In this paper, we investigate whether this also applies to the ECtHR dataset, which is representative of the legal domain.

3. Methods

Creating meaningful text embeddings requires multiple steps: first, a tokenizer model tokenizes the text. This tokenization is used by an encoder model to create an embedding. Finally, this embedding can be used by a predictor model to perform a downstream task. We now describe how the tokenizer, language model, and predictor can be modified to achieve meaningful embeddings in scarce-data domains.

3.1. Dataset

The European Court of Human Rights (ECtHR) handles alleged violations of European Convention of Human Rights (ECHR) articles (footnote 2). We use the ECtHR dataset as a proxy for law enforcement datasets, as these datasets often consist of long texts with domain-jargon in our experience. The ECtHR dataset as introduced by Chalkidis et al. [18] contains 11k legal cases, containing facts (a list of paragraphs representing the facts of the case such as events), allegedly violated articles, violated articles, and silver allegation rationales (relevant facts identified using a regular expression) and gold allegation rationales (relevant facts annotated by a legal expert).

To further pre-train our language model, we use all facts in the training split as used by Chalkidis et al. [18], further split into a total of 588090 sentences. For our downstream task, we use the violated articles as labels, resulting in a multi-label classification task. Due to the class imbalance in the dataset, we only retain the 10 most common classes (see Table 1), and adopt the same train, dev, and test splits as Chalkidis et al. [18] for training the classification model. As shown in Table 1, article types vary in number of facts and number of characters, which we statistically tested as significant using a Two-Sample t-Test.

Footnote 2: See https://www.echr.coe.int/Documents/Convention_ENG.pdf for an extensive description of the convention.

Table 1: The 10 most common violated articles retained as classes. Only the article number and name columns are recoverable.

Art.   Name
6      Right to a fair trial
P1-1   Protection of property
5      Right to liberty and security
3      Prohibition of torture
13     Right to an effective remedy
8      Right to respect for private and family life
2      Right to life
10     Freedom of expression
14     Prohibition of discrimination
11     Freedom of assembly and association
(Other articles)

3.2. Language models

As baselines for our analysis, we select four BERT-based language models that have shown their applicability to NLP in the legal domain.

BERT-ML. The BERT base multilingual cased (BERT-ML) [9] is a multi-language model pre-trained on the top 104 languages with the largest Wikipedia corpus. It is a powerful model for capturing generic text data, and can effectively be fine-tuned for downstream tasks [19].

LEGAL-BERT. The LEGAL-BERT model is trained from scratch using the same approach as BERT, but on 12 GB of English legal texts (e.g., legislation, court cases, contracts) from publicly available sources [6]. This model outperforms the BERT model when fine-tuned for legal classification tasks [16].

RoBERTa. The RoBERTa model by Liu et al. [20] is a version of BERT that is trained on a much larger (x10) English language corpus using a dynamic masking technique. This allows the model to produce more robust and generalizable embeddings, outperforming BERT on various NLP tasks [20].

Longformer. The Longformer model by Beltagy et al. [21] builds on RoBERTa, but expands the max input length to 4096 tokens. The model is further pre-trained on large generic texts like news and web pages, and outperforms RoBERTa on long document NLP tasks [21]. Note that the increased max input length renders the model more resource-expensive.
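The label preparation described in Section 3.1 (retaining only the most common violated articles and treating the task as multi-label) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `top_k_labels` and `to_multi_hot` and the toy cases are hypothetical.

```python
from collections import Counter

def top_k_labels(cases, k=10):
    """Return the k most frequent article labels across all cases."""
    counts = Counter(label for labels in cases for label in labels)
    return [label for label, _ in counts.most_common(k)]

def to_multi_hot(labels, classes):
    """Encode one case's labels as a 0/1 vector over the retained classes."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]

# Hypothetical toy data: each case lists its violated articles.
cases = [
    ["6"],
    ["6", "P1-1"],
    ["6", "P1-1", "3"],
    ["6", "P1-1", "3", "46"],
]

# Keep the 3 most common articles here (the paper keeps 10).
classes = top_k_labels(cases, k=3)
vectors = [to_multi_hot(labels, classes) for labels in cases]
```

Labels outside the retained set (article "46" in the toy data) are simply dropped from the target vector, so a case can end up with an all-zero target if none of its articles are retained.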

3.3. Tokenizer

Effective text embeddings begin with the tokenization of the input text. A tokenizer tokenizes a text using a pre-defined vocabulary. If a word is not in the vocabulary, it is distributed across vocabulary tokens (e.g., applicant becomes app, lica, and nt). Due to their architecture, encoder models limit the max input length (usually 512 tokens). The tokenizer model should respect this limit, which usually results in input truncation. However, truncation may negatively affect downstream task performance [6] as information is lost. Thus, a larger vocabulary reduces the number of tokens required to tokenize a text, allowing more information to be captured. While a large vocabulary might seem desirable, it also increases the number of parameters the encoder model has to learn, negatively affecting training time and memory requirements. Hence, a tokenizer should be able to capture as much relevant information as possible while keeping the number of parameters (i.e., the vocabulary) manageable.

While a tokenizer that is specifically trained on domain data may be able to tokenize domain-specific texts most effectively, it may be unfeasible to train a new tokenizer; even when training data are available, the encoder model also needs to be retrained, which is a resource- and time-consuming task. Therefore, extending a tokenizer with domain-specific tokens may be more feasible.

By adding domain-specific words, these words are not split up during tokenization, which leaves more space for other tokens. Moreover, the encoder model might be able to capture information concerning the domain-specific tokens, allowing more meaningful embeddings.

For example, the LEGAL-BERT model (which contains domain-specific tokens) only requires a single token for the word 'applicants', while the BERT-ML tokenizer requires the tokens 'app', 'lica', and 'nts'.

We select the top 1% most common words in the dataset based on relative frequency using the Scikit-learn [22] CountVectorizer, and add only the yet unknown tokens to the BERT-ML and RoBERTa tokenizers. As shown in Table 2, novel words are related to the legal domain, for example 'applicant', 'prosecutor', 'detention' and month names. In total, 25 and 9 new words are added to the tokenizer vocabularies, respectively.

3.4. Encoder models

We use the extended tokenizers to further pre-train two encoder models on the ECtHR training set on a machine with two 50 GB NVIDIA RTX A6000 cards (footnote 3). Using the script provided by Devlin et al. [9], we further pre-train the BERT-ML model for 1 epoch with a batch size of 16, which takes approximately 40 minutes. Using the script provided by Beltagy et al. [21], we convert a RoBERTa model to a Longformer model, and further pre-train the model for 3000 steps with a batch size of 24, which takes approximately 2 days. We will further refer to these further pre-trained encoder models as BERT-MLf and Longformerf.

Footnote 3: Note that the training set is only 85 MB.

3.5. Classification model

We employ a convolutional neural network to classify the documents: for every fact in the document, an embedding is retrieved using one of the models from 3.2; then, the list of embeddings is stacked and fed to the network. The network consists of three 1-dimensional convolutional layers (768 × 768, kernel-size 1), followed by three linear layers (768 × 10). Finally, the mean of predictions for all facts is taken to compute the final prediction. A benefit of this stacked approach is that every fact receives an embedding, retaining more information than creating a single embedding for the whole document by concatenating facts. The model is trained using weighted BCE loss and the Adam optimizer, for 15 epochs (no early stopping) on a machine with two 25 GB NVIDIA GeForce RTX 3090 cards (footnote 4). Note that the parameters of the encoder model as described in the previous subsection remain frozen. Furthermore, the focus of this paper lies on finding meaningful embeddings, and not on the classification accuracy of the classification model: we investigate how well the different embeddings allow the classification model to learn the task.

Footnote 4: More model training details can be found on the GitHub page.

4. Results

In the following section, we discuss our results for both tokenization and classification.

4.1. Tokenization

We compare the tokenization result of the tokenizer models introduced in Section 3.2, by tokenizing the complete ECtHR dataset. Specifically, we note the following:

• The mean number of tokens required to tokenize a document (TD);

[Table 3 (column headers only; table body not recoverable): I, V, TD, UT, mDT ↓, tDT ↓]

For all of the above, lower values indicate a more efficient tokenizer. The results reported in Table 3 show that the LEGAL-BERT tokenizer is the most efficient at tokenizing input texts: it requires the fewest tokens to tokenize documents and discards the fewest tokens among the 512-limited tokenizers, while also having the smallest vocabulary. The Longformer models discard the fewest tokens overall, but require more tokens than the LEGAL-BERT tokenizer.
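To make the comparison concrete, the two quantities discussed here (tokens needed per document, and tokens lost to the input limit) can be sketched as below. This is an illustrative sketch only: the function names and token counts are hypothetical, and the "discarded" statistic is our reading of the truncation described in the text, not the exact definition behind Table 3.

```python
def tokens_per_document(token_counts):
    """Mean number of tokens needed per document (TD in the text)."""
    return sum(token_counts) / len(token_counts)

def discarded_tokens(token_counts, max_len=512):
    """Tokens cut off per document when inputs are truncated to max_len."""
    return [max(0, n - max_len) for n in token_counts]

# Hypothetical token counts for three documents.
token_counts = [300, 700, 5000]

td = tokens_per_document(token_counts)           # 2000.0
lost_512 = discarded_tokens(token_counts, 512)   # [0, 188, 4488]
lost_4096 = discarded_tokens(token_counts, 4096) # [0, 0, 904]
```

With a 4096-token limit far fewer tokens are discarded, but very long documents are still truncated, which matches the observation that a larger input limit reduces rather than removes the loss.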

Extending existing tokenizers slightly decreases the number of discarded tokens (average of 2 for both tokenizers). Thus, retraining the tokenizer model decreases the amount of removed information, but may still be insufficient for long documents.

[Table 4 residue: F1-scores .49 .39 .14 .22 .23 .0 .43 .0 .0 .0, row label apparently Longformerf; the remainder of the table is not recoverable.]
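The effect of adding a domain word to a vocabulary can be illustrated with a toy greedy longest-match subword tokenizer in the style of BERT's WordPiece. This is a simplified sketch with a hypothetical six-entry vocabulary; it is not the actual BERT-ML tokenizer, and RoBERTa's byte-level BPE splits words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, before and after adding one domain word.
vocab = {"the", "app", "##lica", "##nt", "was", "detained"}
extended = vocab | {"applicant"}

sentence = "the applicant was detained".split()
before = [p for w in sentence for p in wordpiece(w, vocab)]
after = [p for w in sentence for p in wordpiece(w, extended)]
# before: ['the', 'app', '##lica', '##nt', 'was', 'detained']
# after:  ['the', 'applicant', 'was', 'detained']
```

Each added word saves only a few tokens per occurrence, which is consistent with the small reduction in discarded tokens reported above.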

4.2. Classification

As the classification task is an unbalanced multi-label problem, we note the F1-scores in Table 4. We focus on the classification model's ability to identify independent classes, instead of the average F1-score. If the classification model is unable to identify a class (i.e., F1 = 0), we take this as an indication that the embedding does not contain relevant information about that class. Related work has noted that the multi-label classification is difficult to solve [18]. Our classification performance is also fair, but a clear difference between embeddings is visible:

• BERT-MLf embeddings outperform BERT-ML embeddings on most classes, indicating that extending existing tokenizers and further pre-training

5. Limitations and future work

This work mainly focuses on the effect of further pre-training BERT-based language models on limited domain-specific data. As we do not investigate or optimize the pre-training procedure of our BERT models, a highly relevant point for future work is investigating how BERT models can be (more) effectively (further) pre-trained on (scarce) domain-specific data. Furthermore, we used a multilingual BERT model as a starting point, which may negatively affect performance on downstream tasks.

Another limitation is that the performance of the classification model (Section 4.2) is rather low, which is due to the minimal effort put into the model. Related work (e.g., Chalkidis et al. [18]) shows much higher F1-scores using more advanced (and tested) classification models. Moreover, a more thorough error analysis might give insight into the documents that are typically misclassified by the classification model, and how pre-training the encoder models impacts classification behaviour.

A point of caution is that pre-training a language model like BERT on domain data may introduce domain-specific bias, especially when the domain dataset misrepresents identity groups (e.g., males are over-represented) [24]. To apply language models like BERT in the law enforcement domain, the possibility of introduced bias should be investigated in future work.

Finally, a limitation is the generalizability of the dataset and tasks; this work only looks at the effect of pre-training on one well-known domain-specific dataset (ECtHR) and one task (violated article classification). We expect that our findings generalize across other domain-specific datasets and tasks, especially for long texts with domain-jargon. Nevertheless, future work is required to further validate this expectation.

6. Conclusion & discussion

In this paper, we investigate the effect of further pre-training large language models on domain-specific data. In order to test this on scarce-domain data, we use the ECtHR dataset as a surrogate (Section 3.1), and further pre-train a BERT-ML and a Longformer language model on this data.

We find that extending tokenizers with domain-specific tokens reduces the number of tokens discarded, albeit slightly (Section 4.1). Retraining a tokenizer results in a much more efficient tokenization result, but also requires more data and retraining an encoder model from scratch, which might be unfeasible. In a data-scarce or resource-scarce setting, extending the tokenizer may be a good alternative, as less data is required to further pre-train the encoder model.

Embeddings constructed by the original BERT-ML adequately encode legal domain-specific information, but a completely retrained language model may be beneficial for some classification problems (Section 4.2). Moreover, in scarce-data settings, further pre-training BERT-based models using small amounts of data may be a feasible alternative to training a language model from scratch. In particular, the combination of adding domain-specific tokens to the tokenizer and further pre-training the language model on a small dataset is a promising direction for future research. Whether our findings generalize across other domains and tasks is a question for future work.

References

[1] M. V. Koroteev, Bert: a Review of Applications in Natural Language Processing and Understanding, arXiv preprint arXiv:2103.11943 (2021).

[2] Aftan, Shah, A Survey on Bert and Its Applications, in: 2023 20th Learning and Technology Conference (L&T), IEEE, 2023, pp. 161-166.

[3] Xia, Wu, B. Van Durme, Which BERT? A Survey Organizing Contextualized Encoders, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7516-7533.

[4] Ma, Xu, Wang, Nallapati, Xiang, Domain Adaptation with Bert-based Domain Classification and Data Selection, in: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), 2019, pp. 76-83.

[5] Peng, E. Chersoni, Y.-Y. Hsu, C.-R. Huang, Is Domain Adaptation Worth your Investment? Comparing Bert and Finbert on Financial Tasks, in: Proceedings of the Third Workshop on Economics and Natural Language Processing, 2021, pp. 37-44.

[6] Chalkidis, Fergadiotis, Malakasiotis, Aletras, I. Androutsopoulos, LEGAL-BERT: The Muppets straight out of Law School, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898-2904.

[7] Saxena, Rethmeier, Van Dijck, G. Spanakis, VendorLink: An NLP approach for Identifying & Linking Vendor Migrants & Potential Aliases on Darknet Markets, arXiv preprint arXiv:2305.02763 (2023).

[8] B. W. Hung, S. R. Muramudalige, A. P. Jayasumana, Klausen, Libretti, E. Moloney, Renugopalakrishnan, Recognizing Radicalization Indicators in Text Documents using Human-in-the-Loop Information Extraction and NLP Techniques, in: 2019 IEEE International Symposium on Technologies for Homeland Security (HST), IEEE, 2019, pp. 1-7.

[9] Devlin, M.-W. Chang, Lee, Toutanova, Bert: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171-4186.

[10] Nayak, Timmapathini, Ponnalagu, V. G. Venkoparao, Domain Adaptation Challenges of Bert in Tokenization and Sub-word Representations of Out-of-vocabulary Words, in: Proceedings of the First Workshop on Insights from Negative Results in NLP, 2020, pp. 1-5.

[11] Benamar, Grouin, Bothua, Vilnat, Evaluating Tokenizers Impact on Oovs Representation with Transformers Models, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 4193-4204.

[12] Sushil, Suster, W. Daelemans, Are We There Yet? Exploring Clinical Domain Knowledge of Bert Models, in: Proceedings of the 20th Workshop on Biomedical Language Processing, 2021, pp. 41-53.

[13] Araci, FinBERT: Financial Sentiment Analysis with Pre-trained Language Models, arXiv preprint arXiv:1908.10063 (2019).

[14] Tai, H. T. Kung, Dong, Comiter, C.-F. Kuo, exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 1433-1439.

[15] Zheng, Guha, B. R. Anderson, Henderson, D. E. Ho, When does Pretraining Help? Assessing Self-supervised Learning for Law and the Casehold dataset of 53,000+ Legal Holdings, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law, 2021, pp. 159-168.

[16] Limsopatham, Effectively Leveraging Bert for Legal Document Classification, in: Proceedings of the Natural Legal Language Processing Workshop 2021, 2021, pp. 210-216.

[17] Lamproudis, Henriksson, Dalianis, Evaluating Pretraining Strategies for Clinical Bert Models, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 410-416.

[18] Chalkidis, Fergadiotis, Tsarapatsanis, Aletras, I. Androutsopoulos, Malakasiotis, Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 226-241. doi:10.18653/v1/2021.naacl-main.22.

[19] Sun, Qiu, Xu, Huang, How to Fine-tune Bert for Text Classification?, in: Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings 18, Springer, 2019, pp. 194-206.

[20] Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, Stoyanov, Roberta: A Robustly Optimized Bert Pretraining Approach, arXiv preprint arXiv:1907.11692 (2019).

[21] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The Long-document Transformer, arXiv:2004.05150 (2020).

[22] Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830.

[23] Lin, Bethard, Dligach, Sadeque, G. Savova, T. A. Miller, Does Bert need Domain Adaptation for Clinical Negation Detection?, Journal of the American Medical Informatics Association 27 (2020) 584-591.

[24] Elsafoury, Katsigiannis, Ramzan, On Bias and Fairness in NLP: How to have a fairer text classification?, arXiv preprint arXiv:2305.12829 (2023).