<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Meaningful Paragraph Embeddings for Data-Scarce Domains: A Case Study in the Legal Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elize Herrewijnen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dennis F W Craandijk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Police Lab AI, Utrecht University</institution>
          ,
          <addr-line>Utrecht</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Creating meaningful text embeddings using BERT-based language models involves pre-training on large amounts of data. For domain-specific use cases where data is scarce (e.g., the law enforcement domain), it might not be feasible to pre-train a whole new language model. In this paper, we examine how extending BERT-based tokenizers and further pre-training BERT-based models can benefit downstream classification tasks. As a proxy for domain-specific data, we use the European Court of Human Rights (ECtHR) dataset. We find that for downstream tasks, further pre-training a language model on a small domain dataset can rival models that are completely retrained on large domain datasets. This indicates that completely retraining a language model may not be necessary to improve downstream task performance. Instead, small adaptations to existing state-of-the-art language models like BERT may suffice.</p>
      </abstract>
      <kwd-group>
        <kwd>Transformers</kwd>
        <kwd>BERT</kwd>
        <kwd>Language Models</kwd>
        <kwd>Legal Text Classification</kwd>
        <kwd>ECtHR dataset</kwd>
        <kwd>Text Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.3. Further pre-training language models</title>
        <p>
          Since the introduction of BERT, many domain-specific language models have been released, for example in the clinical [<xref ref-type="bibr" rid="ref12">12</xref>], financial [<xref ref-type="bibr" rid="ref13">13</xref>], biomedical [<xref ref-type="bibr" rid="ref14">14</xref>], and legal [<xref ref-type="bibr" rid="ref15 ref6">6, 15</xref>] domains. Using embeddings from domain-specific language models has a positive effect on the performance of various downstream NLP models, because the text embeddings contain more domain-specific information.
        </p>
        <p>For the legal domain, Limsopatham [<xref ref-type="bibr" rid="ref16">16</xref>] compares the newly pre-trained models by Chalkidis et al. [<xref ref-type="bibr" rid="ref6">6</xref>] and Zheng et al. [<xref ref-type="bibr" rid="ref15">15</xref>] and finds that both legal domain-specific models outperform generic language models like BERT.</p>
        <p>However, these models inadequately encode long legal texts, as parts of the inputs are truncated to fit into the language model.</p>
        <p>In the clinical domain, Lamproudis et al. [<xref ref-type="bibr" rid="ref17">17</xref>] show that BERT models further pre-trained on in-domain data outperform generic BERT models after a single training epoch. In this paper, we investigate whether this also applies to the ECtHR dataset, which is representative of the legal domain.</p>
        <p>Proceedings of the Sixth Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2023), June 23, 2023, Braga, Portugal. * Corresponding author. e.herrewijnen@uu.nl (E. Herrewijnen); d.f.w.craandijk@uu.nl (D. F. W. Craandijk). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org). Footnote 1: https://github.com/UtrechtUniversity/Meaningful-Paragraph-Embeddings-for-Data-Scarce-Domains</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>Creating meaningful text embeddings requires multiple steps: first, a tokenizer model tokenizes the text. This tokenization is used by an encoder model to create an embedding. Finally, this embedding can be used by a predictor model to perform a downstream task. We now describe how the tokenizer, language model, and predictor can be modified to achieve meaningful embeddings in scarce-data domains.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>The European Court of Human Rights (ECtHR) handles alleged violations of European Convention on Human Rights (ECHR) articles (see https://www.echr.coe.int/Documents/Convention_ENG.pdf for an extensive description of the convention). We use this dataset as a proxy for law enforcement datasets, as, in our experience, these datasets often consist of long texts with domain jargon. The ECtHR dataset as introduced by Chalkidis et al. [<xref ref-type="bibr" rid="ref18">18</xref>] contains 11k legal cases, each consisting of facts (a list of paragraphs representing the facts of the case, such as events), allegedly violated articles, violated articles, silver allegation rationales (relevant facts identified using a regular expression), and gold allegation rationales (relevant facts annotated by a legal expert).</p>
        <p>To further pre-train our language models, we use all facts in the training split as used by Chalkidis et al. [<xref ref-type="bibr" rid="ref18">18</xref>], further split into a total of 588,090 sentences. For our downstream task, we use the violated articles as labels, resulting in a multi-label classification task. Due to the class imbalance in the dataset, we only retain the 10 most common classes (see Table 1), and adopt the same train, dev, and test splits as Chalkidis et al. [<xref ref-type="bibr" rid="ref18">18</xref>] for training the classification model. As shown in Table 1, article types vary in number of facts and number of characters, a difference we found to be statistically significant using a two-sample t-test.</p>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Art.</th><th>Name</th></tr>
            </thead>
            <tbody>
              <tr><td>6</td><td>Right to a fair trial</td></tr>
              <tr><td>P1-1</td><td>Protection of property</td></tr>
              <tr><td>5</td><td>Right to liberty and security</td></tr>
              <tr><td>3</td><td>Prohibition of torture</td></tr>
              <tr><td>13</td><td>Right to an effective remedy</td></tr>
              <tr><td>8</td><td>Right to respect for private and family life</td></tr>
              <tr><td>2</td><td>Right to life</td></tr>
              <tr><td>10</td><td>Freedom of expression</td></tr>
              <tr><td>14</td><td>Prohibition of discrimination</td></tr>
              <tr><td>11</td><td>Freedom of assembly and association</td></tr>
              <tr><td colspan="2">(Other articles)</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec>
        <title>3.2. Language models</title>
        <p>As baselines for our analysis, we select four BERT-based language models that have shown their applicability to NLP in the legal domain.</p>
        <p>BERT-ML: The BERT base multilingual cased model (BERT-ML) [<xref ref-type="bibr" rid="ref9">9</xref>] is a multi-language model pre-trained on the 104 languages with the largest Wikipedia corpora. It is a powerful model for capturing generic text data, and can effectively be fine-tuned for downstream tasks [<xref ref-type="bibr" rid="ref19">19</xref>].</p>
        <p>LEGAL-BERT: The LEGAL-BERT model is trained from scratch using the same approach as BERT, but on 12 GB of English legal texts (e.g., legislation, court cases, contracts) from publicly available sources [<xref ref-type="bibr" rid="ref6">6</xref>]. This model outperforms the BERT model when fine-tuned for legal classification tasks [<xref ref-type="bibr" rid="ref16">16</xref>].</p>
        <p>RoBERTa: The RoBERTa model by Liu et al. [<xref ref-type="bibr" rid="ref20">20</xref>] is a version of BERT that is trained on a much larger (10 times) English language corpus using a dynamic masking technique. This allows the model to produce more robust and generalizable embeddings, outperforming BERT on various NLP tasks [<xref ref-type="bibr" rid="ref20">20</xref>].</p>
        <p>Longformer: The Longformer model by Beltagy et al. [<xref ref-type="bibr" rid="ref21">21</xref>] builds on RoBERTa, but expands the max input length to 4096 tokens. The model is further pre-trained on large generic texts like news and web pages, and outperforms RoBERTa on long document NLP tasks [<xref ref-type="bibr" rid="ref21">21</xref>]. Note that the increased max input length renders the model more resource-expensive.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Tokenizer</title>
        <p>Effective text embeddings begin with the tokenization of the input text. A tokenizer tokenizes a text using a pre-defined vocabulary. If a word is not in the vocabulary, it is distributed across vocabulary tokens (e.g., applicant becomes app, lica, and nt). Due to their architecture, encoder models limit the max input length (usually 512 tokens). The tokenizer model should respect this limit, which usually results in input truncation. However, truncation may negatively affect downstream task performance [<xref ref-type="bibr" rid="ref6">6</xref>] as information is lost. A larger vocabulary reduces the number of tokens required to tokenize a text, allowing more information to be captured. While a large vocabulary might seem desirable, it also increases the number of parameters the encoder model has to learn, negatively affecting training time and memory requirements. Hence, a tokenizer should be able to capture as much relevant information as possible while keeping the number of parameters (i.e., the vocabulary) manageable.</p>
        <p>While a tokenizer that is specifically trained on domain data may be able to tokenize domain-specific texts most effectively, it may be unfeasible to train a new tokenizer; even when training data are available, the encoder model also needs to be retrained, which is a resource- and time-consuming task. Therefore, extending a tokenizer with domain-specific tokens may be more feasible. By adding domain-specific words, these words are not split up during tokenization, which leaves more space for other tokens. Moreover, the encoder model might be able to capture information concerning the domain-specific tokens, allowing more meaningful embeddings. For example, the LEGAL-BERT model (which contains domain-specific tokens) only requires a single token for the word ‘applicants’, while the BERT-ML tokenizer requires the tokens ‘app’, ‘lica’, and ‘nts’.</p>
        <p>We select the top 1% most common words in the dataset based on relative frequency using the Scikit-learn [<xref ref-type="bibr" rid="ref22">22</xref>] CountVectorizer, and add only the yet unknown tokens to the BERT-ML and RoBERTa tokenizers. As shown in Table 2, novel words are related to the legal domain, for example ‘applicant’, ‘prosecutor’, ‘detention’ and month names. In total, 25 and 9 new words are added to the tokenizer vocabularies, respectively.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Encoder models</title>
        <p>We use the extended tokenizers to further pre-train two encoder models on the ECtHR training set on a machine with two 50 GB NVIDIA RTX A6000 cards (note that the training set is only 85 MB). Using the script provided by Devlin et al. [<xref ref-type="bibr" rid="ref9">9</xref>], we further pre-train the BERT-ML model for 1 epoch with a batch size of 16, which takes approximately 40 minutes. Using the script provided by Beltagy et al. [<xref ref-type="bibr" rid="ref21">21</xref>], we convert a RoBERTa model to a Longformer model, and further pre-train the model for 3000 steps with a batch size of 24, which takes approximately 2 days. We refer to these further pre-trained encoder models as BERT-MLf and Longformerf.</p>
      </sec>
      <sec>
        <title>3.5. Classification model</title>
        <p>We employ a convolutional neural network to classify the documents: for every fact in the document, an embedding is retrieved using one of the models from Section 3.2; then, the list of embeddings is stacked and fed to the network. The network consists of three 1-dimensional convolutional layers (768 × 768, kernel size 1), followed by three linear layers (768 × 10). Finally, the mean of predictions for all facts is taken to compute the final prediction. A benefit of this stacked approach is that every fact receives an embedding, retaining more information than creating a single embedding for the whole document by concatenating facts. The model is trained using weighted BCE loss and the Adam optimizer, for 15 epochs (no early stopping) on a machine with two 25 GB NVIDIA GeForce RTX 3090 cards (more model training details can be found on the GitHub page). Note that the parameters of the encoder model as described in the previous subsection remain frozen. Furthermore, the focus of this paper lies on finding meaningful embeddings, and not on the classification accuracy of the classification model: we investigate how well the different embeddings allow the classification model to learn the task.</p>
      </sec>
      <sec>
        <title>4. Results</title>
        <p>In the following section, we discuss our results for both tokenization and classification.</p>
      </sec>
      <sec>
        <title>4.1. Tokenization</title>
        <p>We compare the tokenization results of the tokenizer models introduced in Section 3.2 by tokenizing the complete ECtHR dataset. Specifically, we note the following:</p>
        <list list-type="bullet">
          <list-item>
            <p>The mean number of tokens required to tokenize a document (TD);</p>
          </list-item>
        </list>
        <p>[Table 3 column headers: I, V, TD, UT, mDT ↓, tDT ↓]</p>
        <p>For all of the above, lower values indicate a more efficient tokenizer. The results reported
in Table 3 show that the LEGAL-BERT tokenizer is most
efficient in tokenizing input texts. The tokenizer requires
the fewest tokens to tokenize documents, discards the
fewest tokens in comparison to other 512-limited
tokenizers, while also having the smallest vocabulary. The
Longformer models discard the fewest tokens overall,
but require more tokens than the LEGAL-BERT tokenizer.</p>
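        <p>To make the tokenizer measures above concrete, the following is a minimal sketch, not the tokenizers used in this paper. It assumes a toy greedy longest-match subword tokenizer (a simplification of WordPiece without continuation prefixes or special tokens) and computes the mean number of tokens per document and the mean number of tokens discarded by a maximum input length. The names tokenize and tokenizer_stats, the toy vocabularies, and max_len=5 are illustrative assumptions.</p>
        <preformat>
```python
from statistics import mean


def tokenize(word, vocab):
    """Greedy longest-match subword split (simplified WordPiece-style scheme)."""
    pieces, start = [], 0
    while start != len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No known piece starts here: the whole word becomes unknown.
            return ["[UNK]"]
        pieces.append(word[start:end])
        start = end
    return pieces


def tokenizer_stats(documents, vocab, max_len=512):
    """Return (mean tokens per document, mean discarded tokens per document)."""
    lengths = [
        sum(len(tokenize(word, vocab)) for word in doc.lower().split())
        for doc in documents
    ]
    # Tokens beyond the maximum input length are lost to truncation.
    discarded = [max(0, n - max_len) for n in lengths]
    return mean(lengths), mean(discarded)


docs = ["the applicant was detained", "the applicant appealed"]
base_vocab = {"the", "app", "lica", "nt", "was", "de", "tained", "appeal", "ed"}
extended_vocab = base_vocab | {"applicant"}  # one added domain-specific word

print(tokenizer_stats(docs, base_vocab, max_len=5))      # (6.5, 1.5)
print(tokenizer_stats(docs, extended_vocab, max_len=5))  # (4.5, 0)
```
        </preformat>
        <p>In this toy setting, adding a single domain word shortens both documents enough that nothing is truncated; on real data the effect depends on document length and on how often the added words occur.</p>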
        <p>Extending existing tokenizers slightly decreases the number of discarded tokens (an average of 2 for both tokenizers). Thus, retraining the tokenizer model decreases the amount of removed information, but may still be insufficient for long documents.</p>
        <p>[Table fragment: .49, .39, .14, .22, .23, .0, .43, .0, .0, .0; label: Longformerf]</p>
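        <p>The vocabulary extension step described in Section 3.3 (select the most frequent words in the corpus, then keep only those the tokenizer does not yet know) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses collections.Counter from the Python standard library as a stand-in for the Scikit-learn CountVectorizer, splits words with a simple lowercase regular expression, and the function name select_new_tokens, the toy corpus, and the toy vocabulary are hypothetical.</p>
        <preformat>
```python
import math
import re
from collections import Counter


def select_new_tokens(documents, known_vocab, top_fraction=0.01):
    """Return frequent corpus words that the tokenizer vocabulary lacks."""
    counts = Counter(
        word
        for doc in documents
        for word in re.findall(r"[a-z]+", doc.lower())
    )
    # Keep at least one word so that very small corpora still select something.
    n_top = max(1, math.ceil(len(counts) * top_fraction))
    frequent = [word for word, _ in counts.most_common(n_top)]
    # Only words that are not yet in the vocabulary are candidates to add.
    return [word for word in frequent if word not in known_vocab]


corpus = [
    "The applicant complained.",
    "The applicant was held in detention.",
    "The prosecutor ordered detention.",
]
vocab = {"the", "was", "in", "held"}  # toy stand-in for a tokenizer vocabulary

# A large fraction is used here only because the toy corpus is tiny.
new_tokens = select_new_tokens(corpus, vocab, top_fraction=0.3)
print(new_tokens)
```
        </preformat>
        <p>For readers working with the Hugging Face transformers library (an assumption; the paper does not name the tooling for this step), such words can typically be registered with tokenizer.add_tokens(new_tokens), followed by model.resize_token_embeddings(len(tokenizer)) so that the encoder gains embedding rows for the new tokens before further pre-training.</p>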
      </sec>
    </sec>
    <sec id="sec-4-1">
      <title>4.2. Classification</title>
      <p>As the classification task is an unbalanced multi-label problem, we report the F1-scores in Table 4. We focus on the classification model’s ability to identify individual classes, instead of the average F1-score. If the classification model is unable to identify a class (i.e., F1 = 0), we take this as an indication that the embedding does not contain relevant information about that class. Related work has noted that the multi-label classification task is difficult to solve [<xref ref-type="bibr" rid="ref18">18</xref>]. Our classification performance is also fair, but a clear difference between embeddings is visible:</p>
      <list list-type="bullet">
        <list-item>
          <p>BERT-MLf embeddings outperform BERT-ML embeddings on most classes, indicating that extending existing tokenizers and further pre-training</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-4">
      <title>5. Limitations and future work</title>
      <p>This work mainly focuses on the effect of further pre-training BERT-based language models on limited domain-specific data. As we do not investigate or optimize the pre-training procedure of our BERT models, a highly relevant point for future work is investigating how BERT models can be (more) effectively (further) pre-trained on (scarce) domain-specific data. Furthermore, we used a multilingual BERT model as a starting point, which may negatively affect performance on downstream tasks.</p>
      <p>Another limitation is that the performance of the classification model (Section 4.2) is rather low, which is due to the minimal effort put into the model. Related work (e.g., Chalkidis et al. [<xref ref-type="bibr" rid="ref18">18</xref>]) shows much higher F1-scores using more advanced (and tested) classification models. Moreover, a more thorough error analysis might give insight into the documents that are typically misclassified by the classification model, and how pre-training the encoder models impacts classification behaviour.</p>
      <p>A point of caution is that pre-training a language model like BERT on domain data may introduce domain-specific bias, especially when the domain dataset misrepresents identity groups (e.g., males are over-represented) [<xref ref-type="bibr" rid="ref24">24</xref>]. To apply language models like BERT in the law enforcement domain, the possibility of introduced bias should be investigated in future work.</p>
      <p>Finally, a limitation is the generalizability of the dataset and tasks; this work only looks at the effect of pre-training on one well-known domain-specific dataset (ECtHR) and one task (violated article classification). We expect that our findings generalize across other domain-specific datasets and tasks, especially for long texts with domain jargon. Nevertheless, future work is required to further validate this expectation.</p>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusion &amp; discussion</title>
      <p>In this paper, we investigate the effect of further pre-training large language models on domain-specific data. To test this for data-scarce domains, we use the ECtHR dataset as a surrogate (Section 3.1), and further pre-train a BERT-ML and a Longformer language model on this data.</p>
      <p>We find that extending tokenizers with domain-specific tokens reduces the number of tokens discarded, albeit slightly (Section 4.1). Retraining a tokenizer results in a much more efficient tokenization, but also requires more data and retraining an encoder model from scratch, which might be unfeasible. In a data-scarce or resource-scarce setting, extending the tokenizer may be a good alternative, as less data is required to further pre-train the encoder model.</p>
      <p>Embeddings constructed by the original BERT-ML adequately encode legal domain-specific information, but a completely retrained language model may be beneficial for some classification problems (Section 4.2). Moreover, in scarce-data settings, further pre-training BERT-based models using small amounts of data may be a feasible alternative to training a language model from scratch. In particular, the combination of adding domain-specific tokens to the tokenizer and further pre-training the language model on a small dataset is a promising direction for future research. Whether our findings generalize across other domains and tasks is a question for future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Koroteev</surname>
          </string-name>
          ,
          <article-title>Bert: a Review of Applications in Natural Language Processing</article-title>
          and Understanding,
          <source>arXiv preprint arXiv:2103.11943</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Aftan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>A Survey on Bert and Its Applications</article-title>
          ,
          <source>in: 2023 20th Learning and Technology Conference (L&amp;T)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>161</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Durme</surname>
          </string-name>
          ,
          <article-title>Which BERT? A Survey Organizing Contextualized Encoders</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>7516</fpage>
          -
          <lpage>7533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nallapati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <article-title>Domain Adaptation with Bert-based Domain Classification and Data Selection</article-title>
          ,
          <source>in: Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo</source>
          <year>2019</year>
          ),
          <year>2019</year>
          , pp.
          <fpage>76</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Peng</surname>
          </string-name>
          , E. Chersoni,
          <string-name>
            <given-names>Y.-Y.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-R.</given-names>
            <surname>Huang</surname>
          </string-name>
          , Is Domain Adaptation Worth your Investment?
          <article-title>Comparing Bert and Finbert on Financial Tasks</article-title>
          ,
          <source>in: Proceedings of the Third Workshop on Economics and Natural Language Processing</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          , LEGAL-BERT:
          <article-title>The Muppets straight out of Law School, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>2898</fpage>
          -
          <lpage>2904</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rethmeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Van Dijck</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Spanakis, VendorLink: An NLP approach for Identifying &amp; Linking Vendor Migrants &amp; Potential Aliases on Darknet Markets</article-title>
          ,
          <source>arXiv preprint arXiv:2305.02763</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B. W.</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Muramudalige</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Jayasumana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Klausen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Libretti</surname>
          </string-name>
          , E. Moloney,
          <string-name>
            <given-names>P.</given-names>
            <surname>Renugopalakrishnan</surname>
          </string-name>
          ,
          <article-title>Recognizing Radicalization Indicators in Text Documents using Human-in-the-Loop Information Extraction and NLP Techniques, in: 2019 ieee international symposium on technologies for homeland security (hst)</article-title>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nayak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Timmapathini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ponnalagu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. G.</given-names>
            <surname>Venkoparao</surname>
          </string-name>
          ,
          <article-title>Domain Adaptation Challenges of Bert in Tokenization and Sub-word Representations of Out-of-vocabulary Words</article-title>
          ,
          <source>in: Proceedings of the First Workshop on Insights from Negative Results in NLP</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Benamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grouin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bothua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vilnat</surname>
          </string-name>
          ,
          <article-title>Evaluating Tokenizers Impact on Oovs Representation with Transformers Models</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>4193</fpage>
          -
          <lpage>4204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sushil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Suster</surname>
          </string-name>
          , W. Daelemans, Are We There Yet?
          <article-title>Exploring Clinical Domain Knowledge of Bert Models</article-title>
          ,
          <source>in: Proceedings of the 20th Workshop on Biomedical Language Processing</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>53</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Araci</surname>
          </string-name>
          ,
          <article-title>FinBERT: Financial Sentiment Analysis with Pre-trained Language Models</article-title>
          ,
          <source>arXiv preprint arXiv:1908.10063</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Tai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Kung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Comiter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-F.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <article-title>exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP 2020</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>1433</fpage>
          -
          <lpage>1439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <article-title>When Does Pretraining Help? Assessing Self-supervised Learning for Law and the CaseHOLD Dataset of 53,000+ Legal Holdings</article-title>
          ,
          <source>in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Limsopatham</surname>
          </string-name>
          ,
          <article-title>Effectively Leveraging BERT for Legal Document Classification</article-title>
          ,
          <source>in: Proceedings of the Natural Legal Language Processing Workshop 2021</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>210</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lamproudis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Henriksson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dalianis</surname>
          </string-name>
          ,
          <article-title>Evaluating Pretraining Strategies for Clinical BERT Models</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>410</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fergadiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsarapatsanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Aletras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          ,
          <article-title>Paragraph-level Rationale Extraction through Regularization: A Case Study on European Court of Human Rights Cases</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>241</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.18653/v1/2021.naacl-main.22</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>How to Fine-tune BERT for Text Classification?</article-title>
          ,
          <source>in: Chinese Computational Linguistics: 18th China National Conference, CCL 2019, Kunming, China, October 18-20, 2019, Proceedings 18</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>194</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          ,
          <source>arXiv preprint arXiv:1907.11692</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
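          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <article-title>Longformer: The Long-document Transformer</article-title>
          ,
          <source>arXiv preprint arXiv:2004.05150</source>
          (
          <year>2020</year>
          ).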
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          ,
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dligach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sadeque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Savova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Does BERT Need Domain Adaptation for Clinical Negation Detection?</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>27</volume>
          (
          <year>2020</year>
          )
          <fpage>584</fpage>
          -
          <lpage>591</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>F.</given-names>
            <surname>Elsafoury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Katsigiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ramzan</surname>
          </string-name>
          ,
          <article-title>On Bias and Fairness in NLP: How to have a fairer text classification?</article-title>
          ,
          <source>arXiv preprint arXiv:2305.12829</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>