<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BERTinchamps: Cost-effective Training of Large Language Models for Medical Tasks in French</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Amaury Fierens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sébastien Jodogne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information and Communication Technologies, Electronics and Applied Mathematics, Louvain School of Engineering</institution>
          ,
          <addr-line>UCLouvain, 1348 Louvain-la-Neuve</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Many medical applications are envisioned for Large Language Models (LLMs), such as the automated summary of the health condition of a patient, or the automated codification of electronic health records. Even though training LLMs directly inside hospitals is highly desirable to exploit the local clinical data while avoiding data privacy concerns, this process requires a costly, complex computing infrastructure. This paper explores the recent Cramming approach as a cost-effective way to train LLMs within medical institutions in one day using one GPU. We show that the Cramming approach, which was originally designed for English, can be transposed to French, and that the resulting models can be successfully fine-tuned to healthcare-related tasks in the French language. This research opens the path to the creation of LLMs that are tailored to the specific needs of institutions that handle sensitive textual data in languages other than English.</p>
      </abstract>
      <kwd-group>
        <kwd>Large language models</kwd>
        <kwd>Medical documents</kwd>
        <kwd>Downscaled training</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The field of Natural Language Processing (NLP) is currently attracting a lot of attention in the
context of healthcare [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Indeed, automating tasks such as the codification of electronic
health records (EHRs) could be highly valuable to monitor the quality of treatments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], to help
with hospital payment reimbursement [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], to provide a summary of the health condition of a
patient [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or to detect diseases at an early stage by using clinical codes as biomarkers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        In particular, the recent major advances in the field of Large Language Models (LLMs) are
opening great opportunities for NLP [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. A growing number of physicians are enthusiastic
about using modern tools such as the well-known ChatGPT chatbot to help with medical
tasks [
        <xref ref-type="bibr" rid="ref10 ref2">2, 10</xref>
        ]. As of June 2023, ChatGPT internally uses the closed-source, proprietary LLMs
GPT-3.5 and GPT-4 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which introduces a strong dependency upon the proprietary infrastructure
of the OpenAI platform. However, the protection of patient privacy prevents the direct use of such
proprietary LLMs in a clinical context because they are generally cloud-based, while regulations
such as the General Data Protection Regulation (GDPR) in Europe forbid medical data from
leaving hospitals unless it has been at least pseudonymized. Similar difficulties are encountered
when a hospital seeks to exploit LLMs in the context of clinical research.
      </p>
      <p>
        This calls for the development of LLMs that can be entirely self-hosted inside the infrastructure
of a hospital. Self-hosting is evidently highly desirable for inference on the EHRs of the hospital.
However, besides inference, it is also important to be able to train LLMs inside the hospitals, to
fine-tune them to the population of patients of one hospital or of one clinical department of
interest. Self-hosting can be notably achieved by taking advantage of LLMs whose architecture
has been published as open-source code, and whose pre-trained weights are available as open
data. Early LLMs available as open-source and open-data include BERT [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and GPT-2 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
More advanced models such as BLOOM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Cerebras-GPT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], or LLaMA [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] are now
available.
      </p>
      <p>
        A difficulty with open, general-purpose LLMs is that they have been primarily trained on
English datasets, without specialization in clinical language. This has motivated researchers
to train LLMs using corpora containing medical documents. This is possible for English, for
which corpora of sufficient size have emerged over the years. For instance, BioBERT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
was trained on a dataset made of PubMed abstracts, while ClinicalBERT [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] was trained on
MIMIC-III [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. This shows that even though the BERT architecture has a number of parameters
that is much smaller than more recent models, the variations of BERT can still be considered as
compact Large Language Models with interesting applications related to healthcare.
Unfortunately, there is still a lack of large-scale medical corpora for most languages besides English. In
the context of French, only a handful of pre-trained LLMs for the clinical language are currently
available. Those include DrBERT [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] that leverages the RoBERTa architecture [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and that was
trained from scratch on the biomedical corpus NACHOS, which is not yet publicly available at
the time of writing. The evaluation of such models on real-world clinical tasks is still a work in
progress due to the lack of a sufficiently large amount of domain-specific data in French.
      </p>
      <p>
        An alternative to the use of LLMs that are pre-trained for healthcare applications would be to
train the LLMs locally, inside the hospital, directly on the EHRs it hosts. This approach would
have the great advantage of training models that reflect the local patient population of the
hospital, while bringing privacy by design. By training a BERT model from scratch directly on
the local electronic health records of the hospital, the resulting language model could be better
adapted to healthcare-related tasks in that specific hospital, with respect to one model that would
have been obtained from the fine-tuning of a general French model such as CamemBERT [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
However, it is commonly considered that training a sufficiently expressive LLM from scratch
is extremely demanding in terms of time and computational resources, as standard training
processes require dozens of days of computation on powerful Graphics Processing Units (GPUs)
that come at a high cost. Nevertheless, recent work has shown that it is possible to drastically
reduce the training time of BERT-like LLMs by slightly modifying the BERT architecture and
the way datasets are preprocessed [22]. This simplification is referred to as "Cramming", and the
resulting LLM is called "crammed BERT." To the best of our knowledge, the Cramming approach
has only been studied in the context of the English language and has not been applied to the
medical field so far.
      </p>
      <p>In this paper, we investigate the use of the Cramming recipe to train LLMs on
healthcare-related tasks in French. Our results show that our crammed BERT model, referred to as
BERTinchamps, achieves a performance that is close to that of DrBERT on selected tasks related
to the medical field, while requiring only a single day of training. This contribution opens
the path to the training of LLMs directly inside institutions that generate sensitive textual data,
such as hospitals, at a reasonable cost, while preserving the privacy of data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The idea of optimizing an LLM architecture to train it using fewer resources has already been
explored in the literature. In 2015, the idea of Knowledge Distillation was introduced in neural
networks [23], which enabled the training of smaller models from a bigger one, with little loss
in performance. This approach was used to create distilled versions of well-known LLMs such
as DistilBERT [24] and DistilGPT-2 [25]. Another technique consists in the quantization of
the model. Since its introduction in 1990 [26], quantization has been widely applied to the
Transformer-based architectures that underpin BERT, for instance in BinaryConnect [27], in
the Bondarenko et al. paper that introduced quantization for BERT [28], in Q8BERT [29], or in
BinaryBERT [30]. More recently, QLoRA [31] was proposed as an efficient fine-tuning
method for quantized models.</p>
      <p>While those methods are extremely useful to reduce the size of an already existing LLM,
they are not designed to train LLMs from scratch at a decent cost. The Cramming recipe is
a recent contribution to serve this purpose [22]. The main goal of Cramming is to modify
the architecture and the training process of classical BERT-like models, while adapting the
preprocessing of the data, in order to determine how well such so-called "crammed BERT"
models can perform after having been trained on one single GPU for one single day.</p>
      <p>The Cramming approach is motivated by the scaling laws described by Kaplan et al. [32] that
hold in the low-resource regime. These scaling laws suggest that it is not necessarily useful to
reduce the number of parameters of BERT-like architectures. Instead of reducing the number of
parameters, the Cramming recipe optimizes the model architecture and adjusts the training
setup. Cramming also explores architectural enhancements that speed up the computation of
the gradients. Slight improvements are obtained by disabling the QKV biases in the multi-head
attention block [33], together with changes in the embedding blocks. The Cramming recipe also
proposes hyperparameters that are adapted to the training of crammed BERT models. Finally,
careful selection and processing of the training data is applied to extract well-suited tokens,
enhancing the overall performance of the crammed models. These contributions have been
shown to bring significant improvements, enabling the fast training of BERT-like models for
English. However, the application of the Cramming recipe to other languages is still largely
unexplored. Our paper steps into this gap, by mirroring the advancements of the Cramming
approach in the French language, and by exploring how well crammed BERT models for French
behave on healthcare-related tasks.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>In this section, we describe our methodology and the parameters we used to mirror the
Cramming recipe in the French language. The resulting crammed BERT model is referred to as
BERTinchamps. The fine-tuning of BERTinchamps on selected healthcare-related tasks is then
discussed.</p>
      <sec id="sec-3-1">
        <title>3.1. Pre-training</title>
        <p>
          BERTinchamps was trained on the French part of the OSCAR dataset [34]. The OSCAR dataset
is an extensive multilingual corpus obtained through language classification and filtering of the
Common Crawl corpus using the goclassy architecture. A subset of 17GB of the French part
of the OSCAR dataset was selected. Tokens were extracted using the WordPiece algorithm [35],
with a vocabulary size of 32768. After tokenization, the dataset size was 37GB. The
cross-entropy loss was optimized, as is usually the case when training BERT models. However,
to accelerate the training, the context was reduced from a maximum sequence length of 512
tokens to 128 tokens, as recommended in the Cramming paper. In the same vein, the training
objective was kept as masked language modeling, with the same masking proportions as in the
original setup described by Devlin et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
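        <p>The masked language modeling objective mentioned above can be illustrated with a minimal sketch. It follows the proportions of Devlin et al.: 15% of the positions are selected, of which 80% are replaced by a mask token, 10% by a random token, and 10% are left unchanged. Whether BERTinchamps used exactly this selection rate is an assumption, and the mask token id and the helper name <monospace>mask_tokens</monospace> are hypothetical.</p>
        <preformatted><![CDATA[
```python
import random

MASK_TOKEN_ID = 4      # hypothetical id of the [MASK] token
VOCAB_SIZE = 32768     # vocabulary size reported in this section
IGNORE_INDEX = -100    # label for positions that are not predicted


def mask_tokens(token_ids, rng, select_prob=0.15):
    """Return (inputs, labels) for one masked language modeling example."""
    inputs, labels = [], []
    for token in token_ids:
        if rng.random() >= select_prob:
            # Position not selected: keep the token, no prediction target.
            inputs.append(token)
            labels.append(IGNORE_INDEX)
            continue
        labels.append(token)  # the model must recover the original token
        draw = rng.random()
        if draw < 0.8:
            inputs.append(MASK_TOKEN_ID)              # 80%: mask token
        elif draw < 0.9:
            inputs.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
        else:
            inputs.append(token)                      # 10%: unchanged
    return inputs, labels


rng = random.Random(0)
# A stand-in sequence of 128 token ids, the context length used for pre-training.
sequence = [rng.randrange(5, VOCAB_SIZE) for _ in range(128)]
inputs, labels = mask_tokens(sequence, rng)
```
]]></preformatted>
        <p>The cross-entropy loss is then computed only on the positions whose label differs from <monospace>IGNORE_INDEX</monospace>.</p>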
        <p>AdamW was used as the optimizer [36], which is a modified version of the Adam optimizer [37]
where weight decay is performed after controlling the parameter-wise step size. The parameters
of AdamW were set as follows: weight decay of 0.01, β<sub>1</sub> = 0.9, β<sub>2</sub> = 0.98, and ε = 10<sup>−12</sup>.
Gradient clipping at 0.5 was also applied. The learning rate was set to 10<sup>−3</sup> and, contrary
to the original instructions of the Cramming recipe, a slanted triangular learning rate was
used as the scheduler, with a base percentage of 25% and a falloff of 0.25 as parameters. This
adaptation was experimentally found to be more efficient for the French language than the
one-cycle learning rate used for the English language in the original paper. This might be caused
either by the different language structures, or by the content of the training dataset itself.</p>
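        <p>As an illustration, a slanted triangular schedule could be written as follows. This is only a sketch under one plausible reading of the two parameters, which is an assumption: the learning rate starts at the base percentage of its peak value, rises linearly until the last falloff fraction of the training, then decays linearly to zero. The function name and the use of a progress value between 0 and 1 are hypothetical.</p>
        <preformatted><![CDATA[
```python
def triangular_lr(progress, peak_lr=1e-3, base_percentage=0.25, falloff=0.25):
    """Learning rate at a training progress between 0.0 and 1.0.

    Assumed reading of the parameters (not confirmed by the text):
    - base_percentage: fraction of peak_lr used at the very first step;
    - falloff: final fraction of the training spent decaying to zero.
    """
    # Rising edge: from base_percentage * peak_lr up to peak_lr at 1 - falloff.
    rise = base_percentage + progress * (1.0 - base_percentage) / (1.0 - falloff)
    # Falling edge: from peak_lr at 1 - falloff down to zero at the end.
    fall = (1.0 - progress) / falloff
    return peak_lr * max(0.0, min(rise, fall))


# Learning rate sampled at every 10% of the training budget.
schedule = [triangular_lr(step / 10) for step in range(11)]
```
]]></preformatted>
        <p>With a fixed 24-hour budget, the progress can be measured as elapsed time rather than as a step count, so that the schedule ends exactly when the budget is spent.</p>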
        <p>Because the Cramming setup is limited to a single GPU, a micro-batch size of 128 was used,
and gradients were accumulated over several micro-batches to reach a batch size of 4096. This
batch size is scheduled by linearly increasing the number of averaged micro-batches over the
training time. The model was trained for 24 hours, as required by the Cramming recipe, on one
NVIDIA A100 GPU with 40GB of memory, resulting in BERTinchamps. Figure 1 plots the evolution
of the Masked Language Model (MLM) loss during the training.</p>
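        <p>The relationship between the micro-batch size and the batch size can be made explicit with a short sketch: 4096 / 128 = 32 micro-batches are averaged per optimizer step once the ramp is complete. The starting value of one micro-batch and a ramp that spans the whole training are assumptions, and the function name is hypothetical.</p>
        <preformatted><![CDATA[
```python
MICRO_BATCH_SIZE = 128
BATCH_SIZE = 4096
FINAL_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE  # 32 micro-batches per update


def accumulation_steps(progress, final_steps=FINAL_STEPS, initial_steps=1):
    """Number of micro-batches averaged per optimizer step.

    `progress` is the fraction of the training budget already spent; the
    count grows linearly from `initial_steps` (assumed) to `final_steps`.
    """
    progress = min(max(progress, 0.0), 1.0)
    return initial_steps + round(progress * (final_steps - initial_steps))


# Effective batch size at every 10% of the training budget.
effective_batch_sizes = [
    accumulation_steps(step / 10) * MICRO_BATCH_SIZE for step in range(11)
]
```
]]></preformatted>
        <p>Only one micro-batch has to fit in GPU memory at a time, which is what makes a batch size of 4096 reachable on a single GPU.</p>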
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Fine-tuning</title>
        <p>
          In a first phase, three of the four datasets in the FLUE benchmark were used to assess the
overall performance of BERTinchamps. FLUE is an equivalent of the GLUE benchmark for
the French language [38]. In a second phase, to determine whether such a crammed model
was promising for tasks related to the medical domain, BERTinchamps was fine-tuned on
QUAEROFrenchMed [39], a French medical corpus that is made of two datasets, EMEA and
MEDLINE. The use of QUAEROFrenchMed enables the comparison of BERTinchamps against
DrBERT [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], an LLM for biomedical tasks in French.
        </p>
        <p><bold>3.2.1. FLUE</bold></p>
        <p>
          The FLUE benchmark is widely used to evaluate LLMs for the French language. It has notably
been used to test CamemBERT [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and FlauBERT [38], two of the most powerful LLMs for
French at the time of writing. BERTinchamps was evaluated on three of the four datasets that
are included in the FLUE benchmark, namely the XNLI, CLS and PAWS-X datasets. Each
FLUE dataset serves a specific role. The XNLI dataset, a French subset of MNLI, is related to the
Natural Language Inference (NLI) task that identifies logical relationships between premise and
hypothesis sentences. The CLS dataset is employed for text classification, using star-based labels
to categorize Amazon reviews of books, DVDs, and music. Although the dataset is originally split into
these three categories of reviews, we posit that classifier performance is not significantly
affected by merging them. The PAWS-X dataset targets paraphrase
identification, where paired sentences are labeled 1 for semantic equivalence and 0 otherwise.</p>
        <p>Most of the FLUE tasks involve fine-tuning for sequence classification, for which the approach
described in the Cramming paper was used. The AdamW optimizer was accordingly used for
the fine-tuning, with parameters β<sub>1</sub> = 0.9, β<sub>2</sub> = 0.98, ε = 10<sup>−6</sup>, and a smaller learning rate of
4 · 10<sup>−5</sup>. The cosine-decay scheduler was experimentally found to provide better performance
for the fine-tuning. The model was trained for 10 epochs on CLS and PAWS-X, and for 5 epochs
on XNLI. The training batch size was set to 16, while the testing batch size was set to 128.</p>
        <p><bold>3.2.2. QUAEROFrenchMed</bold></p>
        <p>The QUAEROFrenchMed benchmark is a medical Named Entity Recognition task in the French
language, whose purpose is to associate a clinical entity with each token of a medical text. The
QUAEROFrenchMed benchmark is made of two distinct datasets, namely EMEA and MEDLINE,
that share the same clinical entities of interest. MEDLINE is composed of many short sentences
coming from MEDLINE article titles, while EMEA is composed of a few long documents coming
from drug descriptions.</p>
        <p>The QUAEROFrenchMed benchmark involves the classification of tokens, in contrast
with the FLUE benchmark, for which sequence classification fine-tuning was needed. To this
end, a linear layer was added at the end of the BERTinchamps pre-trained model, with an
output size equal to the number of clinical entities in QUAEROFrenchMed. This linear layer
was trained using the cross-entropy loss. Moreover, the original datasets had to be adapted to meet
the requirements associated with this classification setup. For MEDLINE, the annotations had
to be processed, as some of the entities had multiple spans. As far as EMEA is concerned, the
dataset was first brought into the same form as MEDLINE by splitting the long documents into their individual
sentences. The annotations also had to be processed for the same reason as for MEDLINE. Each
word of both datasets was tokenized and each sentence of EMEA was identified using the French
blank model of the spaCy Python package [40] along with its Sentencizer tool. For both EMEA
and MEDLINE, all the resulting words, along with their labels, were stored as a JSON file for
further processing.</p>
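        <p>The conversion of span annotations into word-level labels can be sketched as follows. This is not the actual preprocessing code: a regular expression stands in for the spaCy tokenizer, the example sentence and its character offsets are invented for illustration, and the function name is hypothetical. The sketch shows how an entity made of several spans can be mapped onto individual words before export to JSON.</p>
        <preformatted><![CDATA[
```python
import json
import re


def label_words(text, entities, default_label="0"):
    """Assign one label per word from character-span annotations.

    `entities` is a list of (label, spans) pairs, where `spans` is a list
    of (start, end) character offsets; an entity may have several spans.
    """
    # Simplistic word splitting (assumption: stands in for spaCy).
    words = [(m.group(), m.start(), m.end())
             for m in re.finditer(r"\w+|[^\w\s]", text)]
    labels = [default_label] * len(words)
    for label, spans in entities:
        for start, end in spans:
            for index, (_, word_start, word_end) in enumerate(words):
                # A word takes the label of any span that overlaps it.
                if word_start < end and word_end > start:
                    labels[index] = label
    return [{"word": word, "label": label}
            for (word, _, _), label in zip(words, labels)]


# Invented example: the second entity is discontinuous (two spans).
text = "Douleur thoracique et abdominale"
entities = [
    ("DISO", [(0, 18)]),            # "Douleur thoracique"
    ("DISO", [(0, 7), (22, 32)]),   # "Douleur ... abdominale"
]
records = label_words(text, entities)
serialized = json.dumps(records, ensure_ascii=False)
```
]]></preformatted>
        <p>When a word is later split into several sub-word tokens, its label is propagated to each of them, which is why the support counts depend on the tokenizer.</p>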
        <p>The AdamW optimizer was again used to fine-tune BERTinchamps on the EMEA and
MEDLINE tasks. The default implementation of the AdamW optimizer in the PyTorch package was
used [41], with a smaller learning rate of 10<sup>−4</sup>. The slanted triangular learning rate was used
as the scheduler, as it provided better performance in this case. The model was trained for 100
epochs, for both EMEA and MEDLINE. Both the training and testing batch sizes were set to 8.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>This section first presents the overall performance of the BERTinchamps model on
the FLUE benchmark. The specific capabilities of the model related to the medical
language are then tested on the QUAEROFrenchMed benchmark.</p>
      <p><bold>4.1. FLUE</bold></p>
      <p>As explained in Section 3.2.1, BERTinchamps was compared to CamemBERT and FlauBERT,
two state-of-the-art LLMs for the French language, on the CLS, PAWS-X, and XNLI tasks of
the FLUE benchmark. Because the BERTinchamps model is a crammed BERT model with 110
million parameters, versions of CamemBERT and FlauBERT with a comparable number of
parameters were considered (i.e., CamemBERT<sub>base</sub> and FlauBERT<sub>base</sub>, which respectively contain
110 and 138 million parameters). Table 1 reports the final accuracy of each model on each
task. The results show a difference of less than 4% in performance between the BERTinchamps
crammed model and the state-of-the-art models.</p>
      <p>Notes to Table 1: † results reported in the FlauBERT paper [38]; * results averaged over the 3 categories.</p>
      <sec id="sec-4-1">
        <title>4.2. QUAEROFrenchMed</title>
        <p>The capabilities of BERTinchamps on healthcare-related tasks in the French language were
evaluated on EMEA and MEDLINE, the two datasets of the QUAEROFrenchMed benchmark. To
this end, BERTinchamps was compared to both DrBERT, a recent LLM trained on biomedical data,
and CamemBERT. The three LLMs were fine-tuned on QUAEROFrenchMed for the classification
of tokens, using the experimental setup described in Section 3.2.2. The results are reported in
Table 2. As can be seen in this table, the accuracy of BERTinchamps is close to that of DrBERT, and
slightly lower than that of CamemBERT on the investigated tasks.</p>
        <p>Table 3 shows the label-level results for BERTinchamps and DrBERT. The F<sub>1</sub>-score is reported
together with the support for each label. The mismatch between the support counts is due to the
use of different tokenizers in the two models, as the labels are put on the tokens that compose
each word. Interestingly, while Table 2 tends to indicate that DrBERT provides better accuracy
than BERTinchamps, Table 3 shows that BERTinchamps outperforms DrBERT on the MEDLINE
dataset on every label but the 0 label, which is the default label of words without annotations.
Moreover, BERTinchamps outperforms DrBERT on 6 out of 11 labels of the EMEA dataset. This
tends to show that BERTinchamps and DrBERT share similar performance on the considered
tasks.</p>
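        <p>A label-level score with its support can be computed as in the following sketch, which may help to see why overall accuracy and per-label results can rank two models differently when the default label dominates. The label sequences are invented for illustration and are not taken from the benchmark; the function name is hypothetical.</p>
        <preformatted><![CDATA[
```python
from collections import Counter


def per_label_f1(true_labels, predicted_labels):
    """Return {label: {"f1": ..., "support": ...}} for token-level labels.

    Only labels that occur in `true_labels` are reported; the support is
    the number of tokens that truly carry the label.
    """
    tp, fp, fn, support = Counter(), Counter(), Counter(), Counter()
    for truth, prediction in zip(true_labels, predicted_labels):
        support[truth] += 1
        if truth == prediction:
            tp[truth] += 1
        else:
            fp[prediction] += 1   # wrongly predicted label
            fn[truth] += 1        # missed true label
    report = {}
    for label in sorted(support):
        denominator = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denominator if denominator else 0.0
        report[label] = {"f1": f1, "support": support[label]}
    return report


# Invented token-level labels, for illustration only.
truth = ["0", "0", "DISO", "DISO", "CHEM"]
predicted = ["0", "DISO", "DISO", "DISO", "0"]
report = per_label_f1(truth, predicted)
accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
```
]]></preformatted>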
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Discussion</title>
        <p>The results of Section 4.1 on the FLUE benchmark provide evidence that the Cramming recipe
can be transposed to the French language, even though it was originally designed for English.
The BERTinchamps crammed model performs well compared to state-of-the-art LLMs, even
though it was only trained on a single NVIDIA A100 GPU for 24 hours, which amounts to
4.1 exaFLOP. In comparison, CamemBERT was trained on 256 NVIDIA V100 GPUs for 24
hours, for a total of 100 exaFLOP, while FlauBERT was trained on 32 NVIDIA V100 GPUs
for 410 hours, for a total of 210 exaFLOP.</p>
        <p>In addition, Section 4.2 indicates that the BERTinchamps model is promising on
medical-related tasks, competing with a specialized LLM like DrBERT on an experimental setup derived
from the QUAEROFrenchMed benchmark. The training of DrBERT required 128 NVIDIA
V100 GPUs for 20 hours, for a total of 41 exaFLOP. In summary, given the overall performance
of BERTinchamps together with its low training cost, the Cramming recipe is a highly promising
path to the local training of LLMs directly inside hospitals.</p>
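        <p>The training budgets quoted in this discussion can be put side by side with a few lines of arithmetic. The sketch below only reuses the GPU counts, durations, and exaFLOP totals stated above; the dictionary layout and function name are hypothetical. Since GPU-hours mix A100 and V100 hardware, the exaFLOP ratio is likely the more meaningful of the two comparisons.</p>
        <preformatted><![CDATA[
```python
# Figures quoted in the discussion above.
TRAINING_RUNS = {
    "BERTinchamps": {"gpus": 1, "hours": 24, "exaflop": 4.1},
    "CamemBERT": {"gpus": 256, "hours": 24, "exaflop": 100},
    "FlauBERT": {"gpus": 32, "hours": 410, "exaflop": 210},
    "DrBERT": {"gpus": 128, "hours": 20, "exaflop": 41},
}


def compare_to(reference, runs):
    """Express each run's budget relative to the reference model."""
    ref = runs[reference]
    ref_gpu_hours = ref["gpus"] * ref["hours"]
    rows = {}
    for name, run in runs.items():
        gpu_hours = run["gpus"] * run["hours"]
        rows[name] = {
            "gpu_hours": gpu_hours,
            "gpu_hours_ratio": gpu_hours / ref_gpu_hours,
            "exaflop_ratio": run["exaflop"] / ref["exaflop"],
        }
    return rows


rows = compare_to("BERTinchamps", TRAINING_RUNS)
for name, row in rows.items():
    print(f"{name}: {row['gpu_hours']} GPU-hours, "
          f"{row['exaflop_ratio']:.1f}x the exaFLOP of BERTinchamps")
```
]]></preformatted>
        <p>On these figures, the three reference models used roughly 10 to 50 times the compute of BERTinchamps.</p>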
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper contains two significant findings. Firstly, the Cramming recipe that was originally
designed for English can be applied to the French language, with comparable effectiveness,
resulting in the BERTinchamps model. This finding also suggests that Cramming is likely to be
useful in other languages. Secondly, BERTinchamps can be fine-tuned to tasks that are related
to the French medical language. Taken together, these two findings suggest that the self-hosted
training of LLMs from scratch is within the reach of French-speaking institutions handling
sensitive data, including hospitals. By accelerating the training of LLMs by an order of
magnitude, the Cramming recipe has the potential to strongly reduce the cost and complexity
of the infrastructure needed to train LLMs from scratch in different languages. This not only opens the
door to numerous applications within the realm of medical NLP, but also allows the creation of
LLMs that are tailored to the very specific needs of the institutions where they are deployed.</p>
      <p>Future work will consist in demonstrating the feasibility of training crammed models inside a
hospital, for a real-world clinical task such as the automated codification of the EHRs managed
by the hospital. Another promising research path will consist in leveraging federated learning
for the collaborative training of one crammed model that is shared by a coalition of hospitals.</p>
      <p>Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7203–7219. URL: https://aclanthology.org/2020.acl-main.645. doi:10.18653/v1/2020.acl-main.645.</p>
      <p>[22] J. Geiping, T. Goldstein, Cramming: Training a Language Model on a Single GPU in One Day, 2022. doi:10.48550/arXiv.2212.14034.</p>
      <p>[23] G. Hinton, O. Vinyals, J. Dean, Distilling the Knowledge in a Neural Network, 2015. doi:10.48550/arXiv.1503.02531.</p>
      <p>[24] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, 2020. doi:10.48550/arXiv.1910.01108.</p>
      <p>[25] T. Li, Y. E. Mesbahi, I. Kobyzev, A. Rashid, A. Mahmud, N. Anchuri, H. Hajimolahoseini, Y. Liu, M. Rezagholizadeh, A Short Study on Compressing Decoder-Based Language Models, 2021.</p>
      <p>[26] E. Fiesler, A. Choudry, H. J. Caulfield, Weight discretization paradigm for optical neural networks, in: H. Bartelt (Ed.), SPIE Proceedings, volume 1281 of Optical Interconnections and Networks, SPIE, The Hague, Netherlands, 1990, pp. 164–173. doi:10.1117/12.20700.</p>
      <p>[27] M. Courbariaux, Y. Bengio, J.-P. David, BinaryConnect: Training Deep Neural Networks with binary weights during propagations, 2016. doi:10.48550/arXiv.1511.00363.</p>
      <p>[28] Y. Bondarenko, M. Nagel, T. Blankevoort, Understanding and Overcoming the Challenges of Efficient Transformer Quantization, 2021. doi:10.48550/arXiv.2109.12948.</p>
      <p>[29] O. Zafrir, G. Boudoukh, P. Izsak, M. Wasserblat, Q8bert: Quantized 8bit bert, in: 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing NeurIPS Edition (EMC2-NIPS), IEEE, Vancouver, Canada, 2019, pp. 36–39. doi:10.1109/EMC2-NIPS53020.2019.00016.</p>
      <p>[30] H. Bai, W. Zhang, L. Hou, L. Shang, J. Jin, X. Jiang, Q. Liu, M. Lyu, I. King, BinaryBERT: Pushing the Limit of BERT Quantization, 2021. doi:10.48550/arXiv.2012.15701.</p>
      <p>[31] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient Finetuning of Quantized LLMs, 2023.</p>
      <p>[32] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling Laws for Neural Language Models, 2020. doi:10.48550/arXiv.2001.08361.</p>
      <p>[33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., Los Angeles, USA, 2017, pp. 5998–6008.</p>
      <p>[34] P. J. Ortiz Suárez, L. Romary, B. Sagot, A monolingual approach to contextualized word embeddings for mid-resource languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 1703–1714. URL: https://aclanthology.org/2020.acl-main.156. doi:10.18653/v1/2020.acl-main.156.</p>
      <p>[35] M. Schuster, K. Nakajima, Japanese and korean voice search, in: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, Kyoto, Japan, 2012, pp. 5149–5152.</p>
      <p>[36] I. Loshchilov, F. Hutter, Decoupled Weight Decay Regularization, 2019. doi:10.48550/arXiv.1711.05101.</p>
      <p>[37] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 2017. doi:10.48550/arXiv.1412.6980.</p>
      <p>[38] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé, L. Besacier, D. Schwab, FlauBERT: Unsupervised Language Model Pre-training for French, 2020. doi:10.48550/arXiv.1912.05372.</p>
      <p>[39] A. Névéol, C. Grouin, J. Leixa, S. Rosset, P. Zweigenbaum, The QUAERO French medical corpus: A ressource for medical entity recognition and normalization, in: Proc of BioTextMining Work, ELRA, Reykjavik, Iceland, 2014, pp. 24–30.</p>
      <p>[40] M. Honnibal, I. Montani, spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing, 2017. To appear.</p>
      <p>[41] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Vancouver, Canada, 2019, pp. 8024–8035.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bassignana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramponi</surname>
          </string-name>
          ,
          <article-title>Preface to the Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI)</article-title>
          , in:
          <source>Proceedings of the Seventh Workshop on Natural Language for Artificial Intelligence (NL4AI 2023) co-located with 22th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kleesiek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Egger</surname>
          </string-name>
          ,
          <article-title>ChatGPT in Healthcare: A Taxonomy and Systematic Review</article-title>
          ,
          <year>2023</year>
          . doi:10.1101/2023.03.30.23287899.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>Natural Language Processing for Smart Healthcare</article-title>
          ,
          <source>IEEE Reviews in Biomedical Engineering</source>
          abs/2110.15803 (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          . doi:10.1109/RBME.2022.3210270.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Pronovost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Hughes</surname>
          </string-name>
          ,
          <article-title>Remote Patient Monitoring During COVID-19: An Unexpected Patient Safety Benefit</article-title>
          ,
          <source>JAMA</source>
          <volume>327</volume>
          (
          <year>2022</year>
          )
          <fpage>1125</fpage>
          -
          <lpage>1126</lpage>
          . doi: 10.1001/jama.2022.2040.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Construction of a semi-automatic ICD-10 coding system</article-title>
          ,
          <source>BMC Medical Informatics and Decision Making</source>
          <volume>20</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi: 10.1186/s12911-020-1085-4.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Watzlaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Garvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Moeini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Anania-Firouzan</surname>
          </string-name>
          ,
          <article-title>The Effectiveness of ICD-10-CM in Capturing Public Health Diseases</article-title>
          ,
          <source>Perspectives in Health Information Management / AHIMA, American Health Information Management Association</source>
          <volume>4</volume>
          (
          <year>2007</year>
          )
          6
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Poongodi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sumathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Suresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Balusamy</surname>
          </string-name>
          ,
          <article-title>Deep Learning Techniques for Electronic Health Record (EHR) Analysis</article-title>
          , in:
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Bhoi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Mallick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. E.</given-names>
            <surname>Balas</surname>
          </string-name>
          (Eds.),
          <source>Bio-inspired Neurocomputing</source>
          ,
          <series>Studies in Computational Intelligence</series>
          , Springer, Singapore,
          <year>2021</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>103</lpage>
          . doi: 10.1007/978-981-15-5495-7_5.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Khurana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khatter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Natural language processing: state of the art, current trends and challenges</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>82</volume>
          (
          <year>2023</year>
          )
          <fpage>3713</fpage>
          -
          <lpage>3744</lpage>
          . doi: 10.1007/s11042-022-13428-4.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] OpenAI,
          <article-title>GPT-4 Technical Report</article-title>
          ,
          <year>2023</year>
          . doi: 10.48550/arXiv.2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <article-title>Role of Chat GPT in Public Health</article-title>
          ,
          <source>Annals of Biomedical Engineering</source>
          <volume>51</volume>
          (
          <year>2023</year>
          )
          <fpage>868</fpage>
          -
          <lpage>869</lpage>
          . doi: 10.1007/s10439-023-03172-7.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <year>2019</year>
          . doi: 10.48550/arXiv.1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog</source>
          <volume>1</volume>
          (
          <year>2019</year>
          ) 9
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          , et al.,
          <article-title>BLOOM: A 176B-Parameter Open-Access Multilingual Language Model</article-title>
          ,
          <year>2023</year>
          . doi: 10.48550/arXiv.2211.05100.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dey</surname>
          </string-name>
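          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gosal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,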
          <string-name>
            <given-names>H.</given-names>
            <surname>Khachane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Marshall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pathria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hestness</surname>
          </string-name>
          ,
          <article-title>Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster</article-title>
          ,
          <year>2023</year>
          . doi: 10.48550/arXiv.2304.03208.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <article-title>LLaMA: Open and Efficient Foundation Language Models</article-title>
          ,
          <year>2023</year>
          . doi: 10.48550/arXiv.2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>36</volume>
          (
          <year>2019</year>
          )
          <fpage>1234</fpage>
          -
          <lpage>1240</lpage>
          . doi: 10.1093/bioinformatics/btz682.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Altosaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranganath</surname>
          </string-name>
          ,
          <article-title>ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission</article-title>
          ,
          <year>2020</year>
          . doi: 10.48550/arXiv.1904.05342.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A. E. W.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Pollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-w. H.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghassemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Moody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szolovits</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Anthony</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <article-title>MIMIC-III, a freely accessible critical care database</article-title>
          ,
          <source>Scientific Data</source>
          <volume>3</volume>
          (
          <year>2016</year>
          ) 160035. doi: 10.1038/sdata.2016.35.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Labrak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bazoge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dufour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rouvier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Morin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Daille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-A.</given-names>
            <surname>Gourraud</surname>
          </string-name>
          ,
          <article-title>DrBERT: A Robust Pre-trained Model in French for Biomedical and Clinical domains</article-title>
          ,
          <year>2023</year>
          . doi: 10.48550/arXiv.2304.00958.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          ,
          <year>2019</year>
          . doi: 10.48550/arXiv.1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Muller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J. Ortiz</given-names>
            <surname>Suárez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dupont</surname>
          </string-name>
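          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Romary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>É.</given-names>
            <surname>de la Clergerie</surname>
          </string-name>
          ,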
          <string-name>
            <given-names>D.</given-names>
            <surname>Seddah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sagot</surname>
          </string-name>
          ,
          <article-title>CamemBERT: a tasty French language model</article-title>
          ,
          in:
          <source>Proceedings of the 58th Annual</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>