<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2025 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comparing CRF vs BERT Models for Named Entity Recognition and Relation Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Pamio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <addr-line>Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents our participation in the CLEF 2025 GutBrainIE challenge, addressing tasks in Named Entity Recognition (NER) and Relation Extraction (RE) on biomedical texts related to the gut-brain axis. We explored both traditional and modern approaches, including Conditional Random Fields (CRFs) with hand-engineered features and fine-tuned BERT-based models. For RE, we focused on a simplified pipeline using BiomedBERT, coupled with NER outputs to extract binary and ternary relations. Our experiments revealed the limitations of CRFs in this domain and highlighted the variability and sensitivity of BERT-based models to training stability and dataset noise. While our NER performance was mid-ranked, we achieved competitive results in RE, particularly in ternary tag-based extraction. We also reflect on the effects of model selection, loss function design, and data configurations, offering insights for future work in biomedical IE.</p>
      </abstract>
      <kwd-group>
        <kwd>CRF model</kwd>
        <kwd>BERT model</kwd>
        <kwd>Fine tuning</kwd>
        <kwd>NER</kwd>
        <kwd>RE</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Named Entity Recognition (NER) is a Natural Language Processing task whose objective is to classify
entities within text. Early approaches to this task relied on rule-based systems and feature
engineering, often using models such as Conditional Random Fields (CRFs) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. With the rise of deep learning, neural network architectures
have become dominant. More specifically, transformer-based models such as BERT [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have substantially improved
performance on this task. Relation Extraction (RE) similarly focuses on identifying relationships between
entities. Recent advances in
deep learning have significantly improved performance on this task as well. In particular, models such as
BiomedBERT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], developed by Microsoft, have contributed substantially to progress in the biomedical
domain. The work presented in this paper builds upon the foundation established by BiomedBERT,
which serves as a core component of our approach.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We now briefly define the NER and RE subtasks addressed in the challenge.</p>
      <sec id="sec-3-1">
        <title>3.1. Named Entity Recognition</title>
        <p>Given a set L of labels, an ordered sequence T of tokens of size n, and a set F of functions that can map
a token sequence into a label sequence, we formally define the problem of NER as:</p>
        <p>f : T → L, where f((t_1, ..., t_n)) = (l_1, ..., l_n), l_i ∈ L (1)</p>
        <p>The objective of this task is to assign a label to each token in a given token sequence, minimizing the overall
loss with respect to a known ground truth. This process should ideally consider not only individual
tokens but the entire token sequence T for context-aware predictions.</p>
        <p>ℒ(f) = Σ_{t ∈ T} loss(f(t), y(t))</p>
        <p>f* = argmin_{f ∈ F} ℒ(f)</p>
        <p>where y(t) denotes the ground-truth label of token t.</p>
        <p>The loss can be defined in various ways. Ideally, it could be expressed as the negative of a reward
function, allowing us to optimize for the function that yields the best overall performance.</p>
        <p>The ideal approach to the task assumes that the loss can be computed efficiently for a given label
set L. However, in real-world scenarios, this loss function is often not directly computable due to the
inherent ambiguity in assigning a token to a specific label, as well as the subjective nature of human
annotation, which may label the same entity differently. In practice, applying this approach requires
defining the initial token set T as a sequence of tokens that, when combined, reconstruct the document.
Similarly, the label domain L is constrained by the task’s scope, and the number of labels is limited to a
finite, positive integer.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Relation Extraction</title>
        <p>The RE task is given a set of entities E, each with the attributes text span, position, and label, and a set
of predicates P; the objective of the task is to identify the relations that hold between entities, with
p ∈ P, e_1, e_2 ∈ E, l_1, l_2 ∈ L.
The labels l_1, l_2 are defined as the labels associated with the entities e_1, e_2 respectively. The task can be
specified in different ways depending on the subtask. In subtask 6.2.1, the objective is to determine
whether a relation exists between the two given labels. Subtask 6.2.2 extends this requirement by also
identifying the specific predicate that characterizes the relation. Subtask 6.2.3 further requires the
extraction of the text spans corresponding to the related entities, in addition to identifying the predicate.</p>
        <p>In the following formulas, R_bin refers to subtask 6.2.1 about binary tag-based RE, R_tt refers
to subtask 6.2.2 about ternary tag-based RE, and finally R_tm refers to subtask 6.2.3 about ternary
mention-based RE:</p>
        <p>R_bin = {(l_1, l_2) | e_1, e_2 ∈ E, if a relation exists between e_1, e_2} (2)</p>
        <p>R_tt = {(l_1, p, l_2) | e_1, e_2 ∈ E, p ∈ P, if a relation exists between e_1, e_2} (3)</p>
        <p>R_tm = {(s_1, p, s_2, l_1, l_2) | e_1, e_2 ∈ E, p ∈ P, if a relation exists between e_1, e_2} (4)</p>
        <p>In the equation defining R_tm, the spans s_1, s_2 are defined as the spans in the text associated with
the labels and entities l_1, l_2 and e_1, e_2.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. CRF model</title>
        <p>
          We began by developing a model based on Conditional Random Fields (CRFs), aiming to build it from
scratch and evaluate its performance in the specific domain of the GutBrainIE task. CRF models are
statistical modeling methods that incorporate contextual information, making them well-suited for
sequence labeling tasks like NER [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. CRFs rely heavily on feature design and transition probabilities.
Since the predictions are derived from input features, feature engineering plays a crucial role in
determining the model’s capabilities [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. For this challenge, we designed a custom feature set tailored
to the biomedical domain and the structure of the provided texts.
        </p>
        <p>
          The CRF model was modified in different ways from the default configuration, and its hyperparameters
were tuned to obtain different performance profiles. The trained models were based
on the package sklearn_crfsuite2, which provides different training algorithms such as lbfgs [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], l2sgd [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
ap [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], pa [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and arow [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Among these algorithms, the best performance was obtained with the
lbfgs method, which was therefore chosen for integration in the final model.
        </p>
        <p>
          In addition to the training algorithm, several important parameters were tuned to control the model’s
regularization behavior and feature handling (see the configuration sketch after this list):
• c1, the coefficient responsible for L1 (LASSO) regularization [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
• c2, the coefficient responsible for L2 (ridge) regularization [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
• all_possible_transitions, a boolean controlling whether transitions not present
in the training dataset are also evaluated
• min_freq, the minimum frequency with which a feature needs to occur
in order to be taken into account by the model
        </p>
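        <p>As an illustration, a minimal sketch of this configuration using sklearn_crfsuite is shown below; the
feature dictionaries, label names, and hyperparameter values are illustrative placeholders, not the exact
ones used in our runs.</p>
        <preformat>
import sklearn_crfsuite

# Toy training data: one sentence, one feature dict per token, BIO-style labels.
# Feature keys and label names here are illustrative placeholders.
X_train = [
    [
        {"word.lower()": "lactobacillus", "word.istitle()": True, "postag": "NNP"},
        {"word.lower()": "reduces", "word.istitle()": False, "postag": "VBZ"},
        {"word.lower()": "anxiety", "word.istitle()": False, "postag": "NN"},
    ]
]
y_train = [["B-Entity", "O", "B-Entity"]]

# CRF configured as described above: lbfgs training, L1/L2 penalties (c1/c2),
# all possible transitions considered, and no minimum feature frequency.
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    c1=0.1,                         # L1 (LASSO) regularization strength
    c2=0.1,                         # L2 (ridge) regularization strength
    min_freq=0,                     # keep every feature seen in training
    all_possible_transitions=True,
    max_iterations=100,
)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
        </preformat>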
        <p>
          The feature engineering applied to these CRF models involves a standard set of features used to label
tokens and extract relations. The core idea behind feature engineering is to process an entire document
token by token, extracting specific features for each token as well as information about its surrounding
context [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. In the specific case of this challenge, to represent the current token, we used the following
information (a sketch of the corresponding feature extractor follows this list):
1. ’word.lower()’: the lowercase representation of the token
2. ’word[-3:]’: the last 3 characters of the token
3. ’word[-2:]’: the last 2 characters of the token
4. ’word.isupper()’: whether the token is uppercase
5. ’word.istitle()’: whether the token is in title case
6. ’word.hasCapital()’: whether the token contains a capital letter
7. ’word.isdigit()’: whether the token is a digit
8. ’word.isGene()’: a custom implementation, whether the token is a scientific representation of a gene
9. ’postag’: the POS tag of the token
10. ’postag[:2]’: the first 2 characters of the POS tag
11. ’word.length()’: the length of the token
12. ’word.pos()’: the position of the token in the phrase
We also incorporated, whenever possible, features derived from the preceding and following tokens to
enrich the representation of the current token. These contextual features consist of a subset of those
used for the current token itself, specifically features 1, 4, 5, 9, and 10, i.e., word.lower, word.isupper,
word.istitle, postag, and postag[:2].
2https://sklearn-crfsuite.readthedocs.io/en/latest/
        </p>
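        <p>A minimal sketch of such a feature extractor is shown below. The implementation of word.isGene()
is an assumption on our part, approximated here with a simple pattern, since the exact rule is not
reproduced in this paper.</p>
        <preformat>
import re

def word2features(sent, i):
    """Feature dict for token i of a POS-tagged sentence [(word, postag), ...]."""
    word, postag = sent[i]
    features = {
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.hasCapital()": any(c.isupper() for c in word),
        "word.isdigit()": word.isdigit(),
        # Assumption: gene-like symbols such as "BDNF1"; the exact rule differs.
        "word.isGene()": bool(re.fullmatch(r"[A-Za-z]{2,5}[0-9]*", word)),
        "postag": postag,
        "postag[:2]": postag[:2],
        "word.length()": len(word),
        "word.pos()": i,
    }
    # Contextual features from neighbouring tokens (subset: features 1, 4, 5, 9, 10).
    for offset, prefix in [(-1, "-1:"), (1, "+1:")]:
        j = i + offset
        if j in range(len(sent)):
            w, p = sent[j]
            features.update({
                prefix + "word.lower()": w.lower(),
                prefix + "word.isupper()": w.isupper(),
                prefix + "word.istitle()": w.istitle(),
                prefix + "postag": p,
                prefix + "postag[:2]": p[:2],
            })
        else:
            features[prefix + "BOS/EOS"] = True
    return features
        </preformat>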
      </sec>
      <sec id="sec-3-4">
        <title>3.4. BERT models</title>
        <p>In addition to CRF models, we also adopted an approach based on fine-tuning pre-trained models. This
fine-tuning process aimed to improve the base performance of various models available through the
HuggingFace library3. Several types of models were considered, and each was specifically trained to
achieve the best possible performance within the subtasks’ constraints. The models we fine-tuned and
subsequently submitted to the challenge were:
• scibert-scivocab-uncased4
• biobert-base-cased-v1.25
• BiomedNLP-BiomedBERT-base-uncased-abstract6
• biosyn-sapbert-bc2gn7
• NuNER-v2.08
All of them (except NuNER-v2.0) were specifically pre-trained on scientific and/or bio-related corpora,
which enhanced their performance in our specific domain.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>The datasets provided by the competition organizers are composed of entities and relationships between
them inside the titles and abstracts of PubMed articles.</p>
        <p>Regarding the challenge, the provided datasets include:
• Entity Mentions: Text spans classified into predefined categories.
• Relations: Associations between entities, specifying that a particular relationship holds between
two entities.</p>
        <p>In the specific instance of the GutBrainIE challenge, the corpus of documents was annotated in
different ways:
• Platinum collection: highest-quality annotations, expert-curated and reviewed by external
biomedical specialists.
• Gold collection: high-quality annotations, expert-curated.
• Silver collection: mid-quality annotations, created by trained students under expert supervision.
• Bronze collection: automatically generated annotations.
• Dev collection: used as test set.</p>
        <p>Working on the 6.1 subtask about NER, our setup was split into two main pipelines: a CRF model
(Section 3.3) trained from scratch and a pipeline to fine-tune BERT models (Section 3.4).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Named Entity Recognition</title>
        <p>The setup used in this subtask is mainly related to the hyperparameters of the models themselves. We
also tweaked the domain and format of the training set used for the task, although the main focus in
this part of the challenge was placed more on the models than on data processing.
3https://huggingface.co/docs/huggingface_hub/guides/overview
4https://huggingface.co/allenai/scibert_scivocab_uncased
5https://huggingface.co/dmis-lab/biobert-base-cased-v1.2
6https://huggingface.co/microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract
7https://huggingface.co/dmis-lab/biosyn-sapbert-bc2gn
8https://huggingface.co/numind/NuNER-v2.0</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. CRF models</title>
          <p>The main differences between this setup and the overall models produced with CRF are shown in
Table 2. We mainly adjusted values associated with regularization functions, specifically L1 (c1_value)
and L2 (c2_value). The min_freq parameter was kept at 0 to ensure that every feature present in the
training dataset was captured. We also varied the amount and type of data used for training.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. BERT models</title>
          <p>
            BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model
pretrained on large corpora using a masked language modeling objective [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. Its success in a wide range of
NLP tasks has made it a natural choice for sequence classification and token-level prediction tasks. A
key concern with BERT models and the training pipeline was the stability of the process. Indeed, in
some training iterations, the loss function fluctuated significantly, leading to considerable variation in
the results. To address this and improve stability, we adjusted the unstable models’ hyperparameters.
          </p>
          <p>We also decided to use only one model implementing the CustomWeight loss function, as most of
the domain-specific scientific or biomedical models did not yield the performance improvements we
had hoped for.</p>
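          <p>The exact formulation of the CustomWeight loss is not reproduced here; as an illustration only, one
plausible variant is a class-weighted cross-entropy over token labels, sketched below with arbitrary
weights.</p>
          <preformat>
import torch
import torch.nn as nn

# Illustrative class-weighted token loss; the actual CustomWeight function may differ.
num_labels = 3
class_weights = torch.tensor([0.1, 1.0, 1.0])  # e.g., down-weight the frequent "O" class

loss_fn = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

logits = torch.randn(2, 8, num_labels)           # (batch, seq_len, num_labels)
gold = torch.randint(0, num_labels, (2, 8))      # toy gold label ids
loss = loss_fn(logits.view(-1, num_labels), gold.view(-1))
print(loss.item())
          </preformat>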
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Relation Extraction</title>
        <p>For the RE subtasks, we relied heavily on a single model, the
BiomedNLP-BiomedBERT-base-uncased-abstract model, focusing on optimizing one single model for all the RE subtasks. To extract
relations from the text, the RE model had to be paired with a NER model capable of identifying the
entities to be used in the subsequent steps.</p>
        <p>We decided to experiment with the following fine-tuned NER models 9:
• biosyn-sapbert-bc2gn-1210
• scibert-27 11
• NuNerv2.0-22-CW-xtreme12
The biosyn-sapbert-bc2gn-12 model was chosen because it was expected to have the best theoretical
performance due to its scientific and bio-related pre-training.</p>
        <p>The scibert-27 model was chosen because the 47-epoch version appeared to have overfitted on some
of the data.</p>
        <p>The NuNerv2.0-22-CW-xtreme model was chosen because it had the most generic domain training
background, it had the best performance on unseen data, and it relied on our CustomWeight
loss function.</p>
        <p>During the development of these RE models, we defined a metric that was used as the main varying
parameter, called norel_ratio:</p>
        <p>norel_ratio = |N| / |R| (5)</p>
        <p>where N is a set of relations labeled as negative, denoting a non-existing link between two
entities in the text, and R is the set of existing relations between entities in the text. In the specific
instance of this study, we always used the entirety of the positive relation instances as a starting
point to compute the set N of non-existing relations. To create the set N of negative instances, we
used a random approach, extracting and inserting into this set relationships that did not exist between
random entities. These models were trained with 3 iterations of the BiomedBERT RE
model, where the norel_ratio was tweaked, ranging from 1 to 3.</p>
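        <p>A minimal sketch of this random negative sampling, under the assumption that candidate negatives
are drawn uniformly from unrelated entity pairs, is shown below; names and data are illustrative.</p>
        <preformat>
import random

def sample_negative_relations(entities, positive_pairs, norel_ratio, seed=0):
    """Sample entity pairs with no annotated relation, sized as
    norel_ratio times the positive set (a sketch, not the exact code)."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    candidates = [
        (e1, e2)
        for e1 in entities
        for e2 in entities
        if e1 != e2 and (e1, e2) not in positives
    ]
    k = min(len(candidates), int(norel_ratio * len(positives)))
    return rng.sample(candidates, k)

# Toy usage: 2 positive relations and norel_ratio = 2 give up to 4 negatives.
ents = ["probiotic", "anxiety", "microbiome", "inflammation"]
pos = [("probiotic", "anxiety"), ("microbiome", "inflammation")]
print(sample_negative_relations(ents, pos, norel_ratio=2))
        </preformat>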
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The total number of submitted runs was 37. Out of these 37, 10 were related to the first subtask about
NER (3.1), and the remaining 27 were distributed equally over the 3 RE (3.2) subtasks. As shown in
Table 5, the results on the NER subtask 6.1 show that BERT models had the best performance overall
(considering the micro-F1 score as the reference metric).</p>
      <p>The customCRF models, trained from scratch (see Section 3.3), did not perform as well as the other
approaches. Similarly, the Custom Weight scheme, which was applied to the BERT models through a
custom loss function and initially showed promising results during early evaluation, ultimately ranked
lower both in terms of average position and Micro F1-score when compared to other BERT-based
models. This result was expected, as we anticipated that the most general-purpose configuration would
yield the weakest performance among the BERT variants.</p>
      <p>Concerning the RE (3.2) subtasks, the average performances of the proposed models are similar. Analyzing
the models’ behaviors reported in Tables 6, 7, and 8, we can see that overall the best micro-F1 score was
obtained with models having a higher ratio of no_relation over effective relations in the training dataset.</p>
      <p>Even though the overall F1-score distribution was variable, it is worth noting that, in Task 6.2.2, some
models trained with a ratio of 1 achieved a high macro-F1 score. This indicates strong performance
across all relation classes, suggesting that these models were effective in distinguishing between different
types of relations.
9These models have been fine-tuned in the NER subtask
10Base model at https://huggingface.co/dmis-lab/biosyn-sapbert-bc2gn
11Base model at https://huggingface.co/allenai/scibert_scivocab_uncased
12Base model at https://huggingface.co/numind/NuNER-v2.0</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>Our participation in this task showed that for the NER subtasks, although we explored different
approaches, our results were not among the top performers. However, the trend was different for the
RE subtasks. We achieved satisfying results in subtask 6.2.2, and overall, our performances in the 6.2
subtasks were better than in subtask 6.1; these results are summarized in Table 9.</p>
      <p>
        Promising directions for future work include the evaluation of larger models and the performance gains they
may bring in this specific domain. Additionally, we aim to investigate the optimal no_rel ratio and
how changes to this parameter affect model performance, clarifying whether this value has a generally
applicable threshold or if it is domain-dependent. In addition, we aim to integrate a semantic perspective
grounded in linguistic analysis to enrich the linguistic and conceptual interpretation of extracted terms
and relations. Specifically, we would like to apply semic analysis, which decomposes terms into minimal
semantic units, as a structured approach to uncovering the internal organization of meaning in medical
terminology [
        <xref ref-type="bibr" rid="ref13">13, 14</xref>
        ]. Incorporating this technique may enhance our ability to align terminological
outputs with underlying conceptual structures, improving not only model interpretability but also the
precision of the extraction of named entities and objects in domain-specific biomedical contexts.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work is partially supported by the HEREDITARY Project, as a part of the European Union’s Horizon
Europe research and innovation programme under grant agreement No GA 101137074.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rodríguez-Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodriguez-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Loukachevitch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakhovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tutubalina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          , G. Tsoumakas,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bekiaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          , G. Silvello, G. Paliouras,
          <article-title>Overview of BioASQ 2025: The thirteenth BioASQ challenge on large-scale biomedical semantic indexing and question answering</article-title>
          , volume TBA of
          <source>Lecture Notes in Computer Science</source>
          , Springer,
          <year>2025</year>
          , p. TBA.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Martinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Irrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          , Overview of GutBrainIE@CLEF 2025:
          <article-title>Gut-Brain Interplay Information Extraction</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.),
          <source>CLEF 2025 Working Notes</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. C. N.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          ,
          <source>in: Proceedings of the Eighteenth International Conference on Machine Learning</source>
          , ICML '
          <fpage>01</fpage>
          , Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>2001</year>
          , pp.
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          , H. Cheng, M. Lucas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>Domain-specific language model pretraining for biomedical natural language processing</article-title>
          , CoRR abs/2007.15779 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2007.15779. arXiv:2007.15779.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nocedal</surname>
          </string-name>
          ,
          <article-title>On the limited memory bfgs method for large scale optimization</article-title>
          ,
          <source>Mathematical programming 45</source>
          (
          <year>1989</year>
          )
          <fpage>503</fpage>
          -
          <lpage>528</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          , Stochastic Gradient Descent Tricks, Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2012</year>
          , pp.
          <fpage>421</fpage>
          -
          <lpage>436</lpage>
          . URL: https://doi.org/10.1007/978-3-642-35289-8_25. doi:10.1007/978-3-642-35289-8_25.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <article-title>Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms</article-title>
          ,
          <source>in: Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP</source>
          <year>2002</year>
          ), Association for Computational Linguistics,
          <year>2002</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . URL: https://aclanthology.org/W02-1001/. doi:10.3115/1118693.1118694.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Crammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Dekel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Keshet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shalev-Shwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Singer</surname>
          </string-name>
          ,
          <article-title>Online passive-aggressive algorithms</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>7</volume>
          (
          <year>2006</year>
          )
          <fpage>551</fpage>
          -
          <lpage>585</lpage>
          . URL: http://jmlr.org/papers/v7/ crammer06a.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>K.</given-names>
            <surname>Crammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kulesza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <article-title>Adaptive regularization of weight vectors</article-title>
          , in: Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A. Culotta (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>22</volume>
          , Curran Associates, Inc.,
          <year>2009</year>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2009/file/8ebda540cbcc4d7336496819a46a1b68-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          ,
          <article-title>Regression shrinkage and selection via the lasso</article-title>
          ,
          <source>Journal of the Royal Statistical Society. Series B (Methodological) 58</source>
          (
          <year>1996</year>
          )
          <fpage>267</fpage>
          -
          <lpage>288</lpage>
          . URL: http://www.jstor.org/stable/2346178.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Hoerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. W.</given-names>
            <surname>Kennard</surname>
          </string-name>
          ,
          <article-title>Ridge regression: Biased estimation for nonorthogonal problems</article-title>
          ,
          <source>Technometrics</source>
          <volume>42</volume>
          (
          <year>2000</year>
          )
          <fpage>80</fpage>
          -
          <lpage>86</lpage>
          . URL: http://www.jstor.org/stable/1271436.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <article-title>A Novel Approach to Semic Analysis: Extraction of Atoms of Meaning to Study Polysemy and Polyreferentiality</article-title>
          ,
          <source>Languages</source>
          <volume>9</volume>
          (
          <year>2024</year>
          ) 121. URL: https://www.mdpi.com/2226-471X/9/4/121. doi:10.3390/languages9040121.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Bonato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vezzani</surname>
          </string-name>
          ,
          <article-title>Preliminary Considerations on a Systematic Approach to Semic Analysis: The Case Study of Medical Terminology</article-title>
          ,
          <source>Umanistica Digitale</source>
          (
          <year>2021</year>
          )
          <fpage>211</fpage>
          -
          <lpage>234</lpage>
          . URL: https://umanisticadigitale.unibo.it/article/view/12621. doi:10.6092/issn.2532-8816/12621.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>