<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LHS712Team-1 at eRisk@ CLEF 2025: Searching for Depression Symptoms Using Various Natural Language Processing Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aisha Benloucif</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashasvini Nannapuraju</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sripriya Bellam</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuyan Hu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhe Zhao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V.G.Vinod Vydiswaran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Learning Health Sciences, University of Michigan</institution>
          ,
          <addr-line>Ann Arbor, MI 48109</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Information, University of Michigan</institution>
          ,
          <addr-line>Ann Arbor, MI 48109</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Depression is one of the most common mental health problems worldwide, affecting more than 280 million people. Early recognition and intervention are crucial to prevent severe consequences. We participated in Task 1 of CLEF eRisk 2025, which focuses on ranking the relevance of user-generated content to symptoms defined in the Beck Depression Inventory (BDI). Using various machine learning models and natural language processing techniques, including logistic regression (LR), support vector machines (SVM), and BERT-based models, we aimed to advance early detection of depressive symptoms in online text and to offer new tools for early mental health prevention.</p>
      </abstract>
      <kwd-group>
        <kwd>Depression</kwd>
        <kwd>Early Detection</kwd>
        <kwd>Beck Depression Inventory (BDI)</kwd>
        <kwd>Natural Language Processing (NLP)</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Depression is a mental disorder that has a staggering impact globally, affecting more than 280 million
people, as noted by the World Health Organization [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Depression manifests as an array of
symptoms, such as low mood, loss of interest in activities (anhedonia), sleep disturbances, and changes
in appetite. If not identified and addressed promptly, depression can severely affect one’s ability
to learn, work, and perform other daily activities, and in extreme cases lead to self-harm or suicidal
actions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore, there is a critical need for accurate early recognition and intervention strategies
to advance mental health research and public health.
      </p>
      <p>
        The rise of social media and digital communication platforms has created an opportunity for mental
health researchers to analyze user-generated content in real time. Unlike traditional clinical assessments,
which are based on self-reports or structured interviews, online expressions such as tweets, Facebook
posts, and forum discussions provide researchers with unique insight into individuals’ emotional and
psychological states [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. These data are particularly valuable because they capture everyday contexts
in natural language, revealing subtle linguistic markers of distress, such as increased use of negative
emotion words, first-person pronouns, or increased ruminations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In recent years, a growing body of literature has explored the use of natural language processing
(NLP) techniques to assess mental health from online content. For example, Coppersmith et al. used NLP
methods to detect linguistic signals related to mental health from Twitter posts, showing how social
media text can help identify mental health concerns [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Machine learning (ML) has greatly advanced
the detection of mental health signals from text data. Traditional methods like Logistic Regression (LR)
used features such as n-grams and sentiment lexicons to classify text. Recently, pre-trained language
models (PLMs) like Bidirectional Encoder Representations from Transformer (BERT) have enhanced
NLP by capturing word context, allowing for better understanding of emotional and psychological
nuances. BERT’s ability to process large datasets and contextualize language has made sentiment
analysis more accurate [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        This paper summarizes our participation in Task 1 of CLEF eRisk 2025 [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], a task aimed at ranking
sentences from user writings on social media by their relevance to the 21 symptoms defined in the
Beck Depression Inventory (BDI). The challenge of this task is that some relevant sentences do not overtly
express negative emotions. By applying various ML techniques, we examined their effectiveness in
ranking sentences according to their relevance to depression symptoms. Our results may inform further
interdisciplinary work at the intersection of NLP and mental health.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 1: Search for Symptoms of Depression</title>
      <sec id="sec-2-1">
        <title>2.1. Task Description</title>
        <p>This task focuses on detecting depression signals in user writings on social media. Participants are
asked to build a system that ranks the provided sentences for each of the 21 symptoms of depression
from the Beck Depression Inventory II (BDI-II) questionnaire. A sentence is considered relevant if it
conveys information about the user’s condition with respect to a particular symptom.</p>
        <p>
          The official TREC-formatted annotated corpora from the eRisk shared tasks 2023 and 2024 were
provided [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ]. Each sentence in these data sets was annotated for relevance to one or more of the 21
symptoms listed in BDI-II. Two annotation schemes were provided: majority vote (agreement by at
least two annotators) and full consensus (unanimous agreement). This corpus involved 9,000 social
media users with a total of 17,553,441 sentences. The average number of words per sentence is 12.39.
        </p>
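        <p>As a minimal illustration of the two annotation schemes, the following Python sketch derives a majority label and a consensus label from three annotator votes per sentence-symptom pair. The vote tuples, sentence identifiers, and function names are hypothetical and chosen only for illustration; the provided corpus already ships with aggregated labels.</p>
        <preformat><![CDATA[
```python
def majority_label(votes):
    """Return 1 when at least two annotators marked the sentence relevant."""
    return 1 if sum(votes) >= 2 else 0


def consensus_label(votes):
    """Return 1 only when every annotator marked the sentence relevant."""
    return 1 if all(v == 1 for v in votes) else 0


# Hypothetical votes from three annotators for (sentence_id, symptom) pairs.
votes_by_pair = {
    ("s1", 1): (1, 1, 1),
    ("s2", 1): (1, 1, 0),
    ("s3", 1): (0, 1, 0),
}

# Each pair maps to (majority label, consensus label).
labels = {
    pair: (majority_label(v), consensus_label(v))
    for pair, v in votes_by_pair.items()
}
```
]]></preformat>
        <p>Every consensus-relevant sentence is also majority-relevant, so the consensus set is the smaller and stricter of the two.</p>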
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Methods</title>
        <p>
          In this study, a range of machine learning techniques was employed to analyze textual data and
perform binary classification of sentence relevance, from classical machine learning approaches
to transformer-based deep learning models. The experimental design consisted of
multiple run groups, each applying distinct vectorization and modeling strategies. Initial experiments
used logistic regression and support vector machine (SVM) classifiers paired with CountVectorizer or
TF-IDF vectorization. In subsequent experiments, richer feature representations,
such as ClinicalBERT and Sentence-BERT (SBERT) embeddings, were incorporated to better capture
semantic context. Models were evaluated using standard classification metrics, including precision,
recall, and F1-score, as well as computational efficiency. The final experiments explored a hybrid retrieval approach
that combined BM25 lexical matching [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] with semantic re-ranking using SBERT, aiming to improve
relevance ranking by leveraging both lexical overlap and contextual meaning.
        </p>
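        <p>To make the difference between the two classical vectorizations concrete, the sketch below computes raw term counts and TF-IDF weights for three invented sentences using only the Python standard library. The whitespace tokenization, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, and the L2 normalization are assumptions (one common formulation); they are not necessarily the exact settings used in our runs.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def count_vectors(docs):
    """Raw term counts per document (lowercased, whitespace tokenization)."""
    return [Counter(doc.lower().split()) for doc in docs]


def tfidf_vectors(docs):
    """Smoothed TF-IDF weights with L2 normalization per document."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = []
    for c in counts:
        weights = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Invented example sentences, not taken from the eRisk corpus.
docs = [
    "i feel sad and tired",
    "i feel fine today",
    "i cannot sleep and i feel tired",
]
counts = count_vectors(docs)
tfidf = tfidf_vectors(docs)
```
]]></preformat>
        <p>In the first sentence, "sad" occurs in only one document and therefore receives a larger TF-IDF weight than "feel", which occurs in all three, although both have a raw count of one.</p>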
        <p>During the training process, the datasets were merged and split into train and test sets in an 80/20
ratio. Relevance labels were binarized, assigning a value of 1 to relevant sentences and 0 to non-relevant
ones. No external datasets were used during training or evaluation.</p>
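        <p>The label binarization and 80/20 split described above can be sketched as follows. The row layout, the random seed, and the use of a plain (non-stratified) shuffle are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import random


def binarize(rel):
    """Map any positive relevance judgment to 1 and everything else to 0."""
    return 1 if rel > 0 else 0


def split_80_20(rows, seed=42):
    """Shuffle deterministically and split rows into 80% train and 20% test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical merged rows: (sentence_id, text, raw relevance judgment).
merged = [(f"s{i}", f"sentence {i}", i % 3) for i in range(10)]
labeled = [(sid, text, binarize(rel)) for sid, text, rel in merged]
train, test = split_80_20(labeled)
```
]]></preformat>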
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <sec id="sec-3-1">
        <title>3.1. Submitted Runs</title>
        <p>
          We submitted five runs in total in the requested TREC format: “result” (LR with CountVectorizer), “LR file
combined” (LR with TF-IDF representations), “SVM file combined” (SVM with TF-IDF representations),
“BERT CONSENSUS”, and “BERT MAJORITY”. Two evaluation schemes were used to determine whether
a sentence was correctly ranked for a symptom: majority-based (a sentence was deemed relevant
if at least two of the three assessors marked it so) and unanimity-based (a sentence was deemed relevant
only when all three assessors agreed) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
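        <p>For readers unfamiliar with ranking-based evaluation, the sketch below computes three widely used ranking metrics from a ranked list of binary relevance judgments. The choice of metrics (precision at k, average precision, and NDCG with binary gains) and the toy ranking are assumptions for illustration; they are not necessarily the exact set or implementation used by the task organizers. The same functions would be applied once with majority-based judgments and once with unanimity-based judgments.</p>
        <preformat><![CDATA[
```python
import math


def precision_at_k(ranked_rel, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(ranked_rel[:k]) / k


def average_precision(ranked_rel, n_relevant):
    """Mean of the precision values at each rank holding a relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / n_relevant if n_relevant else 0.0


def ndcg_at_k(ranked_rel, n_relevant, k):
    """Binary-gain NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(ranked_rel[:k], start=1))
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(n_relevant, k) + 1))
    return dcg / ideal if ideal else 0.0


# Toy ranking for one symptom: 1 = judged relevant, 0 = not relevant.
ranking = [1, 0, 1, 0, 0]
p_at_5 = precision_at_k(ranking, 5)
ap = average_precision(ranking, n_relevant=2)
ndcg = ndcg_at_k(ranking, n_relevant=2, k=5)
```
]]></preformat>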
        <p>
          In both results tables (Tables 1 and 2), the best-performing approaches involved fine-tuning the BERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
model. For each training instance, input was constructed in the format “[SYMPTOM] [SEP] sentence”,
allowing the model to learn the relationship between a depression symptom and a candidate sentence.
Separate models were fine-tuned for both majority-vote and consensus-labeled datasets to enable
comparative evaluation. The training was performed over 5 epochs, using a batch size of 16, a learning
rate of 2e-5, and the AdamW optimizer with weight decay. Model checkpoints were saved at the end
of each epoch. The fine-tuned model learned to classify the relevance of sentences to specific BDI-II
symptoms with high accuracy.
        </p>
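        <p>The input format and training settings described above can be summarized in a small sketch. The class and function names, the example symptom and sentence, and the example dataset size are hypothetical; only the number of epochs, the batch size, the learning rate, and the “[SYMPTOM] [SEP] sentence” layout come from the description above. In practice a tokenizer would typically insert the separator token itself when given the two texts as a pair; the literal string is built here only to mirror the stated format.</p>
        <preformat><![CDATA[
```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Training settings reported in the text."""
    epochs: int = 5
    batch_size: int = 16
    learning_rate: float = 2e-5


def build_input(symptom, sentence):
    """Join a symptom name and a candidate sentence with the [SEP] marker."""
    return f"{symptom} [SEP] {sentence}"


def total_steps(n_examples, cfg):
    """Optimizer steps: batches per epoch times the number of epochs."""
    return math.ceil(n_examples / cfg.batch_size) * cfg.epochs


cfg = FineTuneConfig()
# Hypothetical symptom-sentence pair.
text = build_input("Sadness", "I have been feeling down for weeks.")
# Hypothetical training-set size of 1,000 pairs.
steps = total_steps(1000, cfg)
```
]]></preformat>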
        <p>Given the scale of the eRisk 2025 test set (over 17 million unlabeled sentences), a symptom-aware
keyword filtering pipeline was developed to pre-filter candidate sentences. This pipeline used
symptom-specific keyword lists to reduce computational burden and improve inference speed. Filtered sentences
were paired with their corresponding symptom prompts and passed to the BERT model for scoring. For
each symptom, the top 1,000 sentences with the highest predicted relevance scores were selected and
formatted according to TREC submission standards. The BERT model fine-tuned on unanimity-based
labels achieved the highest scores on all five metrics under both ranking-based evaluation schemes.</p>
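        <p>A minimal sketch of this inference pipeline is given below: keyword pre-filtering, scoring, selection of the top-ranked sentences, and formatting as TREC-style run lines. The keyword list, sentences, run tag, and the stand-in scoring function are hypothetical; in the actual pipeline the score would come from the fine-tuned BERT model. The six-column line layout (query, Q0, document, rank, score, tag) is the conventional TREC run format and is assumed here.</p>
        <preformat><![CDATA[
```python
def keyword_filter(sentences, keywords):
    """Keep (sentence_id, text) pairs whose text mentions any keyword."""
    lowered = [k.lower() for k in keywords]
    return [(sid, text) for sid, text in sentences
            if any(k in text.lower() for k in lowered)]


def to_trec_lines(symptom_id, scored, run_tag, top_k=1000):
    """Sort by score (descending), keep the top_k, and emit TREC run lines."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
    return [f"{symptom_id} Q0 {sid} {rank} {score:.4f} {run_tag}"
            for rank, (sid, score) in enumerate(ranked, start=1)]


def stand_in_score(text):
    """Placeholder for the fine-tuned model's relevance probability."""
    return 0.9 if "sleep" in text.lower() else 0.7


# Hypothetical sentences and keywords for a sleep-related symptom.
sentences = [
    ("s1", "I can't sleep at night anymore"),
    ("s2", "Great game yesterday"),
    ("s3", "Insomnia is ruining my week"),
]
sleep_keywords = ["sleep", "insomnia"]

candidates = keyword_filter(sentences, sleep_keywords)
scored = [(sid, stand_in_score(text)) for sid, text in candidates]
lines = to_trec_lines(16, scored, run_tag="example_run", top_k=1000)
```
]]></preformat>
        <p>The filter trades recall for speed: a relevant sentence that uses none of the listed keywords is never scored, so the keyword lists bound what the ranker can retrieve.</p>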
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Unsubmitted Runs</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. SVM with ClinicalBERT Embeddings</title>
          <p>We also fitted an SVM classifier on embeddings from ClinicalBERT [10], which is pretrained on clinical
corpora. This approach achieved a slightly higher accuracy of 0.79 on the validation set, with strong
performance for non-relevant sentences (F1 = 0.85) but moderate effectiveness for relevant ones (F1
= 0.63). The macro F1-score was 0.74, and the weighted F1-score was 0.79. Although ClinicalBERT
provided richer semantic context, SVM with TF-IDF representations offered comparable results at
a significantly lower computational cost, making it a viable alternative for large-scale or
resource-constrained applications.</p>
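          <p>The relationship between the reported per-class, macro, and weighted F1-scores can be checked with a short calculation. The sketch below recomputes the macro F1 from the two per-class values and solves the weighted-F1 equation for the share of non-relevant sentences it implies. Because the inputs are rounded to two decimals, the implied share (roughly 73%) is only a rough estimate and not a figure reported in our experiments.</p>
          <preformat><![CDATA[
```python
# Values reported in the text for SVM with ClinicalBERT embeddings.
f1_non_relevant = 0.85
f1_relevant = 0.63
reported_weighted_f1 = 0.79

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = (f1_non_relevant + f1_relevant) / 2

# Weighted F1 = w * f1_non_relevant + (1 - w) * f1_relevant, where w is the
# share of non-relevant sentences. Solving for w gives the implied share.
implied_non_relevant_share = (
    (reported_weighted_f1 - f1_relevant) / (f1_non_relevant - f1_relevant)
)
```
]]></preformat>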
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. SVM/LR with SBERT</title>
          <p>In this experiment, Sentence-BERT (SBERT) [11] was adopted, offering semantically meaningful sentence
embeddings optimized for short text classification. Two classifiers – Support Vector Classifier (SVC) and
Logistic Regression (LR) [12, 13] – were trained using embeddings generated by SBERT. The training
data was derived from a merged 2023–2024 dataset, with sentence texts serving as input features and
the binary “relevance” label as the output. To ensure consistency, the same 80/20 train-validation split
from earlier experiments was used. Both SVC and LR models were trained on identical SBERT-based
features to facilitate a fair comparison.</p>
          <p>Despite SBERT’s semantic strength and computational efficiency, both classifiers achieved only
moderate performance. Accuracy for both models was about 77% on the validation set, with strong
precision and recall for the non-relevant class but lower metrics for the relevant class. These results
suggest that while SBERT captures contextual meaning effectively, further optimization is required,
particularly to address class imbalance and fine-tune model parameters.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. BM25 with SBERT</title>
          <p>A hybrid retrieval framework was implemented that combined traditional BM25 lexical retrieval with
semantic re-ranking using a pre-trained Sentence-BERT model. The two-step approach was designed
to leverage BM25’s strength in lexical term matching while addressing its limitations in semantic
understanding through contextualized embeddings. While BM25 performs effectively in retrieving
documents with overlapping vocabulary, it often fails to capture semantically related candidates phrased
differently, an area where embedding-based models demonstrate superior performance.</p>
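          <p>The two-step idea can be sketched with the standard library alone: a BM25 pass produces a lexical shortlist, which is then re-ordered by cosine similarity between query and sentence embeddings. The BM25 variant, the parameters k1 = 1.5 and b = 0.75, the shortlist size, and the toy two-dimensional vectors standing in for sentence-encoder output are all assumptions for illustration, not the configuration used in our experiments.</p>
          <preformat><![CDATA[
```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + length_norm)
        scores.append(s)
    return scores


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * c for a, c in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(c * c for c in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_rank(query, docs, embed, top_n=2):
    """Stage 1: shortlist top_n by BM25. Stage 2: re-rank by cosine."""
    tokenized = [d.lower().split() for d in docs]
    lexical = bm25_scores(query.lower().split(), tokenized)
    shortlist = sorted(range(len(docs)), key=lambda i: lexical[i],
                       reverse=True)[:top_n]
    q_vec = embed(query)
    return sorted(shortlist, key=lambda i: cosine(q_vec, embed(docs[i])),
                  reverse=True)


# Hypothetical 2-D vectors standing in for a sentence encoder's output.
toy_vectors = {
    "sleep problems": (1.0, 0.1),
    "sleep sleep sleep memes are the best": (0.1, 1.0),
    "i lie awake with sleep trouble every night": (0.9, 0.2),
    "nice weather today": (0.0, 1.0),
}
docs = list(toy_vectors)[1:]
order = hybrid_rank("sleep problems", docs,
                    embed=toy_vectors.__getitem__, top_n=2)
```
]]></preformat>
          <p>In this toy example BM25 places the sentence that repeats "sleep" first, and the re-ranking step moves the semantically closer sentence above it. The third sentence shares no query term and never reaches the re-ranker, which illustrates that the final ranking can only contain what the lexical stage retrieves.</p>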
          <p>The evaluation on the validation set demonstrated a weighted precision of 0.47, recall of 0.34, and an
F1-score of 0.39. BM25 retrieval alone produced strong recall by retrieving highly relevant documents,
whereas precision improved significantly after applying Sentence-BERT-based re-ranking. However,
precision (0.09) and recall (0.18) for Class 1 (relevant) sentences remained notably lower than for
Class 0 (non-relevant) sentences, indicating that while semantic matching improved, the consistent
identification of truly relevant examples remains a challenge.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Discussion</title>
        <p>The ranking evaluation demonstrates that fine-tuned BERT models consistently outperform traditional
classifiers and unsupervised methods in retrieving clinically relevant sentences. Notably, the BERT
models trained on unanimity-based labels achieve higher ranking scores across all metrics compared to
those trained on majority-based labels. This suggests that examples that are easier for human
experts to label, and on which they agree, yield clearer relevance signals, enabling the model to rank
pertinent sentences more effectively. Traditional machine learning classifiers such as logistic regression
and SVM perform close to baseline levels in ranking metrics, highlighting their limited ability to capture
nuanced clinical relevance for this task. The hybrid BM25 + SBERT approach improves lexical-semantic
matching but still struggles with precision and recall, underscoring the importance of deep contextual
understanding that transformer models like BERT provide. Overall, these results emphasize the value
of transformer-based models trained on consensus labels for achieving improved retrieval performance
in clinical sentence ranking tasks.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The results suggest that BERT outperformed traditional models such as logistic regression (LR) with
term-frequency or TF-IDF features, primarily due to its ability to understand contexts within language.
BERT’s superiority is likely attributable to its transformer-based architecture, which enables nuanced
comprehension beyond the keyword-level focus of term-frequency based features. Among the two
traditional vectorization methods, TF-IDF yielded better results than term-frequency, emphasizing the
importance of term weighting, though both approaches fell short in capturing semantic meaning.</p>
      <p>Despite these promising findings, the study had several limitations. A key issue was the imbalanced
labels: sentences labeled as “relevant” (Label 1) were underrepresented, negatively impacting the
performance of LR in particular. Future research could address this limitation through oversampling
techniques or contrastive learning methods to enhance minority-class representation. BERT, while
effective, also introduced scalability concerns due to its high computational cost, which constrained the
number of models and hyperparameter combinations that could be evaluated.</p>
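      <p>As one possible form of the oversampling mentioned above, the sketch below duplicates randomly chosen minority-class rows until both classes have the same size. It was not part of our experiments; the function name, seed, and toy rows are hypothetical. Such resampling would be applied to the training split only, so that evaluation data keep their original class distribution.</p>
      <preformat><![CDATA[
```python
import random
from collections import Counter


def random_oversample(rows, seed=0):
    """Duplicate random minority-class rows until both classes match in size.

    Each row is a (text, label) pair with a binary label.
    """
    by_label = {0: [], 1: []}
    for text, label in rows:
        by_label[label].append((text, label))
    minority, majority = sorted(by_label.values(), key=len)
    if not minority:
        return list(rows)
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(rows) + extra


# Toy training rows with a 3:1 imbalance toward the non-relevant class.
train = [("text a", 0), ("text b", 0), ("text c", 0), ("text d", 1)]
balanced = random_oversample(train)
class_counts = Counter(label for _, label in balanced)
```
]]></preformat>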
      <p>Overall, these findings support BERT’s potential for high-impact use cases where contextual understanding is essential.
As NLP methods for classification tasks continue to evolve, integrating advanced models like BERT
or domain-specific variants such as Opinion-BERT [14], which has been applied in mental health
analysis to detect nuanced sentiment and psychological states, offers a path toward deeper and more
reliable insights [15].</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During manuscript preparation, the authors used ChatGPT-4o for grammar and spelling check, then
reviewed and edited the content as needed. They take full responsibility for the publication’s content.</p>
      <p>[10] E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, M. McDermott, Publicly
available clinical BERT embeddings, in: A. Rumshisky, K. Roberts, S. Bethard, T. Naumann
(Eds.), Proceedings of the 2nd Clinical Natural Language Processing Workshop, Association
for Computational Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 72–78. URL:
https://aclanthology.org/W19-1909/. doi:10.18653/v1/W19-1909.</p>
      <p>[11] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks,
in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.</p>
      <p>[12] D. R. Cox, The regression analysis of binary sequences, Journal of the Royal Statistical Society:
Series B (Methodological) 20 (1958) 215–232.</p>
      <p>[13] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.</p>
      <p>[14] M. M. Hossain, M. S. Hossain, M. F. Mridha, M. Safran, S. Alfarhood, Multi task opinion enhanced
hybrid BERT model for mental health analysis, Scientific Reports 15 (2025) 3332. URL:
https://www.nature.com/articles/s41598-025-86124-6. doi:10.1038/s41598-025-86124-6, publisher:
Nature Publishing Group.</p>
      <p>[15] A. Gaurav, B. B. Gupta, K. T. Chui, BERT Based Model for Robust Mental Health Analysis in
Clinical Informatics, in: 2024 21st International Joint Conference on Computer Science and
Software Engineering (JCSSE), Phuket, Thailand, 2024, pp. 153–160.
doi:10.1109/JCSSE61278.2024.10613729.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>World Health Organization</collab>
          ,
          <source>Depressive disorder (depression)</source>
          (
          <year>2023</year>
          ). URL: https://www.who.int/news-room/fact-sheets/detail/depression, retrieved March 31,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>De Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Counts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Horvitz</surname>
          </string-name>
          ,
          <article-title>Predicting depression via social media</article-title>
          ,
          <source>in: Proceedings of the International AAAI Conference on Web and Social Media</source>
          , volume
          <volume>7</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>137</lpage>
          . URL: https://doi.org/10.1609/icwsm.v7i1.14432. doi:10.1609/icwsm.v7i1.14432.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Eichstaedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Merchant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Ungar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Crutchley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Preoţiuc-Pietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Asch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <article-title>Facebook language predicts depression in medical records</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences of the United States of America</source>
          <volume>115</volume>
          (
          <year>2018</year>
          )
          <fpage>11203</fpage>
          -
          <lpage>11208</lpage>
          . doi:10.1073/pnas.1802331115, epub 2018 Oct 15.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Coppersmith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Harman</surname>
          </string-name>
          ,
          <article-title>Quantifying mental health signals in Twitter</article-title>
          , in: P. Resnik, R. Resnik, M. Mitchell (Eds.),
          <source>Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality</source>
          , Association for Computational Linguistics, Baltimore, Maryland, USA,
          <year>2014</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>60</lpage>
          . URL: https://aclanthology.org/W14-3207/. doi:10.3115/v1/W14-3207.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Garrido-Merchan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gozalo-Brizuela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gonzalez-Carvajal</surname>
          </string-name>
          ,
          <article-title>Comparing bert against traditional machine learning models in text classification</article-title>
          ,
          <source>Journal of Computational and Cognitive Engineering</source>
          <volume>2</volume>
          (
          <year>2023</year>
          )
          <fpage>352</fpage>
          -
          <lpage>356</lpage>
          . URL: https://ojs.bonviewpress.com/index.php/JCCE/article/view/838. doi:10.47852/bonviewJCCE3202838.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2025: Early risk prediction on the internet (extended overview)</article-title>
          , in:
          <source>Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2025)</source>
          , Madrid, Spain, 9-12 September 2025, CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk 2025: Early risk prediction on the internet</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction - 16th International Conference of the CLEF Association, CLEF 2025, Madrid, Spain, September 9-12, 2025, Proceedings, Part II</source>
          , Lecture Notes in Computer Science, Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          , BM25, Springer US, Boston, MA,
          <year>2009</year>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>260</lpage>
          . URL: https://doi.org/10.1007/978-0-387-39940-9_921. doi:10.1007/978-0-387-39940-9_921.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          , in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423/. doi:10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>