<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Development of a Biomedical Question Answering System Based on Transformer Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lila López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan C. Martinez-Santos</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Puertas</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Tecnológica de Bolívar, School of Engineering</institution>
          ,
          <addr-line>Cartagena de Indias 130010</addr-line>
          ,
          <country country="CO">Colombia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Recent advances in artificial intelligence have enabled the automation of complex tasks in the biomedical domain, such as automatic question answering. Within this framework, the international BioASQ challenge encourages the development of systems capable of understanding natural language questions and generating accurate answers based on the scientific literature. This work aims to design a system that classifies question types and produces suitable responses accordingly. We implemented a modular pipeline with six main stages: (1) question type classification, (2) linguistic preprocessing, (3) dynamic routing to specialized models, (4) hyperparameter configuration, (5) context retrieval, and (6) performance evaluation, including the predicted type, execution time, and per-instance metrics. The system demonstrated strong performance in both the classification and answer generation tasks. In addition, a detailed analysis of each question helped identify specific errors and areas for improvement in each question category.</p>
      </abstract>
      <kwd-group>
        <kwd>Biomedical</kwd>
        <kwd>natural language processing</kwd>
        <kwd>automatic response generation</kwd>
        <kwd>classification</kwd>
        <kwd>evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, the rapid growth of the biomedical literature has led to significant advances, but it
has also posed major challenges for the retrieval and synthesis of scientific knowledge. The medical
community requires intelligent systems that enable efficient access to relevant, accurate, and
up-to-date information. Some authors argue that natural language questions are the right interface for
this task [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The BioASQ Challenge has been held annually since 2013. Its goal is to evaluate the ability of
automated systems to answer complex biomedical questions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Participating systems combine natural language processing
(NLP), information retrieval, and machine learning technologies [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These initiatives have incorporated
Transformer-based models such as BigBird, BART Large CNN, and LongT5 to process long-form texts
effectively [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Retrieving biomedical information from abstracts is a growing challenge. This challenge is due
to information overload and the increasing volume of scientific literature. Biomedical experts have
reported difficulties in finding precise information within full-length documents [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        However, in QA systems that respond to user-formulated questions, some works have reported
limited performance. For example, a study using the SQuAD dataset achieved a precision of 69.69%
and an average correct response rate of 69.93% [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>The proposed system integrates advanced natural language processing techniques, allowing the
identification of the type of question posed and the generation of coherent and relevant answers
[6]. Furthermore, the authors implemented an automatic evaluation methodology to analyze the
system’s performance in terms of accuracy, coverage, and relevance of the responses [7]. This proposal
contributes to the advancement of intelligent solutions for information retrieval in the biomedical field.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <p>We now present the techniques used in each methodological phase, together with the general
architecture of the system [8]. We developed this design in the context of the BioASQ challenge.</p>
      <sec id="sec-2-0">
        <title>2.1. Data</title>
        <p>The dataset used originates from Task B of the BioASQ challenge. This task assesses the ability of
NLP models to answer biomedical questions formulated in natural language [9]. The corpus contains
approximately 5,389 questions categorized into four types: yes/no, factoid, list, and summary [10].
The distribution is as follows: 1,600 factoid questions (30%), 1,459 yes/no (27%), 1,283 summary (24%),
and 1,047 list (19%). This diversity enables evaluation of model performance on both closed and open
question answering tasks, as illustrated in Table 1.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Methodology</title>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Distribution of question types in the BioASQ Task B corpus.</p></caption>
          <table>
            <thead>
              <tr><th>Type</th><th>Total</th><th>Percentage</th></tr>
            </thead>
            <tbody>
              <tr><td>Factoid</td><td>1,600</td><td>30%</td></tr>
              <tr><td>Yes/no</td><td>1,459</td><td>27%</td></tr>
              <tr><td>Summary</td><td>1,283</td><td>24%</td></tr>
              <tr><td>List</td><td>1,047</td><td>19%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The proposed biomedical question-answering system follows a modular architecture composed of six main
stages [11], as illustrated in Figure 1.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Question Type Classification</title>
        <p>Each instance in the dataset was processed using a supervised classification model. The goal was to
predict its category among the four possible types [7], as shown in Table 2. We used a Transformer-based
model for this task [12], specifically DistilBERT, trained on a labeled set of biomedical
questions [13]. The model outputs the predicted category for each question.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Linguistic Preprocessing</title>
        <p>Once the question type has been identified, the text undergoes preprocessing. This includes
normalization, which converts the text to lowercase, removes duplicate spaces, and cleans special
characters [14], followed by tokenization and truncation. These operations ensure compatibility
with downstream language models.</p>
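        <p>The preprocessing step can be sketched as follows. This is a simplified illustration assuming plain whitespace tokenization; the actual pipeline applies the model's own subword tokenizer.</p>
        <preformat>
```python
import re

def preprocess(text: str, max_tokens: int = 512) -> list[str]:
    """Sketch of the normalization, tokenization, and truncation steps.

    Normalization: lowercase, strip special characters, collapse duplicate
    whitespace. Tokenization here is plain whitespace splitting; the real
    pipeline would use the downstream model's subword tokenizer.
    """
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # clean special characters
    text = re.sub(r"\s+", " ", text).strip()   # remove duplicate spaces
    tokens = text.split()
    return tokens[:max_tokens]                 # truncate to the model limit
```
        </preformat>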
      </sec>
      <sec id="sec-2-4">
        <title>2.5. Dynamic Routing and Specialized Models</title>
        <p>Once the question type was classified, it was dynamically routed to a specialized answering model
based on its category, as shown in Table 3. This phase employed pre-trained models adapted to specific
tasks: generative models for summary questions, extractive models for factoid and list questions, and
classification models for yes/no questions [15].</p>
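        <p>The routing logic can be expressed as a simple lookup table. The model names below are placeholders for illustration, not the exact checkpoints used in this work.</p>
        <preformat>
```python
# Hypothetical routing table; model names are placeholders, not the exact
# checkpoints used in this work.
ROUTES = {
    "yesno":   {"task": "classification", "model": "distilbert-yesno"},
    "factoid": {"task": "extractive-qa",  "model": "bert-squad2"},
    "list":    {"task": "extractive-qa",  "model": "bert-squad2"},
    "summary": {"task": "generation",     "model": "bart-large-cnn"},
}

def route(question_type: str) -> dict:
    """Return the specialized model configuration for a question type."""
    if question_type not in ROUTES:
        raise ValueError(f"unknown question type: {question_type}")
    return ROUTES[question_type]
```
        </preformat>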
      </sec>
      <sec id="sec-2-5">
        <title>2.6. Hyperparameters</title>
        <p>We trained the models using optimized hyperparameters to achieve the best performance. Key
parameters included the base model, the number of epochs, the batch size, the maximum sequence length,
and the tokenization technique. These are summarized in Table 4.</p>
        <p>In addition, we documented the hardware environment used during the experiments, including the
operating system, virtual environment, RAM, CPU, and the estimated training time of the classifier
[16]. Details are shown in Table 5.</p>
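        <p>Schematically, the classifier configuration gathers the parameters named above. The values shown are placeholders for illustration; the actual settings are those reported in Table 4.</p>
        <preformat>
```python
# Illustrative hyperparameter configuration for the question-type classifier.
# The values here are placeholders showing the parameters the text names;
# the settings actually used are reported in Table 4.
CLASSIFIER_CONFIG = {
    "base_model": "distilbert-base-uncased",  # base Transformer checkpoint
    "epochs": 3,                              # number of training epochs
    "batch_size": 16,                         # examples per optimization step
    "max_seq_length": 128,                    # token truncation limit
    "tokenization": "wordpiece",              # subword tokenization technique
}
```
        </preformat>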
      </sec>
      <sec id="sec-2-6">
        <title>2.7. Context Retrieval</title>
        <p>For questions requiring external information (factoids, lists, summaries), we retrieved relevant context.
We performed the retrieval using a BM25-based technique. We applied this method over an index built
from biomedical articles in PubMed. The retrieved context serves as input to generate more accurate
answers.</p>
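        <p>A minimal in-memory sketch of Okapi BM25 scoring, assuming pre-tokenized documents; the actual system scores an index built from PubMed abstracts rather than an in-memory list.</p>
        <preformat>
```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    A minimal in-memory sketch of the retrieval step; the real system
    queries an index built from PubMed abstracts.
    """
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    # document frequency of each query term
    df = {t: sum(1 for d in docs_tokens if t in d) for t in set(query_tokens)}
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        s = 0.0
        for t in query_tokens:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
            s += idf * norm
        scores.append(s)
    return scores
```
        </preformat>
        <p>Documents sharing query terms receive positive scores; the top-scoring abstracts become the context passed to the answering model.</p>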
      </sec>
      <sec id="sec-2-7">
        <title>2.8. Performance Evaluation</title>
        <p>We applied question-type-specific evaluation metrics in accordance with the BioASQ
guidelines. For yes/no questions, we used accuracy and F1 score. We evaluated factoid questions using
strict and lenient accuracy. We assessed list questions with precision, recall, and F1 score. Finally, we
evaluated summary questions using ROUGE metrics (ROUGE-1, ROUGE-2, and ROUGE-L).</p>
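        <p>The factoid and list metrics can be sketched as follows, assuming the BioASQ convention that lenient accuracy accepts the gold answer anywhere in the top five candidates.</p>
        <preformat>
```python
def factoid_accuracy(gold, ranked_candidates):
    """Strict: the gold answer is the top candidate; lenient: the gold
    answer appears anywhere in the top five (BioASQ convention)."""
    strict = 1.0 if ranked_candidates and ranked_candidates[0] == gold else 0.0
    lenient = 1.0 if gold in ranked_candidates[:5] else 0.0
    return strict, lenient

def list_prf(gold_items, predicted_items):
    """Precision, recall, and F1 score for a list-type answer."""
    gold, pred = set(gold_items), set(predicted_items)
    tp = len(gold.intersection(pred))
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```
        </preformat>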
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed System Architecture</title>
      <p>The proposed architecture for the biomedical question-answering system [17] was designed based on
the methodology shown in Figure 1. We defined hyperparameters, hardware configuration, and dataset
version. Semantic retrieval and context expansion techniques were also integrated [18]. We employed
this complete setup in the experiments conducted, as illustrated in Figure 2.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>During the evaluation phase of the proposed system, we processed a total of 50 biomedical questions
from the test set of the BioASQ Task B challenge [19]. These questions were pre-labeled according to
the four types defined by the task: yes/no, factoid, list, and summary [20]. The system automatically
classified the question type, retrieved the relevant context, applied specialized models—including
fine-tuned versions of DistilBERT, BioBERT, and BERT-SQuAD2 per type—and generated an answer [21],
which we evaluated using type-specific metrics.</p>
      <sec id="sec-4-1">
        <title>4.1. Analysis of Results</title>
        <p>
          The experiments conducted for the BioASQ challenge show strong performance on factoid and list
questions, with strict accuracy and F1 scores of 1.0 in several cases. In contrast, although yes/no
questions reached 100% overall accuracy, their F1 score was 0.0 [22]. This indicates issues
related to class imbalance between "yes" and "no" answers [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. These results are detailed in Table 6.
        </p>
        <p>The results are also compared with those obtained by BioASQ participants in previous years [23].
We observe similar performance patterns [24]. These findings are presented in Tables 7 and 8.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Discussion of Results</title>
        <p>In this context, the results highlight the effectiveness of the proposed pipeline in both question-type
classification and answer generation, particularly for factoid and list questions, where we accurately
extracted specific information. The combination of supervised classification techniques and pre-trained
language models enabled the generation of coherent and contextually relevant responses for yes/no
and summary questions.</p>
        <p>However, the low F1 scores observed for yes/no questions reveal a limitation of the model in accurately
distinguishing between binary responses. We attribute this issue to class imbalance during training,
as well as to the fact that we did not use models explicitly fine-tuned for the biomedical domain. Similarly,
the variability in ROUGE scores for summary questions indicates that the quality of the generated
responses is highly dependent on the content and relevance of the retrieved context [25].</p>
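        <p>A toy example illustrates one way this pattern (perfect accuracy alongside an F1 score of 0.0) can arise: when F1 is computed for a class that never occurs in a skewed evaluation slice. The data below are illustrative, not the actual test set.</p>
        <preformat>
```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_for_class(gold, pred, positive):
    """F1 score for one class; returns 0.0 when the class is never
    correctly predicted (including when it never occurs at all)."""
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(p == positive and g != positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy skewed slice: every gold answer is "yes" and the model always
# predicts "yes".  Accuracy is 1.0, yet F1 for the "no" class is 0.0.
gold = ["yes"] * 10
pred = ["yes"] * 10
```
        </preformat>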
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The development of the biomedical question-answering system proposed in this work demonstrates its
feasibility. We implemented a modular and specialized pipeline. It integrates multiple machine learning
models and information retrieval techniques. The system automatically classifies the type of question.
Then, it routes each instance to the appropriate model based on its category. Finally, it evaluates the
responses using type-specific metrics. In this way, the solution addresses both closed and open question
types.</p>
      <p>In summary, the developed system represents a significant contribution to the field. It advances
artificial intelligence tools applied to biomedical knowledge retrieval and understanding. This approach
opens opportunities for future applications in clinical and research settings.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>We propose a Reinforcement Learning from Human Feedback (RLHF) approach to enhance response
selection. The initial policy will rely on pre-trained generative models, such as BART or T5, which will
produce multiple candidate responses.</p>
      <p>We also propose addressing the poor performance on yes/no questions with specialized, fine-tuned
models trained on balanced and enriched datasets, using corpora such as BioASQ yes/no and
PubMedYesNo, and exploring binary classification models such as BioBERT or RoBERTa-bio.</p>
      <p>For summary-type questions, the goal is to enhance the coverage and fidelity of the generated responses.
We propose fine-tuning encoder-decoder models such as BART or BioPEGASUS on biomedical
multi-reference summarization datasets, such as PubMedQA summaries or MEDIQA.</p>
    </sec>
    <sec id="sec-7">
      <title>CRediT authorship contribution statement</title>
      <p>Lila López: Methodology, data curation, system, writing, original draft. Juan C. Martinez-Santos:
Review and evaluation. Edwin Puertas: Review, formal analysis, and validation.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors used generative AI tools, specifically ChatGPT and Grammarly, to support the writing
process. These tools were employed for grammar and spelling checks, as well as for paraphrasing and
rewording parts of the text. The authors take full responsibility for the content of the paper.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>The authors express their gratitude to Call 933 "Training in National Doctorates with a Territorial,
Ethnic and Gender Focus in the Framework of the Mission Policy — 2023" of the Ministry of Science,
Technology and Innovation (Minciencias). In addition, we thank the team of the Artificial Intelligence
Laboratory VerbaNex, affiliated with the UTB, for their contributions to this project.</p>
      <p>[6] D. Weissenborn, M. Schroeder, G. Tsatsaronis, Answering complex questions with open-domain
reading comprehension systems, arXiv preprint arXiv:1906.01071 (2019). URL: https://arxiv.org/
abs/1906.01071.
[7] A. Nentidis, K. Bougiatiotis, A. Krithara, G. Paliouras, Overview of bioasq 2021: The ninth
bioasq challenge on large-scale biomedical semantic indexing and question answering, in: CEUR
Workshop Proceedings, volume 2959, 2021. URL: http://ceur-ws.org/Vol-2959.
[8] A. Mutawa, S. Sruthi, A comparative evaluation of transformers and deep learning models for
arabic meter classification, Applied Sciences 15 (2025) 4941.
[9] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn,
A. Krithara, S. Petridis, D. Polychronopoulos, et al., An overview of the bioasq large-scale
biomedical semantic indexing and question answering competition, BMC bioinformatics 16
(2015) 1–28.
[10] J. Mey, International conference on computational linguistics, STUF-Language Typology and
Universals 18 (1965) 589–592.
[11] V. Sharmila, S. Kannadhasan, A. R. Kannan, P. Sivakumar, V. Vennila, Challenges in Information,
Communication and Computing Technology: Proceedings of the 2nd International Conference on
Challenges in Information, Communication, and Computing Technology (ICCICCT 2024), April
26th &amp; 27th, 2024, Namakkal, Tamil Nadu, India, CRC Press, 2024.
[12] M. Lutfillayev, O. Narkulov, Artificial intelligence, blockchain, computing and security: Volume 2,
2023, pp. 712–718. DOI: https://doi.org/10.1201/9781032684994-115.
[13] A. M. Striuk, Embracing emerging technologies: Insights from the 6th workshop for young
scientists in computer science &amp; software engineering, CEUR Workshop Proceedings, 2024.
[14] E. Martinez, J. Cuadrado, J. C. M. Santos, E. Puertas, Verbanex ai at clef exist 2024: detection of
online sexism using transformer models and profiling techniques, environments 5 (2024) 7.
[15] E. H. Yossy, D. Suhartono, A. Trisetyarso, W. Budiharto, Question classification of university
admission using named-entity recognition (ner), in: 2023 10th International Conference on
Information Technology, Computer, and Electrical Engineering (ICITACEE), IEEE, 2023, pp. 20–25.
[16] H. Dong, V. Suárez-Paniagua, W. Whiteley, H. Wu, Explainable automated coding of clinical notes
using hierarchical label-wise attention networks and label embedding initialisation, Journal of
biomedical informatics 116 (2021) 103728.
[17] Y. Yan, B.-W. Zhang, X.-F. Li, Z. Liu, List-wise learning to rank biomedical question-answer pairs
with deep ranking recursive autoencoders, PloS one 15 (2020) e0242061.
[18] A. B. Barlybayev, A. S. Mukanova, Advancements in geospatial question-answering systems:
A case study on the implementation in the kazakh language, in: 2024 IEEE 3rd International
Conference on Problems of Informatics, Electronics and Radio Engineering (PIERE), IEEE, 2024,
pp. 1710–1715.
[19] M. Sarrouti, S. O. El Alaoui, Sembionlqa: A semantic biomedical question answering system for
retrieving exact and ideal answers to natural language questions, Artificial intelligence in medicine
102 (2020) 101767.
[20] A. Nentidis, G. Katsimpras, A. Krithara, G. Paliouras, Overview of bioasq tasks 12b and synergy12
in clef2024, Working Notes of CLEF 2024 (2024).
[21] K. Khelil, G. Besbes, H. Baazaoui-Zghal, Semantic question answering: Deep learning and nosql
solution for the medical domain, in: 2024 IEEE International Conference on Big Data (BigData),
IEEE, 2024, pp. 6486–6493.
[22] Y. Du, Q. Li, L. Wang, Y. He, Biomedical-domain pre-trained language model for extractive
summarization, Knowledge-Based Systems 199 (2020) 105964.
[23] A. Nentidis, G. Katsimpras, A. Krithara, M. Krallinger, M. R. Ortega, N. Loukachevitch,
A. Sakhovskiy, E. Tutubalina, G. Tsoumakas, G. Giannakoulas, et al., Bioasq at clef2025: The
thirteenth edition of the large-scale biomedical semantic indexing and question answering challenge,
in: European Conference on Information Retrieval, Springer, 2025, pp. 407–415.
[24] M. Sarrouti, D. Gupta, A. B. Abacha, D. Demner-Fushman, Nlm at bioasq synergy 2021: Deep
learning-based methods for biomedical semantic question answering about covid-19., in: CLEF
(Working Notes), 2021, pp. 335–350.
[25] M.-T. C. Evans, M. Latifi, M. Ahsan, J. Haider, Leveraging semantic text analysis to improve the
performance of transformer-based relation extraction, Information 15 (2024) 91.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Sadhuram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <article-title>Natural language processing based new approach to design factoid question answering system</article-title>
          ,
          <source>in: 2020 Second International Conference on Inventive Research in Computing Applications (ICIRCA)</source>
          , IEEE,
          <year>2020</year>
          , pp.
          <fpage>276</fpage>
          -
          <lpage>281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsatsaronis</surname>
          </string-name>
          , G. Balikas,
          <string-name>
            <given-names>P.</given-names>
            <surname>Malakasiotis</surname>
          </string-name>
          , I. Partalas,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zschunke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Alvers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petridis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Gaussier</surname>
          </string-name>
          , et al.,
          <article-title>An overview of the bioasq large-scale biomedical semantic indexing and question answering competition</article-title>
          ,
          <source>BMC bioinformatics 16</source>
          (
          <year>2015</year>
          )
          <fpage>138</fpage>
          . doi:10.1186/s12859-015-0564-6.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. Devendra</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srihari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arvind</surname>
          </string-name>
          , W. Viriyasitavat,
          <article-title>Biomedical event extraction on input text corpora using combination technique based capsule network</article-title>
          ,
          <source>Sādhanā 47</source>
          (
          <year>2022</year>
          )
          <fpage>198</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Naveed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wasim</surname>
          </string-name>
          ,
          <article-title>Ideal answer generation for biomedical questions using abstractive summarization</article-title>
          ,
          <source>in: 2023 25th International Multitopic Conference (INMIC)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>Pre-trained language model for biomedical question answering</article-title>
          ,
          <source>in: Proceedings of the BioNLP Workshop</source>
          <year>2020</year>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>79</fpage>
          -
          <lpage>85</lpage>
          . doi:10.18653/v1/2020.bionlp-1.10.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>