<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VA-BO-INTERN: Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sandipan Majhi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paheli Bhattacharya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bosch Research and Technology Centre</institution>
          ,
          <addr-line>Bangalore</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology Kharagpur</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multistage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large language models (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyze their impact on domain generalization. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering</kwd>
        <kwd>Synthetic Data Generation</kwd>
        <kwd>Small Language Model Finetuning</kwd>
        <kwd>Indic NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large language models (LLMs) have significantly advanced natural language generation, understanding,
and reasoning. Despite their success, adapting these models to domain-specific applications remains
challenging due to two main factors: (i) general-purpose LLMs often lack specialized domain knowledge,
and (ii) high-quality annotated datasets are scarce and expensive to obtain. The cost and time demands
of manual annotation have therefore driven interest in synthetic data as a scalable alternative.</p>
      <p>
        LLMs, owing to their broad pre-training, can act as effective knowledge bases [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] and have been
shown to produce high-quality synthetic question–answer (QA) pairs [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. Synthetic datasets have
further demonstrated utility in addressing the limitations of low-resource domains [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], enabling
the creation of specialized training resources that would otherwise be infeasible. This has opened a
practical avenue for domain adaptation, especially in fields where curated open-source datasets are
extremely limited.
      </p>
      <p>At the same time, the emergence of lightweight language models provides new opportunities for
efficient domain adaptation. Smaller models are cheaper to finetune, faster at inference, and easier to
deploy in resource-constrained environments. While very large LMs are well-suited for generating
synthetic training data, compact models are more practical for downstream deployment. Thus,
combining synthetic data generation from large models with targeted finetuning of smaller ones represents a
promising strategy for building effective, domain-specific QA systems.</p>
      <p>In this work, we investigate this paradigm in the context of Hindi tourism, a domain where both
language resources and annotated datasets are limited. We generate synthetic QA pairs using large
LMs (LLaMA-70B and Phi-14B) and finetune a smaller model (LLaMA-8B) to evaluate its performance.
Beyond simple finetuning, we explore mixed-training methodologies that combine synthetic and
general-domain data, analyzing their effect on robustness and domain generalization. Our contributions are
threefold: (i) a data augmentation strategy tailored to low-resource, domain-specific QA, (ii) empirical
evidence that synthetic data can effectively adapt lightweight models, and (iii) a comparative analysis
of training strategies to identify setups that balance efficiency with domain performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Generating high-quality synthetic question-answer pairs using large language models (LLMs) has been
a key focus of recent research [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. Chia et al. [8] and Liu et al. [9] have shown
that zero-shot prompting can be highly effective for creating high-quality, structured synthetic data.
However, generating synthetic question-answer pairs can sometimes result in unintended redundancy.
Yadav et al. [10] suggest that exploring different sampling techniques could introduce
greater diversity, which may be beneficial for downstream tasks. The primary goal of generating
synthetic question-answer pairs is to improve model performance on question-answering tasks. Prior
work by Chowdhury and Chadha [11] demonstrated how synthetic data, particularly from
"in-the-wild" sources, can lead to performance gains and improve robustness to natural distribution shifts. Similarly,
Kramchaninova and Defauw [12] validated the effectiveness of combining synthetic data with original
training data, showing that this approach consistently outperforms models trained exclusively on
non-synthetic data, especially on domain-specific test sets. Another study by Harsha et al. [13] on the use of
synthetic data within the financial domain confirmed its effectiveness in boosting question-answering
performance in specialized fields.
      </p>
      <p>
        To achieve performance improvements on downstream question-answering tasks, several studies
have investigated different methods for finetuning models using a combination of synthetic and
original training datasets. Namboori et al. [14] proposed a finetuning approach that involves first
training on the synthetic data and then on the original training set, arguing that a model should
perform better if it is well-conditioned to a high-quality dataset. Conversely, Chada and Natarajan
[15] showed performance improvements by finetuning first on the original training data and then on a
small, additional amount of synthetic data. Other studies, including [
        <xref ref-type="bibr" rid="ref5">16, 17, 5</xref>
        ], also demonstrate that
continued finetuning on a small amount of synthetic data can lead to a significant performance uplift
in question answering. A study by Gurgurov et al. [18] utilized synthetic data curation by translating
English data into other low-resource languages and performing continued pretraining, illustrating how
synthetic data can aid in model alignment for new domains.
      </p>
      <p>Researchers have already created several benchmarks based on synthetic datasets, particularly
for low-resource domains. The Indic-QA Benchmark [19] used synthetic data generation techniques
to create question-answering datasets in 11 Indian languages. Similarly, IndiSentiment140 [20] is
another such dataset that uses machine translation to generate sentiment analysis datasets across 22
Indian languages. The IndicXTREME Benchmark [21] also leveraged machine translation to create new
synthetic datasets from existing English data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This study investigates the impact of synthetic data augmentation on the performance of small language
models (SLMs) on a long-form question-answering task.</p>
      <p>Synthetic Data Generation: As summarized in Table 1, we use contexts from the training set to generate
additional question-answer pairs. Specifically, we prompt larger LLMs with few-shot examples to create a new corpus of
question-answer pairs from these contexts.</p>
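      <p>The few-shot setup described above can be illustrated with a minimal sketch. The instruction wording, the field labels, and the names QAExample and build_fewshot_prompt are hypothetical and are not taken from the authors' prompts; the sketch only shows how demonstrations and a target context could be assembled into one prompt.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAExample:
    """One demonstration: a context with a question-answer pair."""
    context: str
    question: str
    answer: str


def build_fewshot_prompt(target_context: str, demos: List[QAExample], n_pairs: int = 3) -> str:
    """Assemble a few-shot prompt asking an LLM for new Hindi QA pairs.

    The instruction text and labels are illustrative assumptions.
    """
    parts = ["Generate question-answer pairs in Hindi, using only the given context."]
    for demo in demos:
        parts.append(
            f"Context: {demo.context}\nQuestion: {demo.question}\nAnswer: {demo.answer}"
        )
    # The target context comes last so the model continues from it.
    parts.append(f"Context: {target_context}\nWrite {n_pairs} new question-answer pairs.")
    return "\n\n".join(parts)
```
      </preformat>
      <p>In such a setup the demonstrations would be drawn from the original training set, and the returned string would be sent to the generator model.</p>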
      <p>Finetuning SLMs: We train SLMs using the synthetic data and the available training data. The workflow
is presented in Figure 1. We follow a mixed-training strategy:</p>
      <list list-type="bullet">
        <list-item><p>Baseline Model: We first finetune a small language model on the already available data to obtain the
baseline finetuned model.</p></list-item>
        <list-item><p>Continued Finetuning: We continue finetuning the baseline finetuned model on the synthetic data
to produce a continued finetuned model.</p></list-item>
        <list-item><p>Multi-Source Finetuning: In this setting, we use all of the available training data together with the augmented
synthetic data to finetune the SLM.</p></list-item>
      </list>
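      <p>As a rough illustration of how the three strategies differ in the data each training stage sees, consider the sketch below. The function name plan_stages, the dictionary keys, and the fixed shuffling seed are assumptions made for this example; the actual training code is not described in the text.</p>
      <preformat>
```python
import random
from typing import Dict, List, Sequence


def plan_stages(
    original: Sequence[str],
    synthetic: Dict[str, Sequence[str]],
    seed: int = 0,
) -> Dict[str, List[List[str]]]:
    """Return, for each strategy, the ordered list of training stages.

    Each stage is the list of examples used in one finetuning run.
    """
    all_synthetic = [ex for source in synthetic.values() for ex in source]

    # Multi-source: a single stage over the shuffled union of all data.
    mixed = list(original) + all_synthetic
    random.Random(seed).shuffle(mixed)

    return {
        "baseline": [list(original)],
        # Continued: original data first, synthetic data afterwards.
        "continued": [list(original), all_synthetic],
        "multi_source": [mixed],
    }
```
      </preformat>
      <p>The only difference between the continued and multi-source plans is ordering: the former exposes the model to synthetic data in a separate, later stage, while the latter mixes everything into one stage.</p>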
    </sec>
    <sec id="sec-4">
      <title>4. Dataset</title>
      <p>This study uses the Varanasi Tourism in Question Answer System (VATIKA) dataset published by
Gatla et al. [22], a publicly available resource from the Forum for Information Retrieval Evaluation (FIRE)
2025. This Hindi-language dataset consists of instances in which each context is paired with one or more
related question-answer pairs. The answers are typically long-form and abstractive, drawing on
the provided context as their source of information. The originally published dataset has three splits:
train, validation, and Test Data-1. A held-out test set, provided as part of
the shared task, contains only contexts and questions, without gold-standard answers; we refer to it
as Test Data-2. Table 1 presents the key statistics of the dataset.
On average, questions and answers in the VATIKA dataset are 13 and
16 words long, respectively, and there are about 2-3 question-answer pairs per context. The synthetic data
yields approximately 3 more QA pairs on average and has a similar question and answer
length distribution.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Settings</title>
      <p>Synthetic Data Generation: To generate synthetic data, we prompted LLAMA-3.1-70B [23] and Phi-4-14B [24]
in a few-shot format. The new corpus of question-answer pairs contains around
4,000 instances generated by LLAMA-3.1-70B and 33,000 instances generated by Phi-4-14B. We use
temperature = 0.7 and top-p = 0.9 for synthetic data generation with both models.</p>
      <p>Model Finetuning: We finetune LLAMA-3.1-8B [23] using the strategies described in Section 3. We
experiment with three distinct model configurations, listed below. The hyperparameters are in Table 2.</p>
      <list list-type="bullet">
        <list-item><p>M1: Baseline Model: LLAMA-3.1-8B was fine-tuned for 4 epochs exclusively on the original 13,092
training instances.</p></list-item>
        <list-item><p>M2: Continued Fine-Tuning: The baseline model (M1) checkpoint after 2 epochs underwent a second phase of
fine-tuning for another 2 epochs on the 33,000-instance synthetic dataset generated by Phi-4-14B.</p></list-item>
        <list-item><p>M3: Multi-Source Fine-Tuning: This model was fine-tuned for 4 epochs on a combined dataset of roughly 50,000 instances,
which included the original 13,000 training instances along with synthetic data from two distinct
large language models: 33,000 instances from Phi-4-14B and 4,000 instances from LLAMA-3.1-70B.</p></list-item>
      </list>
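      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Parameter</th><th>Value</th></tr>
          </thead>
          <tbody>
            <tr><td>Max sequence length</td><td>4096</td></tr>
            <tr><td>Per-device train batch size</td><td>2</td></tr>
            <tr><td>Gradient accumulation steps</td><td>4</td></tr>
            <tr><td>Warmup steps</td><td>5</td></tr>
            <tr><td>Learning rate scheduler type</td><td>"cosine"</td></tr>
            <tr><td>Number of epochs</td><td>4</td></tr>
          </tbody>
        </table>
      </table-wrap>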
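      <p>To make the listed hyperparameters concrete, the following sketch works out the effective batch size and an approximate number of optimizer updates for the baseline run. The device count is not stated in the text, so a single device is assumed, and training frameworks differ in how they treat the final partial batch, so the update count is only an estimate.</p>
      <preformat>
```python
import math

# Values taken from the hyperparameter table and the M1 description.
PER_DEVICE_BATCH = 2
GRAD_ACCUM_STEPS = 4
NUM_EPOCHS = 4
M1_TRAIN_INSTANCES = 13092
# Assumption: the number of devices is not reported, so one is used here.
NUM_DEVICES = 1


def effective_batch_size(per_device: int, accum_steps: int, devices: int) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * accum_steps * devices


def optimizer_updates(num_examples: int, epochs: int, batch_size: int) -> int:
    """Approximate update count, treating a final partial batch as one update."""
    return math.ceil(num_examples / batch_size) * epochs


batch = effective_batch_size(PER_DEVICE_BATCH, GRAD_ACCUM_STEPS, NUM_DEVICES)
m1_updates = optimizer_updates(M1_TRAIN_INSTANCES, NUM_EPOCHS, batch)
print(batch, m1_updates)
```
      </preformat>
      <p>Under these assumptions the 5 warmup steps cover only a very small fraction of training, so the cosine schedule governs almost the entire run.</p>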
      <p>Evaluation: We report our model’s performance using the token-based metrics ROUGE-L<sup>1</sup> and BLEU<sup>2</sup>. In
our implementation of the ROUGE-L scores presented in Table 3, we modify the default tokenization
function to handle Hindi words and characters. For the semantics-based metric, we compute BERTScore<sup>3</sup> between
predicted answers and gold answers on the validation and test splits.</p>
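      <p>The idea of a Hindi-aware ROUGE-L can be sketched in plain Python as follows. This is not the implementation used for the reported scores: the tokenizer (replace Unicode punctuation, including the danda, with spaces and split on whitespace) and the function names are assumptions chosen for illustration. The score is the F1 of the longest common subsequence between predicted and reference tokens.</p>
      <preformat>
```python
import unicodedata
from typing import List


def tokenize_hindi(text: str) -> List[str]:
    """Whitespace tokenizer that keeps Devanagari letters and vowel signs.

    Punctuation characters (Unicode categories starting with "P") are
    replaced by spaces, so the danda does not stick to the last word.
    """
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return cleaned.split()


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 between one prediction and one reference."""
    pred, ref = tokenize_hindi(prediction), tokenize_hindi(reference)
    if not pred or not ref:
        return 0.0
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```
      </preformat>
      <p>A tokenizer that keeps only ASCII letters and digits would discard Devanagari text entirely, which is why some form of Hindi-aware tokenization is needed before computing overlap.</p>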
    </sec>
    <sec id="sec-6">
      <title>6. Results and Analysis</title>
      <p>In this section, we first provide our evaluation of the different training strategies on Validation and Test
Data-1. Then, we present the organizers’ evaluation on Test Data-2.</p>
      <p>
        Validation and Test Data-1: Our experiments, presented in Table 3, reveal several key findings
regarding model training and distribution robustness. First, the model trained only
on the original data (M1[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) performed best on the validation set, while the model trained on combined
original and synthetic data (M3[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) excelled on Test Data-1. This disparity highlights the models’
sensitivity to distribution shifts.
      </p>
      <p>
        However, the combined multi-source model (M3[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) underperformed in BLEU-2, which is a precision-based
metric. A potential reason is that combining multi-source data may introduce conflicting answers
for similar questions, leading to ambiguity and performance degradation.
      </p>
      <p>
        Held-out Test Data-2: Table 4 reports our method’s performance on the held-out
Test Data-2 split, whose gold answers are undisclosed. Our model configurations rank consistently near the top, with M2[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
achieving the best QA-F1 score, second place in BLEU-2, third place in
BLEU-1, and fourth place in ROUGE-1 and ROUGE-2. The results suggest that supplementing
models with a large quantity of high-quality synthetic data can not only improve performance on
downstream tasks but also enhance their robustness to unseen data.
      </p>
      <p>
        A crucial insight comes from the two-stage trained model (M2[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). Despite its moderate performance
on Test Data-1, it achieved superior BLEU-2 and QA-F1 scores on the held-out Test Data-2 (Table 4).
This suggests that late exposure to synthetic data is effective for building distribution robustness.
      </p>
      <p>Table 3 (values not recoverable from the source): configurations M1 (Orig 13k), M2 (M1 + 33k), and M3 (All 50k), with ROUGE-L among the reported metrics.</p>
      <p>1 https://huggingface.co/spaces/evaluate-metric/rouge
2 https://huggingface.co/spaces/evaluate-metric/bleu
3 https://huggingface.co/spaces/evaluate-metric/bertscore</p>
      <p>Analysing the Synthetic Data: Table 1 outlines the quantitative differences between the original and
synthetic datasets, while Table 5 presents a qualitative comparison of the question-answer (QA) pairs
they generated. The outputs from LLAMA-3.1-70B and Phi-4-14B demonstrate considerable overlap,
underscoring the importance of a data selection stage to filter for quality. A potential direction
for future research in developing such quality checks is the joint evaluation of both question and answer
within each synthetic pair.</p>
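      <p>One simple form such a data selection stage could take is near-duplicate filtering over the question and answer jointly. The sketch below is not part of the reported experiments: the Jaccard token overlap, the 0.8 threshold, and the function names are assumptions used only to illustrate how overlapping synthetic pairs from different generators might be pruned.</p>
      <preformat>
```python
from typing import List, Set, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def token_set(pair: QAPair) -> Set[str]:
    """Tokens of the question and answer taken together."""
    question, answer = pair
    return set((question + " " + answer).split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def drop_near_duplicates(pairs: List[QAPair], threshold: float = 0.8) -> List[QAPair]:
    """Keep a pair only if it is not too similar to any pair kept so far."""
    kept: List[QAPair] = []
    kept_tokens: List[Set[str]] = []
    for pair in pairs:
        tokens = token_set(pair)
        if all(threshold > jaccard(tokens, seen) for seen in kept_tokens):
            kept.append(pair)
            kept_tokens.append(tokens)
    return kept
```
      </preformat>
      <p>This comparison is quadratic in the number of pairs, so a corpus of tens of thousands of pairs would likely need grouping by context or an approximate method before it becomes practical.</p>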
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>In this work, we proposed a multi-stage finetuning strategy for lightweight language models in the
Hindi tourism domain, leveraging both original and synthetic training data. Models trained with
continued finetuning (first on original data, then on synthetic data) achieved the strongest BLEU-2 and QA-F1
scores among our configurations on the held-out Test Data-2. This staged exposure allows the model to retain grounding in authentic data while benefiting
from the scale and diversity of synthetic examples, improving robustness in domain-specific question
answering. We also found that indiscriminate or excessive mixing of multi-source synthetic data can
degrade performance, highlighting the importance of careful curation and controlled integration in
low-resource settings.</p>
      <p>In future work, our approach can be extended to other low-resource languages to test its
generalizability. Future work also includes systematic evaluation of synthetic data quality, potentially
using LLM-based filtering methods. Overall, this study demonstrates that large models can generate
synthetic data and small models can effectively adapt to it, enabling scalable and robust QA systems in
low-resource domains.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini-Flash 2.5 in order to rectify: Grammar,
spelling check and evaluate microstructure. After using these tool(s)/service(s), the author(s) reviewed
and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>Table 5 (contents not recoverable from the source): columns Context, LLAMA-3.1-70B, and Phi-4-14B.</p>
      <p>[7, continued] JMIR Formative Research 8 (2024) e52164.</p>
      <p>[8] Y. K. Chia, L. Bing, S. Poria, L. Si, RelationPrompt: Leveraging prompts to generate synthetic data for zero-shot relation triplet extraction, in: Findings of the ACL, 2022, pp. 45–57.</p>
      <p>[9] S. Liu, Y. Li, J. Li, S. Yang, Y. Lan, Unleashing the power of large language models in zero-shot relation extraction via self-prompting, in: Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, 2024, pp. 13147–13161.</p>
      <p>[10] V. Yadav, H. j. Kwon, V. Srinivasan, H. Jin, Explicit over implict: Explicit diversity conditions for effective question answer generation, in: LREC-COLING 2024, 2024, pp. 6876–6882.</p>
      <p>[11] A. Chowdhury, A. Chadha, Generative data augmentation using LLMs improves distributional robustness in question answering, in: Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, 2024, pp. 258–265.</p>
      <p>[12] A. Kramchaninova, A. Defauw, Synthetic data generation for multilingual domain-adaptable question answering systems, in: Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, 2022, pp. 151–160.</p>
      <p>[13] C. Harsha, K. S. Phogat, S. Dasaratha, S. A. Puranam, S. Ramakrishna, Synthetic data generation using large language models for financial question answering, in: Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal), Abu Dhabi, UAE, 2025, pp. 76–95.</p>
      <p>[14] A. Namboori, S. Mangale, A. Rosenbaum, S. Soltan, GeMQuAD: Generating multilingual question answering datasets from large language models using few shot learning, arXiv e-prints (2024) arXiv–2404.</p>
      <p>[15] R. Chada, P. Natarajan, FewshotQA: A simple framework for few-shot learning of question answering tasks using pre-trained text-to-text models, in: EMNLP, 2021, pp. 6081–6090.</p>
      <p>[16] X. Chen, J.-Y. Jiang, W.-C. Chang, C.-J. Hsieh, H.-F. Yu, W. Wang, MinPrompt: Graph-based minimal prompt data augmentation for few-shot question answering, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 254–266.</p>
      <p>[17] A. Ushio, F. Alva-Manchego, J. Camacho-Collados, An empirical comparison of LM-based question and answer generation methods, in: Findings of the ACL, Toronto, Canada, 2023, pp. 14262–14272.</p>
      <p>[18] D. Gurgurov, M. Hartmann, S. Ostermann, Adapting multilingual LLMs to low-resource languages with knowledge graphs via adapters, in: Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024), 2024, pp. 63–74.</p>
      <p>[19] A. K. Singh, V. Kumar, R. Murthy, J. Sen, A. Mittal, G. Ramakrishnan, INDIC QA BENCHMARK: A multilingual benchmark to evaluate question answering capability of LLMs for Indic languages, in: Findings of NAACL, Albuquerque, New Mexico, 2025, pp. 7689–7698.</p>
      <p>[20] S. Kumar, R. Sanasam, S. Nandi, IndiSentiment140: Sentiment analysis dataset for Indian languages with emphasis on low-resource languages using machine translation, in: NAACL, 2024, pp. 7689–7698.</p>
      <p>[21] S. Doddapaneni, R. Aralikatte, G. Ramesh, S. Goyal, M. M. Khapra, A. Kunchukuttan, P. Kumar, Towards leaving no Indic language behind: Building monolingual corpora, benchmark and models for Indic languages, in: ACL, 2023, pp. 12402–12426.</p>
      <p>[22] P. Gatla, Anushka, N. Kanwar, G. Sahoo, R. K. Mundotiya, Tourism question answer system in Indian language using domain-adapted foundation models, arXiv preprint (2025).</p>
      <p>[23] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 herd of models, CoRR (2024).</p>
      <p>[24] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kaufmann, et al., Phi-4 technical report, arXiv preprint arXiv:2412.08905 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <article-title>Head-to-tail: How knowledgeable are large language models (LLMs)? A.K.A. will LLMs replace knowledge graphs?</article-title>
          ,
          <source>in: NAACL</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>325</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
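          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Can generative pre-trained language models serve as knowledge bases for closed-book QA?</article-title>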
          ,
          <source>in: ACL-IJCNLP</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3241</fpage>
          -
          <lpage>3251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Scaria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Chenna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Subramani</surname>
          </string-name>
          ,
          <article-title>How good are modern LLMs in generating relevant and high-quality questions at different Bloom's skill levels for Indian high school social science curriculum?</article-title>
          ,
          <source>in: Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications (BEA</source>
          <year>2024</year>
          ), Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Abdelghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sauzéon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Oudeyer</surname>
          </string-name>
          ,
          <article-title>Selecting better samples from pre-trained LLMs: A case study on question generation</article-title>
          ,
          <source>in: Findings of the ACL</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>12952</fpage>
          -
          <lpage>12965</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bartezzaghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <article-title>Prompting-based synthetic data generation for few-shot question answering</article-title>
          ,
          <source>in: LREC-COLING</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>13168</fpage>
          -
          <lpage>13178</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tengler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Brandhofer</surname>
          </string-name>
          ,
          <article-title>Exploring the difference and quality of AI-generated versus human-written texts</article-title>
          ,
          <source>Discover Education</source>
          <volume>4</volume>
          (
          <year>2025</year>
          )
          <fpage>113</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. T.</given-names>
            <surname>Hakam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Prill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Korte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lovreković</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ostojić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ramadanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Muehlensiepen</surname>
          </string-name>
          ,
          <article-title>Human-written vs AI-generated texts in orthopedic academic literature: Comparative qualitative analysis,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>