<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Towards Indian Intelligent Tourism Assistance: Design and Evaluation of the VATIKA QA Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Praveen Gatla</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anushka</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nabanita Sadhukhan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajesh Kumar Mundotiya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>Department of Computer Science and Engineering, Indian Institute of Technology Bhilai</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>2</label>
          <institution>Department of Humanistic Studies, Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>3</label>
          <institution>Department of Linguistics, Faculty of Arts, Banaras Hindu University</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The VATIKA-2025 shared task aims to advance research in Indic language knowledge augmentation, focusing on generating context-aware answers grounded in culturally rich narratives. Designed as a benchmarking challenge for Indian language technologies, the VATIKA task provides participants with a carefully curated dataset and evaluates system performance through established NLG and QA metrics, including BLEU, ROUGE, and QA-F1. A total of ten teams participated in the task, of which eight submitted working notes detailing their methodologies. Submissions demonstrated substantial variation in system performance, reflecting diverse modeling strategies such as fine-tuned language models, prompted LLMs, and ensemble-based approaches. The best-performing systems, VA-BO-INTERN (Run-3), IReL (Run-3), and Scaler (Run-1), achieved QA-F1 scores of 0.5757, 0.5507, and 0.5050, respectively, showing strong competency in generating high-quality, semantically aligned responses. This overview paper presents the task design, datasets, evaluation methodology, and a detailed comparative analysis of all team submissions to provide insights into current progress and future directions for Indic knowledge-grounded NLP research.</p>
      </abstract>
      <kwd-group>
        <kwd>Question-Answer</kwd>
        <kwd>Tourism</kwd>
        <kwd>Hindi</kwd>
        <kwd>Benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Varanasi, often described as the spiritual capital of India, has immense historical, cultural, and religious
significance. Every corner of the city tells a story, whether it is the sight of pilgrims taking ritual baths
in the Ganga river at the ghats, the sound of temple bells echoing through narrow lanes, or the smell of
street food mingling with the chants of evening aarti. For first-time visitors, these experiences can
be profoundly moving yet simultaneously overwhelming, raising questions about the significance of
rituals, the history of sacred sites, or how to navigate the city’s complex spiritual geography.</p>
      <p>In this context, intelligent systems tailored for tourism can serve as valuable companions, providing
accurate, contextual, and easily understandable information in a language that resonates with users.
Considering this, the VATIKA 2025 shared task was conceived to explore the development of question
answering systems specifically for Varanasi’s tourism domain, with a focus on Hindi as the primary
language. This resource enables participants not only to benchmark their systems but also to engage
with the challenges that arise from working with low-resource languages in culturally rich contexts.</p>
      <p>By bringing together researchers, VATIKA 2025 highlights the role of language technologies in
making Indian cultural heritage more accessible. It reminds us that beyond metrics and models, the
ultimate goal is to create systems that enrich visitors’ journeys (yatra), preserve the stories of a timeless
city, and foster innovation in the growing field of domain-specific question answering systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. VATIKA Task Description</title>
      <p>The VATIKA 2025 shared task focuses on building a QA system to assist tourists in navigating Varanasi,
with Hindi as the main language of interaction. Its aim is to design and evaluate systems that respond
to visitors’ questions, such as the timings of the Ganga Aarti, directions to a temple or museum, or the
nearest food court. By grounding the task in such authentic needs, VATIKA connects computational
research to the lived realities of tourism.</p>
      <p>
        The VATIKA dataset, a part of the Manually Created Hindi Question Answer Dataset (MCHQAD)
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], is extended to reflect real-world scenarios rather than artificial templates. Covering domains such as
ghats, temples, ashrams, museums, food, travel agencies, and general guidance, the dataset captures the
variety and richness of tourist queries. Emphasizing Hindi addresses the needs of domestic travelers
while also filling a gap in resources, which are often English-centric or culturally detached.
      </p>
      <p>As shown in Figure 1, the dataset is provided in a structured JSON format organized hierarchically as
domain → context → QAs. The VATIKA dataset spans ten domains: Ganga Aarti, Cruise, Food Court,
Public Toilet, Kund, Museum, Travel Agencies, Ashram, Temple, and General Queries. Each domain
contains context passages, natural Hindi questions, and their corresponding answers. The dataset is
released in four splits—Train, Validation, Test-A, and Test-B. The statistics of these splits, along with
domain-wise distributions of contexts and QA pairs, are presented in Table 1.</p>
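      <p>For illustration, a minimal sketch of one domain entry in this hierarchy is given below. The field names are assumptions inferred from the description above, not the exact keys of the released files.</p>
      <preformat>
{
  "Temple": [
    {
      "context": "... Hindi passage about a temple ...",
      "qas": [
        {
          "question": "... Hindi question about the passage ...",
          "answer": "... Hindi answer grounded in the passage ..."
        }
      ]
    }
  ]
}
      </preformat>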
    </sec>
    <sec id="sec-3">
      <title>3. Methodology and Results</title>
      <p>A total of ten teams participated in the shared task, of which eight submitted working notes. The
system ranking for VATIKA is determined based on the QA-F1 score, with VA-BO-INTERN (Run-3),
IReL (Run-3), and Scaler (Run-1) achieving the first, second, and third positions,
respectively. The methodologies adopted by each team and their corresponding results are summarized
in this section.</p>
      <p>
        IIIT SURAT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] employed a retriever-reader framework centered on a pre-trained IndicBERT
model. To ensure input consistency, Hindi text normalization was performed using the indic-nlp-library,
followed by the alignment of character-level answer boundaries to token-level indices. The model
was fine-tuned for the extractive QA task via the AutoModelForQuestionAnswering architecture,
with the optimizer managed by the Hugging Face Trainer API. For inference, the system integrates FAISS-based semantic
search to retrieve relevant contexts. The model subsequently predicts the optimal start and end token
spans, which are decoded into surface text, supplemented by a fallback mechanism for low-confidence
queries. They demonstrated consistent performance across all three submitted runs. Each run achieves
a BLEU-4 score of 0.2, with minimal variation in the associated metrics, indicating highly stable model
behavior. The corresponding F1 scores are uniformly low, with Run-1, Run-2, and Run-3 all registering
0.0061 for the primary F1 measure, reflecting limited accuracy in the predicted outputs.
      </p>
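      <p>As a rough sketch of the reader step, the snippet below shows span prediction and decoding with AutoModelForQuestionAnswering. The checkpoint name and the fallback rule are illustrative assumptions, not the team's exact configuration.</p>
      <preformat>
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative checkpoint; the team fine-tuned IndicBERT on the VATIKA data.
MODEL = "ai4bharat/indic-bert"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)

def extract_answer(question: str, context: str) -> str:
    inputs = tok(question, context, return_tensors="pt",
                 truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**inputs)
    start = int(out.start_logits.argmax())
    end = int(out.end_logits.argmax())
    if end &lt; start:  # crude stand-in for the low-confidence fallback
        return ""
    return tok.decode(inputs["input_ids"][0][start:end + 1],
                      skip_special_tokens=True)
      </preformat>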
      <p>
        NLP_Fusion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] fine-tuned the mT5-small model on the provided data. They submitted a
single run that achieved a BLEU-4 score of 3.5, indicating limited fluency and n-gram overlap with the
reference texts. The F1 score of approximately 0.28 reflects moderate answer accuracy but suggests
room for improvement.
      </p>
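      <p>A hedged sketch of such a setup is shown below, framing QA as text-to-text generation with mT5-small; the prompt format and hyperparameters are illustrative assumptions.</p>
      <preformat>
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Placeholder records; the actual runs used the VATIKA train split.
train_ds = Dataset.from_list([
    {"question": "...", "context": "...", "answer": "..."},
])

def preprocess(ex):
    # Text-to-text framing: question plus context in, gold answer out.
    enc = tok("question: " + ex["question"] + " context: " + ex["context"],
              truncation=True, max_length=512)
    enc["labels"] = tok(text_target=ex["answer"], truncation=True,
                        max_length=64)["input_ids"]
    return enc

args = Seq2SeqTrainingArguments(output_dir="mt5-vatika", num_train_epochs=5,
                                per_device_train_batch_size=8,
                                learning_rate=3e-4)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=train_ds.map(preprocess),
                         data_collator=DataCollatorForSeq2Seq(tok, model=model))
trainer.train()
      </preformat>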
      <p>
        VA-BO-INTERN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] investigated the efficacy of synthetic data augmentation for Long-Form
Question Answering (LFQA) using Small Language Models (SLMs). The team employed large teacher
models, specifically Llama-3.1-70B and Phi-4-14B, to generate synthetic QA pairs via few-shot
prompting on training contexts. Three fine-tuning strategies were evaluated: a baseline Llama-3.1-8B
trained solely on gold data (M1); a continued fine-tuning approach (M2) in which M1 was further trained
on Phi-4-14B synthetic data; and a multi-source strategy (M3) training on a composite dataset of real
instances plus synthetic samples from both teacher models. To address script-specific challenges, the
tokenizer was optimized for Hindi character handling. VA-BO-INTERN exhibited a clear and consistent
improvement across their three runs, with BLEU-4 scores increasing from 12.5 in Run-1 to 20.6 in
Run-3, indicating enhanced fluency and n-gram alignment with reference texts. Their F1 scores also
remain strong and stable, peaking at 0.5757 in the final run, which reflects accurate and reliable answer
prediction.
      </p>
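      <p>The core augmentation step can be sketched as follows; the prompt template and the served teacher checkpoint are illustrative assumptions rather than the team's exact setup.</p>
      <preformat>
from transformers import pipeline

# Illustrative teacher; the team used Llama-3.1-70B and Phi-4-14B.
teacher = pipeline("text-generation",
                   model="meta-llama/Llama-3.1-8B-Instruct")

TEMPLATE = """Write one new Hindi question and its answer, grounded in the context.

Context: {demo_ctx}
Question: {demo_q}
Answer: {demo_a}

Context: {ctx}
Question:"""

def synthesize(ctx: str, demo: dict) -> str:
    # Few-shot prompt: one gold exemplar, then the target training context.
    prompt = TEMPLATE.format(demo_ctx=demo["context"], demo_q=demo["question"],
                             demo_a=demo["answer"], ctx=ctx)
    out = teacher(prompt, max_new_tokens=128, do_sample=True,
                  return_full_text=False)
    return out[0]["generated_text"]  # continuation holding the new QA pair
      </preformat>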
      <p>
        Scaler [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a hybrid encoder-decoder framework designed to decouple understanding
and generation. The system utilizes l3cube-pune/hindi-bert-v2 as an encoder for Hindi text
representation, connected via a linear projection layer to a decoder (ai4bharat/IndicBART) for natural
language generation. This end-to-end architecture is further augmented with a NER module to explicitly
identify entity spans within the context, enhancing interpretability. The Scaler team exhibited a gradual
decline in performance across their three runs. BLEU scores consistently decreased, with BLEU-4
dropping from 22.5 in Run-1 to 5.9 in Run-3, indicating a reduction in n-gram overlap and fluency with
the reference texts. The QA-F1 score also declined notably, from 0.5050 in Run-1 to 0.3518 in Run-3,
suggesting a decrease in the accuracy and reliability of answer prediction.
      </p>
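      <p>A minimal sketch of this encoder-projection-decoder wiring is given below, assuming the projected BERT states are passed to IndicBART as precomputed encoder outputs; this is an interpretation of the described architecture, not the team's code.</p>
      <preformat>
import torch
from transformers import AutoModel, AutoTokenizer, MBartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

enc_tok = AutoTokenizer.from_pretrained("l3cube-pune/hindi-bert-v2")
encoder = AutoModel.from_pretrained("l3cube-pune/hindi-bert-v2")
dec_tok = AutoTokenizer.from_pretrained("ai4bharat/IndicBART", use_fast=False)
decoder = MBartForConditionalGeneration.from_pretrained("ai4bharat/IndicBART")

# Linear bridge from the BERT hidden size to the IndicBART model dimension.
proj = torch.nn.Linear(encoder.config.hidden_size, decoder.config.d_model)

def generate_answer(question: str, context: str) -> str:
    enc_in = enc_tok(question, context, return_tensors="pt",
                     truncation=True, max_length=512)
    hidden = encoder(**enc_in).last_hidden_state
    bridged = BaseModelOutput(last_hidden_state=proj(hidden))
    ids = decoder.generate(encoder_outputs=bridged,
                           attention_mask=enc_in["attention_mask"],
                           max_new_tokens=64)
    return dec_tok.decode(ids[0], skip_special_tokens=True)
      </preformat>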
      <p>
        IReL [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] explored a multi-paradigm approach, implementing three distinct strategies: (1) a generative
method fine-tuning mT5 for multilingual adaptability; (2) a span-based extractive approach utilizing
XLM-RoBERTa, supplemented by post-processing heuristics to refine short-span predictions; and (3) a
zero-shot baseline leveraging ChatGPT with batch-wise prompt engineering to establish a comparative
benchmark against the supervised models. Across the three IReL submissions, Run-3 achieved the
strongest overall performance, outperforming the team’s other runs on all BLEU and ROUGE metrics as
well as QA-F1. Specifically, it obtained the highest BLEU-1 (61.5), BLEU-2 (36.4), BLEU-3 (24.5), and
BLEU-4 (17.9) scores, indicating superior n-gram precision. This trend was consistent in the ROUGE
measures, where Run-3 yielded the highest ROUGE-1 (0.0824), ROUGE-2 (0.0467), and ROUGE-L
(0.0824) scores, reflecting better recall-oriented text overlap. Furthermore, it achieved the highest QA-F1 score
(0.5507), indicating stronger relevance and accuracy of the answers.
      </p>
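      <p>The zero-shot baseline can be sketched roughly as follows; the model name, batching granularity, and prompt wording are assumptions for illustration.</p>
      <preformat>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def zero_shot_batch(context: str, questions: list[str]) -> str:
    # Batch-wise prompting: several questions about one context per request.
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    prompt = ("Answer the following Hindi questions strictly from the given "
              "context, one numbered Hindi answer per line.\n"
              f"Context: {context}\nQuestions:\n{numbered}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative stand-in for "ChatGPT"
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
      </preformat>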
      <p>
        CSE_SVNIT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] focused on static embedding architectures to model semantic similarity. The
approach leveraged pre-trained FastText embeddings to generate 300-dimensional vectors, aggregated
into sentence-level representations. These vectors were utilized in two configurations: unsupervised
retrieval via cosine similarity to identify relevant contexts and a supervised ridge regression model for
answer span prediction. Additionally, Word2Vec embeddings were employed to encode dense semantic
vectors, providing a comparative basis for context alignment tasks. They showed a declining trend
in BLEU-4 scores across their three runs, dropping from 10.8 in Run-1 to 7.6 in Run-3, indicating a
reduction in n-gram overlap and fluency with reference texts. Similarly, their F1 scores decreased from
0.4329 in Run-1 to 0.2799 in Run-3, reflecting a decline in answer accuracy and consistency. Despite
this, Run-3 shows a slight increase in precision and recall metrics, suggesting some improvement in
specific aspects of model output quality.
      </p>
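      <p>The unsupervised retrieval configuration can be sketched as below, assuming pre-trained Hindi FastText vectors averaged into sentence embeddings; the file name and normalization details are illustrative.</p>
      <preformat>
import numpy as np
import fasttext
import fasttext.util

# Pre-trained 300-dimensional Hindi vectors (downloads cc.hi.300.bin).
fasttext.util.download_model("hi", if_exists="ignore")
ft = fasttext.load_model("cc.hi.300.bin")

def embed(text: str) -> np.ndarray:
    # get_sentence_vector averages normalized word vectors over the sentence.
    v = ft.get_sentence_vector(text.replace("\n", " "))
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(question: str, contexts: list[str]) -> str:
    # Unsupervised retrieval: rank candidate contexts by cosine similarity.
    q = embed(question)
    sims = [float(q @ embed(c)) for c in contexts]
    return contexts[int(np.argmax(sims))]
      </preformat>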
      <p>
        AiNauts [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] concentrated on fine-tuning large pre-trained multilingual models, specifically
mBART50 and mT5-small. The preprocessing pipeline involved concatenating the question and context into
a single sequence, truncated to a maximum length of 512 tokens. The models were optimized to
leverage their encoder-decoder attention mechanisms for extracting and generating answers from
the provided Hindi contexts. Between the two AiNauts submissions, Run-1 demonstrated stronger
performance across most evaluation metrics, particularly in ROUGE and QA-F1. Although Run-2
achieved higher BLEU-2 (33.2), BLEU-3 (25.5), and BLEU-4 (19.6) scores, indicating improved multi-gram
precision, Run-1 achieved a markedly higher QA-F1 score (0.4529) compared to Run-2 (0.1069),
suggesting considerably better answer accuracy and semantic alignment.
      </p>
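      <p>A hedged inference sketch for the mBART50 variant is shown below: question and context are concatenated into one sequence capped at 512 tokens, as described above. The base checkpoint and separator are assumptions; the submitted systems were fine-tuned.</p>
      <preformat>
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50",
                                           src_lang="hi_IN", tgt_lang="hi_IN")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

def answer(question: str, context: str) -> str:
    # Single input sequence: the question concatenated with its context.
    inputs = tok(question + " " + context, return_tensors="pt",
                 truncation=True, max_length=512)
    ids = model.generate(**inputs, max_new_tokens=64,
                         forced_bos_token_id=tok.lang_code_to_id["hi_IN"])
    return tok.decode(ids[0], skip_special_tokens=True)
      </preformat>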
      <p>
        MUCS [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] fine-tunes the MuRIL model on the dataset using a structured pipeline consisting of
dataset preparation, preprocessing, and multiple training strategies. Preprocessing employs the MuRIL
tokenizer with sequence-length constraints, sliding windows for long contexts, token-level mapping
of answer spans, and padding with attention masks. Fine-tuning adds a QA-specific linear output
layer to MuRIL to predict the answer span, i.e., start and end positions, while the base architecture remains
unchanged. Three training strategies are examined: (1) the Hugging Face Trainer, which automates
optimization and training workflows; (2) a custom AdamW training loop that provides explicit control
over model updates; and (3) a simplified Trainer variant that performs minimal fine-tuning without
evaluation or logging. This setup enables comparison of training efficiency and performance across
different fine-tuning approaches. Among the three MUCS submissions, Run-1 delivered the most
balanced and overall strongest performance. It achieved the highest BLEU-1 (36.7), BLEU-3 (13.8), and
BLEU-4 (10.1) scores, along with the highest ROUGE-1 (0.0759), ROUGE-2 (0.0438), and ROUGE-L (0.0759)
values, indicating superior lexical overlap and recall-driven text similarity. Run-2 showed marginal
improvements over Run-1 only in BLEU-2 (22.0 vs. 20.2) and had higher BLEU-3 and BLEU-4 than
Run-3, but its ROUGE and QA-F1 scores were substantially lower, with QA-F1 dropping to 0.0416. Run-3
exhibited the weakest performance overall, particularly on BLEU metrics, where scores fell below 1 for
BLEU-2 through BLEU-4; however, its ROUGE scores remained moderately comparable to the other
systems.
      </p>
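      <p>The sliding-window and span-mapping step can be sketched as follows, using the tokenizer's overflow and offset-mapping features; the window size, stride, and CLS fallback are illustrative assumptions.</p>
      <preformat>
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

def make_features(question, context, ans_start, ans_text,
                  max_len=384, stride=128):
    # Sliding windows over long contexts, keeping character offsets.
    enc = tok(question, context, truncation="only_second", max_length=max_len,
              stride=stride, return_overflowing_tokens=True,
              return_offsets_mapping=True, padding="max_length")
    ans_end = ans_start + len(ans_text)
    features = []
    for i, offsets in enumerate(enc["offset_mapping"]):
        seq_ids = enc.sequence_ids(i)
        start_tok = end_tok = 0  # index 0 (CLS) if the span is absent here
        for t, (s, e) in enumerate(offsets):
            if seq_ids[t] != 1:  # consider context tokens only
                continue
            if s &lt;= ans_start and ans_start &lt; e:
                start_tok = t
            if s &lt; ans_end and ans_end &lt;= e:
                end_tok = t
        features.append({"input_ids": enc["input_ids"][i],
                         "attention_mask": enc["attention_mask"][i],
                         "start_positions": start_tok,
                         "end_positions": end_tok})
    return features
      </preformat>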
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The VATIKA-2025 shared task provided a comprehensive platform for evaluating knowledge-grounded
answer generation systems in Indic languages. The diversity of participating teams and methodologies
highlights the growing interest in culturally anchored NLP tasks and the rapid evolution of models
capable of reasoning over narrative contexts. The evaluation results show that systems leveraging larger
pre-trained language models or hybrid architectures consistently outperformed traditional baselines,
achieving higher BLEU, ROUGE, and QA-F1 scores. Among all participants, VA-BO-INTERN (Run-3)
attained the highest QA-F1 score of 0.5757, followed by IReL (Run-3) and Scaler (Run-1), demonstrating
strong capability in producing contextually relevant and semantically accurate responses. At the same
time, several submissions with lower performance highlight ongoing challenges in handling long
contexts, maintaining semantic consistency, and generating fluent responses in Indic languages. Overall,
VATIKA-2025 offers valuable insights into current system strengths and limitations, establishes new
performance benchmarks, and provides clear directions for future research, particularly in enhancing
reasoning abilities, cultural grounding, and cross-lingual generalization in Indian-language NLP systems.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>While writing this paper, we employed a generative AI assistant only in a limited way to facilitate the
writing process. The AI was mostly used to help refine the language, structure sections, and
maintain consistency in LaTeX formatting.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>We thank Banaras Hindu University, Varanasi for providing the grant as a part of Transdisciplinary
Research Grant, Institute of Eminence. We also thank the annotators Shreya Pandey, Bhaskar Singh,
Aman Gupta, Himesh Jee Amar, Abhilasha Gupta, and others for their help in creating the
VATIKA dataset. We thank Supriya Chauhan, Iram Ali Ahmad, Jyoti Kumari for proofreading the dataset.
We also thank Jagdeesan T, Suresh S. for the academic collaboration during the Transdisciplinary grant,
Institute of Eminence at BHU.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gatla</surname>
          </string-name>
          , Anushka,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanwar</surname>
          </string-name>
          , G. Sahoo,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Mundotiya</surname>
          </string-name>
          ,
          <article-title>Tourism question answer system in Indian language using domain-adapted foundation models</article-title>
          , arXiv preprint,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <article-title>Varanasi tourism in question answer system track: IIIT Surat @ FIRE'25 shared task</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Taljeh</surname>
          </string-name>
          ,
          <article-title>Hindi tourism QA system: low-resource question answering using mT5-small</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <article-title>VA-BO-INTERN: Adapting small language models to low-resource domains: a case study in Hindi tourism QA</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P. K. R. N.</given-names>
            <surname>Subbannagari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Velidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Madasamy</surname>
          </string-name>
          ,
          <article-title>VATIKA-QA: A hybrid BERT-IndicBART approach for Hindi question answering in the tourism domain</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tewari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaturvedi</surname>
          </string-name>
          ,
          <article-title>Tirtha: Tourism information retrieval and text-based Hindi answering</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jariwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sahu</surname>
          </string-name>
          ,
          <article-title>SVNIT_CSE: Building a question answering system for Hindi using word embedding</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Tagore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>VATIKA: A Hindi machine reading comprehension approach for Varanasi tourism question answering using mT5</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nagaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <article-title>MUCS: Question answering in Hindi for tourism: evaluation of transformer-based approaches on VATIKA</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>