<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Towards Indian Intelligent Tourism Assistance: Design and Evaluation of the VATIKA QA Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Praveen Gatla</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anushka</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nabanita Sadhukhan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajesh Kumar Mundotiya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>Department of Computer Science and Engineering, Indian Institute of Technology Bhilai</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>2</label>
          <institution>Department of Humanistic Studies, Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>3</label>
          <institution>Department of Linguistics, Faculty of Arts, Banaras Hindu University</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The VATIKA-2025 shared task aims to advance research in Indic language knowledge augmentation, focusing on generating context-aware answers grounded in culturally rich narratives. Designed as a benchmarking challenge for Indian language technologies, the VATIKA task provides participants with a carefully curated dataset and evaluates system performance through established NLG and QA metrics, including BLEU, ROUGE, and QA-F1. A total of ten teams participated in the task, of which eight submitted working notes detailing their methodologies. Submissions demonstrated substantial variation in system performance, reflecting diverse modeling strategies such as fine-tuned language models, prompted LLMs, and ensemble-based approaches. The best-performing systems, VA-BO-INTERN (Run-3), IReL (Run-3), and Scaler (Run-1), achieved QA-F1 scores of 0.5757, 0.5507, and 0.5050, respectively, showing strong competency in generating high-quality, semantically aligned responses. This overview paper presents the task design, datasets, evaluation methodology, and a detailed comparative analysis of all team submissions to provide insights into current progress and future directions for Indic knowledge-grounded NLP research.</p>
      </abstract>
      <kwd-group>
        <kwd>Question-Answer</kwd>
        <kwd>Tourism</kwd>
        <kwd>Hindi</kwd>
        <kwd>Benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Varanasi, often described as the spiritual capital of India, has immense historical, cultural, and religious
significance. Every corner of the city tells a story, whether it is the sight of pilgrims taking ritual baths
in the Ganga river at the ghats, the sound of temple bells echoing through narrow lanes, or the smell of
street food mingling with the chants of evening aarti. For first-time visitors, these experiences can
be profoundly moving yet simultaneously overwhelming, raising questions about the significance of
rituals, the history of sacred sites, or how to navigate the city’s complex spiritual geography.</p>
      <p>In this context, intelligent systems tailored for tourism can serve as valuable companions, providing
accurate, contextual, and easily understandable information in a language that resonates with users.
Considering this, the VATIKA 2025 shared task was conceived to explore the development of question
answering systems specifically for Varanasi’s tourism domain, with a focus on Hindi as the primary
language. This resource enables participants not only to benchmark their systems but also to engage
with the challenges that arise from working with low-resource languages in culturally rich contexts.</p>
      <p>By bringing together researchers, VATIKA 2025 highlights the role of language technologies in
making Indian cultural heritage more accessible. It reminds us that beyond metrics and models, the
ultimate goal is to create systems that enrich visitors’ journeys (yatra), preserve the stories of a timeless
city, and foster innovation in the growing field of domain-specific question answering systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. VATIKA Task Description</title>
      <p>The VATIKA 2025 shared task focuses on building a QA system to assist tourists in navigating Varanasi,
with Hindi as the main language of interaction. Its aim is to design and evaluate systems that respond
to visitors’ questions, such as the timings of the Ganga Aarti, directions to a temple or museum, or the
nearest food court. By grounding the task in such authentic needs, VATIKA connects computational
research to the lived realities of tourism.</p>
      <p>
        The VATIKA dataset, a part of the Manually Created Hindi Question Answer Dataset (MCHQAD)
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], is extended to reflect real-world scenarios rather than artificial templates. Covering domains such as
ghats, temples, ashrams, museums, food, travel agencies, and general guidance, the dataset captures the
variety and richness of tourist queries. Emphasizing Hindi addresses the needs of domestic travelers
while also filling a gap in resources, which are often English-centric or culturally detached.
      </p>
      <p>As shown in Figure 1, the dataset is provided in a structured JSON format organized hierarchically as
domain → context → QAs. The VATIKA dataset spans ten domains: Ganga Aarti, Cruise, Food Court,
Public Toilet, Kund, Museum, Travel Agencies, Ashram, Temple, and General Queries. Each domain
contains context passages, natural Hindi questions, and their corresponding answers. The dataset is
released in four splits—Train, Validation, Test-A, and Test-B. The statistics of these splits, along with
domain-wise distributions of contexts and QA pairs, are presented in Table 1.</p>
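      <p>For illustration, a minimal sketch of one domain entry in this hierarchy is given below. The field names are assumptions inferred from the description above, not the exact keys of the released files.</p>
      <preformat>
{
  "Temple": [
    {
      "context": "... Hindi passage about a temple ...",
      "qas": [
        {
          "question": "... Hindi question about the passage ...",
          "answer": "... Hindi answer grounded in the passage ..."
        }
      ]
    }
  ]
}
      </preformat>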
    </sec>
    <sec id="sec-3">
      <title>3. Methodology and Results</title>
      <p>A total of ten teams participated in the shared task, of which eight submitted working notes. The
system ranking for VATIKA is determined based on the QA-F1 score, with VA-BO-INTERN (Run-3),
IReL (Run-3), and Scaler (Run-1) achieving the first, second, and third positions,
respectively. The methodologies adopted by each team and their corresponding results are summarized
in this section.</p>
      <p>
        IIIT SURAT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] employed a retriever-reader framework centered on a pre-trained IndicBERT
model. To ensure input consistency, Hindi text normalization was performed using the indic-nlp-library,
followed by the alignment of character-level answer boundaries to token-level indices. The model
was fine-tuned for the extractive QA task via the AutoModelForQuestionAnswering architecture,
with the optimizer managed by the Hugging Face Trainer API. For inference, the system integrates FAISS-based semantic
search to retrieve relevant contexts. The model subsequently predicts the optimal start and end token
spans, which are decoded into surface text, supplemented by a fallback mechanism for low-confidence
queries. They demonstrated consistent performance across all three submitted runs. Each run achieves
a BLEU-4 score of 0.2, with minimal variation in the associated metrics, indicating highly stable model
behavior. The corresponding F1 scores are uniformly low, with Run-1, Run-2, and Run-3 all registering
0.0061 for the primary F1 measure, reflecting limited accuracy in the predicted outputs.
      </p>
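      <p>As a rough sketch of the reader step, the snippet below shows span prediction and decoding with AutoModelForQuestionAnswering. The checkpoint name and the fallback rule are illustrative assumptions, not the team's exact configuration.</p>
      <preformat>
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Illustrative checkpoint; the team fine-tuned IndicBERT on the VATIKA data.
MODEL = "ai4bharat/indic-bert"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL)

def extract_answer(question: str, context: str) -> str:
    inputs = tok(question, context, return_tensors="pt",
                 truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**inputs)
    start = int(out.start_logits.argmax())
    end = int(out.end_logits.argmax())
    if end &lt; start:  # crude stand-in for the low-confidence fallback
        return ""
    return tok.decode(inputs["input_ids"][0][start:end + 1],
                      skip_special_tokens=True)
      </preformat>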
      <p>
        NLP_Fusion [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] fine-tuned the mT5-small model on the provided data. They submitted a
single run that achieved a BLEU-4 score of 3.5, indicating limited fluency and n-gram overlap with the
reference texts. The F1 score of approximately 0.28 reflects moderate answer accuracy but suggests
room for improvement.
      </p>
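      <p>A hedged sketch of such a setup is shown below, framing QA as text-to-text generation with mT5-small; the prompt format and hyperparameters are illustrative assumptions.</p>
      <preformat>
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

# Placeholder records; the actual runs used the VATIKA train split.
train_ds = Dataset.from_list([
    {"question": "...", "context": "...", "answer": "..."},
])

def preprocess(ex):
    # Text-to-text framing: question plus context in, gold answer out.
    enc = tok("question: " + ex["question"] + " context: " + ex["context"],
              truncation=True, max_length=512)
    enc["labels"] = tok(text_target=ex["answer"], truncation=True,
                        max_length=64)["input_ids"]
    return enc

args = Seq2SeqTrainingArguments(output_dir="mt5-vatika", num_train_epochs=5,
                                per_device_train_batch_size=8,
                                learning_rate=3e-4)
trainer = Seq2SeqTrainer(model=model, args=args,
                         train_dataset=train_ds.map(preprocess),
                         data_collator=DataCollatorForSeq2Seq(tok, model=model))
trainer.train()
      </preformat>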
      <p>
        VA-BO-INTERN [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] investigated the efficacy of synthetic data augmentation for Long-Form
Question Answering (LFQA) using Small Language Models (SLMs). The team employed large teacher
models, specifically Llama-3.1-70B and Phi-4-14B, to generate synthetic QA pairs via few-shot
prompting on training contexts. Three fine-tuning strategies were evaluated: a baseline Llama-3.1-8B
trained solely on gold data (M1); a continued fine-tuning approach (M2) in which M1 was further trained
on Phi-4-14B synthetic data; and a multi-source strategy (M3) training on a composite dataset of real
instances plus synthetic samples from both teacher models. To address script-specific challenges, the
tokenizer was optimized for Hindi character handling. VA-BO-INTERN exhibited a clear and consistent
improvement across their three runs, with BLEU-4 scores increasing from 12.5 in Run-1 to 20.6 in
Run-3, indicating enhanced fluency and n-gram alignment with reference texts. Their F1 scores also
remain strong and stable, peaking at 0.5757 in the final run, which reflects accurate and reliable answer
prediction.
      </p>
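      <p>The core augmentation step can be sketched as follows; the prompt template and the served teacher checkpoint are illustrative assumptions rather than the team's exact setup.</p>
      <preformat>
from transformers import pipeline

# Illustrative teacher; the team used Llama-3.1-70B and Phi-4-14B.
teacher = pipeline("text-generation",
                   model="meta-llama/Llama-3.1-8B-Instruct")

TEMPLATE = """Write one new Hindi question and its answer, grounded in the context.

Context: {demo_ctx}
Question: {demo_q}
Answer: {demo_a}

Context: {ctx}
Question:"""

def synthesize(ctx: str, demo: dict) -> str:
    # Few-shot prompt: one gold exemplar, then the target training context.
    prompt = TEMPLATE.format(demo_ctx=demo["context"], demo_q=demo["question"],
                             demo_a=demo["answer"], ctx=ctx)
    out = teacher(prompt, max_new_tokens=128, do_sample=True,
                  return_full_text=False)
    return out[0]["generated_text"]  # continuation holding the new QA pair
      </preformat>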
      <p>
        Scaler [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed a hybrid encoder-decoder framework designed to decouple understanding
and generation. The system utilizes l3cube-pune/hindi-bert-v2 as an encoder for Hindi text
representation, connected via a linear projection layer to a decoder (ai4bharat/IndicBART) for natural
language generation. This end-to-end architecture is further augmented with a NER module to explicitly
identify entity spans within the context, enhancing interpretability. The Scaler team exhibited a gradual
decline in performance across their three runs. BLEU scores consistently decreased, with BLEU-4
dropping from 22.5 in Run-1 to 5.9 in Run-3, indicating a reduction in n-gram overlap and fluency with
the reference texts. The QA-F1 score also declined notably, from 0.5050 in Run-1 to 0.3518 in Run-3,
suggesting a decrease in the accuracy and reliability of answer prediction.
      </p>
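      <p>A minimal sketch of this encoder-projection-decoder wiring is given below, assuming the projected BERT states are passed to IndicBART as precomputed encoder outputs; this is an interpretation of the described architecture, not the team's code.</p>
      <preformat>
import torch
from transformers import AutoModel, AutoTokenizer, MBartForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

enc_tok = AutoTokenizer.from_pretrained("l3cube-pune/hindi-bert-v2")
encoder = AutoModel.from_pretrained("l3cube-pune/hindi-bert-v2")
dec_tok = AutoTokenizer.from_pretrained("ai4bharat/IndicBART", use_fast=False)
decoder = MBartForConditionalGeneration.from_pretrained("ai4bharat/IndicBART")

# Linear bridge from the BERT hidden size to the IndicBART model dimension.
proj = torch.nn.Linear(encoder.config.hidden_size, decoder.config.d_model)

def generate_answer(question: str, context: str) -> str:
    enc_in = enc_tok(question, context, return_tensors="pt",
                     truncation=True, max_length=512)
    hidden = encoder(**enc_in).last_hidden_state
    bridged = BaseModelOutput(last_hidden_state=proj(hidden))
    ids = decoder.generate(encoder_outputs=bridged,
                           attention_mask=enc_in["attention_mask"],
                           max_new_tokens=64)
    return dec_tok.decode(ids[0], skip_special_tokens=True)
      </preformat>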
      <p>
        IReL [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] explored a multi-paradigm approach, implementing three distinct strategies: (1) a generative
method fine-tuning mT5 for multilingual adaptability; (2) a span-based extractive approach utilizing
XLM-RoBERTa, supplemented by post-processing heuristics to refine short-span predictions; and (3) a
zero-shot baseline leveraging ChatGPT with batch-wise prompt engineering to establish a comparative
benchmark against the supervised models. Across the three IReL submissions, Run-3 achieved the
strongest overall performance, outperforming the team’s other runs on all BLEU and ROUGE metrics as
well as QA-F1. Specifically, it obtained the highest BLEU-1 (61.5), BLEU-2 (36.4), BLEU-3 (24.5), and
BLEU-4 (17.9) scores, indicating superior n-gram precision. This trend was consistent in the ROUGE
measures, where Run-3 yielded the highest ROUGE-1 (0.0824), ROUGE-2 (0.0467), and ROUGE-L
(0.0824) scores, reflecting better recall-oriented text overlap. Furthermore, it achieved the highest QA-F1 score
(0.5507), indicating stronger relevance and accuracy of the answers.
      </p>
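      <p>The zero-shot baseline can be sketched roughly as follows; the model name, batching granularity, and prompt wording are assumptions for illustration.</p>
      <preformat>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def zero_shot_batch(context: str, questions: list[str]) -> str:
    # Batch-wise prompting: several questions about one context per request.
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(questions))
    prompt = ("Answer the following Hindi questions strictly from the given "
              "context, one numbered Hindi answer per line.\n"
              f"Context: {context}\nQuestions:\n{numbered}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative stand-in for "ChatGPT"
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
      </preformat>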
      <p>
        CSE_SVNIT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] focused on static embedding architectures to model semantic similarity. The
approach leveraged pre-trained FastText embeddings to generate 300-dimensional vectors, aggregated
into sentence-level representations. These vectors were utilized in two configurations: unsupervised
retrieval via cosine similarity to identify relevant contexts and a supervised ridge regression model for
answer span prediction. Additionally, Word2Vec embeddings were employed to encode dense semantic
vectors, providing a comparative basis for context alignment tasks. They showed a declining trend
in BLEU-4 scores across their three runs, dropping from 10.8 in Run-1 to 7.6 in Run-3, indicating a
reduction in n-gram overlap and fluency with reference texts. Similarly, their F1 scores decreased from
0.4329 in Run-1 to 0.2799 in Run-3, reflecting a decline in answer accuracy and consistency. Despite
this, Run-3 shows a slight increase in precision and recall metrics, suggesting some improvement in
specific aspects of model output quality.
      </p>
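      <p>The unsupervised retrieval configuration can be sketched as below, assuming pre-trained Hindi FastText vectors averaged into sentence embeddings; the file name and normalization details are illustrative.</p>
      <preformat>
import numpy as np
import fasttext
import fasttext.util

# Pre-trained 300-dimensional Hindi vectors (downloads cc.hi.300.bin).
fasttext.util.download_model("hi", if_exists="ignore")
ft = fasttext.load_model("cc.hi.300.bin")

def embed(text: str) -> np.ndarray:
    # get_sentence_vector averages normalized word vectors over the sentence.
    v = ft.get_sentence_vector(text.replace("\n", " "))
    return v / (np.linalg.norm(v) + 1e-9)

def retrieve(question: str, contexts: list[str]) -> str:
    # Unsupervised retrieval: rank candidate contexts by cosine similarity.
    q = embed(question)
    sims = [float(q @ embed(c)) for c in contexts]
    return contexts[int(np.argmax(sims))]
      </preformat>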
      <p>
        AiNauts [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] concentrated on fine-tuning large pre-trained multilingual models, specifically
mBART50 and mT5-small. The preprocessing pipeline involved concatenating the question and context into
a single sequence, truncated to a maximum length of 512 tokens. The models were optimized to
leverage their encoder-decoder attention mechanisms for extracting and generating answers from
the provided Hindi contexts. Between the two AiNauts submissions, Run-1 demonstrated stronger
performance across most evaluation metrics, particularly in ROUGE and QA-F1. Although Run-2
achieved higher BLEU-2 (33.2), BLEU-3 (25.5), and BLEU-4 (19.6) scores, indicating improved multi-gram
precision, Run-1 achieved a markedly higher QA-F1 score (0.4529) compared to Run-2 (0.1069),
suggesting considerably better answer accuracy and semantic alignment.
      </p>
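      <p>A hedged inference sketch for the mBART50 variant is shown below: question and context are concatenated into one sequence capped at 512 tokens, as described above. The base checkpoint and separator are assumptions; the submitted systems were fine-tuned.</p>
      <preformat>
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

tok = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50",
                                           src_lang="hi_IN", tgt_lang="hi_IN")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50")

def answer(question: str, context: str) -> str:
    # Single input sequence: the question concatenated with its context.
    inputs = tok(question + " " + context, return_tensors="pt",
                 truncation=True, max_length=512)
    ids = model.generate(**inputs, max_new_tokens=64,
                         forced_bos_token_id=tok.lang_code_to_id["hi_IN"])
    return tok.decode(ids[0], skip_special_tokens=True)
      </preformat>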
      <p>
        MUCS [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] fine-tunes the MuRIL model on the dataset using a structured pipeline consisting of
dataset preparation, preprocessing, and multiple training strategies. Preprocessing employs the MuRIL
tokenizer with sequence-length constraints, sliding windows for long contexts, token-level mapping
of answer spans, and padding with attention masks. Fine-tuning adds a QA-specific linear output
layer to MuRIL to predict the answer span, i.e., start and end positions, while the base architecture remains
unchanged. Three training strategies are examined: (1) the Hugging Face Trainer, which automates
optimization and training workflows; (2) a custom AdamW training loop that provides explicit control
over model updates; and (3) a simplified Trainer variant that performs minimal fine-tuning without
evaluation or logging. This setup enables comparison of training efficiency and performance across
different fine-tuning approaches. Among the three MUCS submissions, Run-1 delivered the most
balanced and overall strongest performance. It achieved the highest BLEU-1 (36.7), BLEU-3 (13.8), and
BLEU-4 (10.1) scores, along with the highest ROUGE-1 (0.0759), ROUGE-2 (0.0438), and ROUGE-L (0.0759)
values, indicating superior lexical overlap and recall-driven text similarity. Run-2 showed marginal
improvements over Run-1 only in BLEU-2 (22.0 vs. 20.2) and had higher BLEU-3 and BLEU-4 than
Run-3, but its ROUGE and QA-F1 scores were substantially lower, with QA-F1 dropping to 0.0416. Run-3
exhibited the weakest performance overall, particularly on BLEU metrics, where scores fell below 1 for
BLEU-2 through BLEU-4; however, its ROUGE scores remained moderately comparable to the other
systems.
      </p>
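      <p>The sliding-window and span-mapping step can be sketched as follows, using the tokenizer's overflow and offset-mapping features; the window size, stride, and CLS fallback are illustrative assumptions.</p>
      <preformat>
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/muril-base-cased")

def make_features(question, context, ans_start, ans_text,
                  max_len=384, stride=128):
    # Sliding windows over long contexts, keeping character offsets.
    enc = tok(question, context, truncation="only_second", max_length=max_len,
              stride=stride, return_overflowing_tokens=True,
              return_offsets_mapping=True, padding="max_length")
    ans_end = ans_start + len(ans_text)
    features = []
    for i, offsets in enumerate(enc["offset_mapping"]):
        seq_ids = enc.sequence_ids(i)
        start_tok = end_tok = 0  # index 0 (CLS) if the span is absent here
        for t, (s, e) in enumerate(offsets):
            if seq_ids[t] != 1:  # consider context tokens only
                continue
            if s &lt;= ans_start and ans_start &lt; e:
                start_tok = t
            if s &lt; ans_end and ans_end &lt;= e:
                end_tok = t
        features.append({"input_ids": enc["input_ids"][i],
                         "attention_mask": enc["attention_mask"][i],
                         "start_positions": start_tok,
                         "end_positions": end_tok})
    return features
      </preformat>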
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>The VATIKA-2025 shared task provided a comprehensive platform for evaluating knowledge-grounded
answer generation systems in Indic languages. The diversity of participating teams and methodologies
highlights the growing interest in culturally anchored NLP tasks and the rapid evolution of models
capable of reasoning over narrative contexts. The evaluation results show that systems leveraging larger
pre-trained language models or hybrid architectures consistently outperformed traditional baselines,
achieving higher BLEU, ROUGE, and QA-F1 scores. Among all participants, VA-BO-INTERN (Run-3)
attained the highest QA-F1 score of 0.5757, followed by IReL (Run-3) and Scaler (Run-1), demonstrating
strong capability in producing contextually relevant and semantically accurate responses. At the same
time, several submissions with lower performance highlight ongoing challenges in handling long
contexts, maintaining semantic consistency, and generating fluent responses in Indic languages. Overall,
VATIKA-2025 offers valuable insights into current system strengths and limitations, establishes new
performance benchmarks, and provides clear directions for future research, particularly in enhancing
reasoning abilities, cultural grounding, and cross-lingual generalization in Indian-language NLP systems.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>While writing this paper, we employed a generative AI assistant only in a limited way to facilitate the
writing process. The AI was mostly used to help refine the language, structure sections, and
maintain consistency in LaTeX formatting.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>We thank Banaras Hindu University, Varanasi for providing the grant as a part of Transdisciplinary
Research Grant, Institute of Eminence. We also thank the annotators Shreya Pandey, Bhaskar Singh,
Aman Gupta, Himesh Jee Amar, Abhilasha Gupta, and others for their help in creating the
VATIKA dataset. We thank Supriya Chauhan, Iram Ali Ahmad, Jyoti Kumari for proofreading the dataset.
We also thank Jagdeesan T, Suresh S. for the academic collaboration during the Transdisciplinary grant,
Institute of Eminence at BHU.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gatla</surname>
          </string-name>
          , Anushka,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanwar</surname>
          </string-name>
          , G. Sahoo,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Mundotiya</surname>
          </string-name>
          ,
          <article-title>Tourism question answer system in Indian language using domain-adapted foundation models</article-title>
          , arXiv preprint,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <article-title>Varanasi tourism in question answer system track: IIIT Surat @ FIRE'25 shared task</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Coelho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Shetty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Taljeh</surname>
          </string-name>
          ,
          <article-title>Hindi tourism QA system: low-resource question answering using mT5-small</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <article-title>VA-BO-INTERN: Adapting small language models to low-resource domains: a case study in Hindi tourism QA</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P. K. R. N.</given-names>
            <surname>Subbannagari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Velidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Madasamy</surname>
          </string-name>
          ,
          <article-title>VATIKA-QA: A hybrid BERT-IndicBART approach for Hindi question answering in the tourism domain</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tewari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaturvedi</surname>
          </string-name>
          ,
          <article-title>Tirtha: Tourism information retrieval and text-based Hindi answering</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jariwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Sahu</surname>
          </string-name>
          ,
          <article-title>SVNIT_CSE: Building a question answering system for Hindi using word embedding</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Tagore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>VATIKA: A Hindi machine reading comprehension approach for Varanasi tourism question answering using mT5</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nagaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. L.</given-names>
            <surname>Shashirekha</surname>
          </string-name>
          ,
          <article-title>MUCS: Question answering in Hindi for tourism: evaluation of transformer-based approaches on VATIKA</article-title>
          , in: Working Notes of FIRE 2025 -
          <article-title>Forum for Information Retrieval Evaluation</article-title>
          , CEUR Workshop Proceedings, CEUR-WS.org, Varanasi, India,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>