<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TIRTHA: Tourism Information Retrieval and Text-based Hindi Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishna Tewari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supriya Chanda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aarya Chaturvedi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bennett University</institution>
          ,
          <addr-line>Greater Noida</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Hindi Tourism QA (HTQA) addresses the challenge of extracting precise answers from Hindi context paragraphs within the specialized tourism domain of Varanasi, where limited annotated resources and complex linguistic structures pose significant hurdles. As part of the FIRE 2025 VATIKA shared task, which focuses on Hindi-language QA, we developed and evaluated multiple QA approaches using a structured dataset consisting of context-question pairs in JSON format. Three main strategies were explored: (i) fine-tuning the multilingual mT5 model, which demonstrated reasonable language support but occasionally produced fallback answers; (ii) span-based extractive modeling using XLM-RoBERTa, enhanced with post-processing techniques to refine short-span predictions; and (iii) a zero-shot approach leveraging ChatGPT with batch-wise prompt engineering applied over 50 context-question pairs purely for comparative analysis. Evaluation was performed using BLEU (1-4), ROUGE-L, and QA-F1 metrics. While ChatGPT achieved higher metric scores, only open-source models are considered for leaderboard results; hence, the ChatGPT results are reported separately as an ablation.</p>
      </abstract>
      <kwd-group>
        <kwd>QA</kwd>
        <kwd>Extractive QA</kwd>
        <kwd>XLM-RoBERTa</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>Zero-shot Learning</kwd>
        <kwd>Tourism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rich cultural and spiritual heritage of Varanasi, also known as Kashi, makes it one of the world’s
oldest living cities and a prominent pilgrimage destination in India. Renowned for its sacred kunds,
temples, and ghats, the city attracts millions of tourists and devotees each year. However, most
information about these landmarks exists in unstructured textual formats, which poses significant barriers
for Hindi-speaking visitors seeking concise, accurate, and reliable knowledge.</p>
      <p>
        Hindi-language question answering (QA) systems offer a solution by automatically extracting precise answers from
large bodies of text, enabling efficient information retrieval for end-users. A typical QA task involves
processing a question Q = (q_1, q_2, …, q_n) posed in natural Hindi and retrieving the correct answer
span from a given context paragraph C = (c_1, c_2, …, c_m). However, several challenges complicate this
process in low-resource and specialized domains. First, the lack of large-scale annotated datasets in
Hindi limits supervised training of robust models [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Second, domain-specific variability in phrasing,
complex syntactic structures, and culturally grounded concepts further increase modeling difficulty
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Third, ambiguity in question formulation and answer granularity creates additional hurdles in
achieving precise and reliable retrieval [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        To advance research in this direction, the Forum for Information Retrieval Evaluation (FIRE)
introduced the VATIKA shared task in 2025 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], focusing on Hindi QA in the tourism domain of Varanasi.
The dataset comprises structured JSON instances pairing context passages about sacred sites with
corresponding questions, providing a valuable benchmark for systematic development and evaluation of
QA systems.
      </p>
      <p>
        In this work, we benchmark multiple QA approaches in this culturally rich, low-resource setting,
including transformer-based fine-tuning and zero-shot prompting strategies using large language models
as comparative analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our study demonstrates the potential of these approaches and highlights
key challenges in building robust Hindi QA systems for specialized domains, pointing toward
promising directions for future research.
      </p>
      <p>The rest of the paper is structured as follows: Section 2 discusses related work; Section 3 describes
the dataset; Section 4 presents the proposed methodology; Section 5 reports results and analysis; and
Section 6 concludes with key findings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The field of QA has been fundamentally reshaped by the introduction of the Transformer
architecture, which enabled large pre-trained language models (PLMs) to excel across NLP tasks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Early
breakthroughs such as BERT established the dominant pre-train and fine-tune paradigm, learning rich
contextual representations from vast text corpora to achieve state-of-the-art performance on many
language tasks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Two primary paradigms for QA have emerged. Extractive QA, popularized by benchmark datasets
like SQuAD [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], formulates the task as span prediction over a context paragraph. Cross-lingual
transformers such as XLM-RoBERTa have demonstrated strong performance in this space by enabling
transfer learning across languages [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In contrast, Generative QA treats the task as a text-to-text problem,
where models like T5 unify multiple NLP tasks into a single sequence-to-sequence framework [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        With the advent of Large Language Models (LLMs) such as GPT-3 and GPT-4, zero-shot and few-shot
prompting strategies have gained significant attention. These models perform tasks by interpreting
instructions embedded in prompts, often achieving competitive results without task-specific fine-tuning
[
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ]. Zero-shot prompting has proven especially viable for low-resource settings, allowing models
to generalize to unseen tasks [
        <xref ref-type="bibr" rid="ref13 ref6">6, 13</xref>
        ].
      </p>
      <p>While most research in QA has focused on high-resource languages such as English, several efforts
have extended QA to low-resource and cross-lingual settings. IndicQA and TyDi QA are notable
benchmarks focusing on diverse Indian languages, highlighting challenges such as code-mixing,
transliteration, and limited data availability [14, 15]. Transfer learning and multilingual pretraining strategies
have been proposed to overcome these challenges, demonstrating that models pretrained on
multilingual corpora (e.g., mBERT) show strong cross-lingual transferability [16, 17].</p>
      <p>Domain-specific QA has also seen increasing interest. Specialized benchmarks in medical, legal,
and scientific domains have revealed that generic models often struggle with domain-specific jargon
and knowledge representation [18, 19, 20]. Fine-tuning on domain-specific data significantly improves
performance but remains challenging in low-resource settings.</p>
      <p>Recent studies have started exploring hybrid architectures that combine neural and symbolic
methods to improve robustness and interpretability [21, 22]. Such models aim to bridge the gap between
purely data-driven approaches and rule-based systems, often improving precision and reducing
ambiguity in specialized applications.</p>
      <p>Despite these advances, a direct comparative analysis of extractive, generative, and zero-shot paradigms
on a low-resource, culturally specific dataset such as the VATIKA Hindi QA remains underexplored.
Our work benchmarks these paradigms in a tourism domain setting, shedding light on their practical
effectiveness and identifying key areas for future improvement.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset used in this study is released as part of the FIRE 2025 VATIKA Shared Task on Hindi QA.
It is designed to support machine reading comprehension (MRC) and QA applications in the tourism
domain of Varanasi, focusing on cultural and spiritual heritage. The dataset is provided in a structured
JSON format, organized by domain → context → question-answer pairs.</p>
      <p>Each entry is organized into three primary fields: Context, a factual, descriptive paragraph in Hindi
(Devanagari script) detailing specific landmarks (e.g., temples, kunds, ghats), historical events, or
cultural rituals in Varanasi; Question, a fact-seeking wh-question in Hindi (e.g., “कहाँ,” “कब,” “कौन”),
designed to be answerable based solely on the provided context; and Answer, the ground-truth
answer, a verbatim span directly extracted from the context paragraph, enforcing an extractive span
prediction task.</p>
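      <p>As a rough illustration, each instance can be loaded and traversed as sketched below; the file name and field keys are assumptions for illustration only, since the exact schema is defined by the shared-task release.</p>
      <preformat>
import json

# Load the released training split; the file name and field keys here are
# illustrative assumptions, not the official schema.
with open("vatika_train.json", encoding="utf-8") as f:
    data = json.load(f)

for domain, contexts in data.items():        # e.g. "kund", "temple", ...
    for block in contexts:
        context = block["context"]           # Hindi paragraph in Devanagari
        for qa in block["qas"]:              # question-answer pairs for this context
            print(qa["id"], qa["question"], qa["answer"])
      </preformat>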
      <p>The dataset is pre-divided into training, validation, and test splits to ensure standardized evaluation.
The training set contains 2,452 question-answer pairs, the validation set contains 273 pairs, and the
blind test set contains 915 pairs. The full distribution is summarized in Table 1.</p>
      <p>The VATIKA dataset covers 10 tourism-relevant domains: Ganga Aarti, Cruise, Food Court, Public
Toilet, Kund, Museum, General, Ashram, Temple, and Travel. Each domain includes detailed
paragraph-level contexts followed by multiple question-answer pairs, simulating real-world information-seeking
behavior in natural Hindi language.</p>
      <p>A representative structured entry from the “kund” domain is shown below:</p>
      <sec id="sec-3-1">
        <title>Domain: kund</title>
      </sec>
      <sec id="sec-3-2">
        <title>Contexts:</title>
        <p>• मणिकर्णिका चक्र पुष्करिणी कुंड लाल बहादुर शास्त्री अंतरराष्ट्रीय ह ...</p>
        <p>– QID: kund_1467</p>
        <p>Question: मणिकर्णिका चक्र पुष्करिणी कुंड लाल बहादुर शास्त्री अंतरराष्ट्रीय हवाई
अड्डे (वाराणसी) से कितनी दूर है?
Answer: मणिकर्णिका चक्र पुष्करिणी कुंड लाल बहादुर शास्त्री अंतरराष्ट्रीय हवाई
अड्डे (वाराणसी) से 25.8 किलोमीटर दूर है।
– QID: kund_1468</p>
        <p>Question: मणिकर्णिका चक्र पुष्करिणी कुंड लाल बहादुर शास्त्री अंतरराष्ट्रीय हवाई
अड्डे के पास से कैसे पहुँचा जा सकता है?
Answer: मणिकर्णिका चक्र पुष्करिणी कुंड लाल बहादुर शास्त्री अंतरराष्ट्रीय हवाई
अड्डे से यह दूरी टैक्सी या अन्य निजी परिवहन के माध्यम से तय की जा सकती है।</p>
        <p>A qualitative review of the data highlights several key characteristics. Contexts are rich in proper
nouns (e.g., place names, deity names), dates, and factual details. The questions are predominantly
factoid, focusing on the retrieval of specific entities rather than complex reasoning or synthesis.
Answer spans are typically short phrases directly extracted from the context. This structured and curated
dataset provides a robust benchmark for evaluating extractive QA models in a specialized low-resource
Hindi setting, promoting research toward domain-specific QA systems.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>We address the problem of developing a robust Hindi QA system for the Varanasi tourism domain.
Formally, given a question Q in Hindi and a set of context paragraphs C, the goal is to produce an
answer A that is fluent, factually consistent, and derived strictly from the provided context. This can
be expressed as:
Â = arg max_{A ∈ S(C)} P(A | Q, C),</p>
      <p>where S(C) denotes the set of plausible answer spans or sequences within the context. For
extractive methods, S(C) is restricted to spans of text that exist verbatim in C, while for large language model
(LLM) approaches, S(C) encompasses all possible text sequences that can be generated from the
context.</p>
      <p>The design of our QA system is centered around three complementary computational paradigms:
generative QA using fine-tuned mT5, extractive QA using XLM-RoBERTa with post-processing, and
zero-shot answer generation using a large language model. These paradigms were selected to leverage
their respective strengths: flexibility for generative QA, precision and interpretability for extractive
QA, and contextual fluency and completeness for LLM-based generation.</p>
      <sec id="sec-4-1">
        <title>4.1. Generative QA with Fine-Tuned mT5</title>
        <p>
          Our first approach employed the mT5-small model, a multilingual version of the Text-to-Text Transfer
Transformer (T5) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The T5 framework uniquely treats all NLP tasks as a text-generation problem,
making it a flexible choice for generative QA. We fine-tuned the model on the official training set by
providing the question and context as input, with the objective of teaching the model’s decoder to
generate the ground-truth answer. Despite its potential to produce fluent responses, this approach proved
underwhelming. The model often defaulted to generic, uninformative answers (e.g., “उत्तर नहीं है”, “there is no answer”),
suggesting that the limited size of the training corpus was insufficient for robust domain adaptation.
This highlighted the significant data and computational requirements of fine-tuning generative models
for specialized tasks.
        </p>
        <p>All experiments were conducted using the PyTorch framework and the Hugging Face Transformers
library. For the generative approach, the google/mt5-small model was fine-tuned for 5 epochs with
a batch size of 8 and a learning rate of 2e-5 using the AdamW optimizer.</p>
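        <p>A condensed sketch of this configuration with the Hugging Face Transformers Trainer API is shown below; the dataset wrapping and the input template are illustrative assumptions, while the checkpoint, epochs, batch size, learning rate, and optimizer follow the settings above.</p>
        <preformat>
from datasets import Dataset
from transformers import (MT5ForConditionalGeneration, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

# `records` stands in for the official training split: a list of dicts with
# "question", "context", "answer" (illustrative field names).
records = [{"question": "...", "context": "...", "answer": "..."}]
train_dataset = Dataset.from_list(records)

def preprocess(example):
    # Cast QA as text-to-text: question plus context in, ground-truth answer out.
    source = f"question: {example['question']} context: {example['context']}"
    model_inputs = tokenizer(source, max_length=512, truncation=True)
    labels = tokenizer(text_target=example["answer"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = train_dataset.map(preprocess, remove_columns=train_dataset.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="mt5-vatika",
    num_train_epochs=5,              # as stated above
    per_device_train_batch_size=8,   # as stated above
    learning_rate=2e-5,              # as stated above
    optim="adamw_torch",             # AdamW optimizer
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
        </preformat>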
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Extractive QA with XLM-RoBERTa and Post-Processing</title>
        <p>
          The extractive paradigm employs XLM-RoBERTa (XLM-R) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a transformer-based model pretrained
for cross-lingual understanding, capable of processing Hindi text directly. The model formulates QA
as a span prediction problem: given a context paragraph   ∈  , it predicts a start token  and an end
token  such that the answer is extracted as:
        </p>
        <p>=    +1 …   ,
where   denotes the  -th token of the context.</p>
        <p>Challenges in raw predictions: Despite the model’s accuracy at identifying relevant tokens, we
observed two recurring issues: 1. Incomplete spans: The predicted spans were often too short, omitting
critical contextual information necessary for coherent understanding. 2. Low-confidence predictions:
In cases involving ambiguous questions or rare domain-specific vocabulary, the model occasionally
generated predictions with very low confidence scores, leading to unreliable outputs.</p>
        <p>To address these challenges, we devised a two-step post-processing pipeline that improves answer
completeness and reliability:</p>
        <p>1. Sentence Expansion: The predicted span (s, e) is mapped back to the full sentence containing
it, producing a more comprehensive answer:</p>
        <p>A_expanded = sentence_containing(t_s … t_e).</p>
        <p>2. Confidence Filtering: Predictions with confidence below a threshold (empirically set at 0.05)
that are unusually short are further analyzed. We check for the presence of domain-specific keywords
(e.g., names of locations, temples, or ghats relevant to Varanasi). If keywords are missing, the answer
is replaced with a standard fallback message:</p>
        <p>A_final = A_expanded if the confidence is high or keywords are present; otherwise A_final = “उत्तर उपलब्ध नहीं है” (“the answer is not available”).</p>
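        <p>A minimal sketch of these two post-processing steps is given below, under the assumption that sentences are delimited by the Devanagari danda (।) and that a domain keyword list is supplied externally:</p>
        <preformat>
FALLBACK = "उत्तर उपलब्ध नहीं है"  # standard fallback message: "the answer is not available"

def expand_to_sentence(context, start, end):
    """Map a predicted character span back to the danda-delimited sentence containing it."""
    left = context.rfind("।", 0, start) + 1
    right = context.find("।", end)
    right = len(context) if right == -1 else right + 1
    return context[left:right].strip()

def filter_prediction(expanded, score, keywords, threshold=0.05):
    """Keep the expanded answer if the model is confident or a domain keyword is present;
    otherwise return the fallback message."""
    if score >= threshold or any(kw in expanded for kw in keywords):
        return expanded
    return FALLBACK
        </preformat>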
        <p>Implementation Details: We use the deepset/xlm-roberta-base-squad2 checkpoint. Context
paragraphs are tokenized using XLM-R’s SentencePiece tokenizer. The model processes inputs in
batches of 16. By integrating sentence expansion and confidence filtering, this extractive pipeline
produces answers that are both accurate and contextually complete while remaining interpretable.</p>
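        <p>Putting the pieces together, inference with the stated checkpoint and batch size might look like the following sketch, reusing the expand_to_sentence and filter_prediction helpers above; the keyword list is an illustrative placeholder.</p>
        <preformat>
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/xlm-roberta-base-squad2",
    tokenizer="deepset/xlm-roberta-base-squad2",
    batch_size=16,
)

keywords = ["कुंड", "घाट", "मंदिर", "वाराणसी"]  # illustrative domain keywords

def answer(question, context):
    pred = qa(question=question, context=context)   # returns "score", "start", "end", "answer"
    expanded = expand_to_sentence(context, pred["start"], pred["end"])
    return filter_prediction(expanded, pred["score"], keywords)
        </preformat>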
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Zero-Shot Prompting with a Large Language Model</title>
        <p>
          As an ablation experiment, we used a large language model (LLM), specifically ChatGPT / GPT-4o mini
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], in a zero-shot setting. This model was not part of the official runs due to task restrictions
prohibiting closed-source systems. Unlike extractive QA, LLMs generate answers as free-form sequences
of text rather than extracting spans. This approach does not require fine-tuning on domain-specific
data.
        </p>
        <p>For each question-context pair (Q, C), we construct a detailed prompt that instructs the model to
answer strictly using the provided context. The prompt is formulated as:
prompt = ‘Please provide answer based on the given context only’
Q = मणिकर्णिका चक्र पुष्करिणी कुंड लाल बहादुर शास्त्री अंतरराष्ट्रीय हवाई अड्डे (वाराणसी) से कितनी दूर है?
C = मणिकर्णिका चक्र पुष्करिणी कुंड लाल बहादुर शास्त्री अंतरराष्ट्रीय ह...</p>
        <p>Advantages and rationale: The zero-shot LLM approach offers several benefits:
• Fluency: Answers are generated in grammatically correct and natural Hindi.
• Contextual completeness: The model can combine information from multiple sentences to
produce richer answers.
• High performance without fine-tuning: The model performs well in this domain, making
zero-shot prompting effective.</p>
        <p>Implementation Details: Prompts are submitted in batches of 50 question-context pairs via the
OpenAI API. Responses are parsed to extract the answer segment, discarding any additional commentary.</p>
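        <p>A minimal sketch of this prompting loop with the OpenAI Python client is shown below; the chunked iteration mirrors the batch size of 50, while the exact message layout is an assumption for illustration.</p>
        <preformat>
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = "Please provide answer based on the given context only"

def zero_shot_answers(pairs, model="gpt-4o-mini", chunk=50):
    """pairs: list of (question, context) tuples; processed in chunks of 50."""
    answers = []
    for i in range(0, len(pairs), chunk):
        for question, context in pairs[i:i + chunk]:
            resp = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": PROMPT},
                    {"role": "user", "content": f"Q = {question}\nC = {context}"},
                ],
            )
            answers.append(resp.choices[0].message.content.strip())
    return answers
        </preformat>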
        <p>Results from this ablation are reported separately for reference and are excluded from official
leaderboard discussion.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The VATIKA 2025 Shared Task evaluated submissions on Test Data-II using three complementary
families of metrics: (i) QA-F1, the primary measure balancing precision and recall; (ii) BLEU-1 to BLEU-4,
assessing lexical overlap and fluency across increasing n-gram lengths; and (iii) ROUGE-L, capturing
the longest common subsequence and content coverage. The official leaderboard, covering all
participating teams and runs, is presented in Table 2.</p>
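      <p>For reference, QA-F1 is typically computed as a token-level F1 between the predicted and gold answers, as in SQuAD-style evaluation; a minimal sketch (whitespace tokenization, no task-specific normalization) is shown below, though the official scorer may differ in details.</p>
      <preformat>
from collections import Counter

def qa_f1(prediction, reference):
    """Token-level F1 between a predicted and a gold answer (SQuAD-style)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    ref_counts = Counter(ref_tokens)
    overlap = sum(min(n, ref_counts[tok]) for tok, n in Counter(pred_tokens).items())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
      </preformat>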
      <p>IReL’s submissions show a clear progression across runs. Run 1 established a baseline (QA-F1 of
0.4169, BLEU-4 of 15.4), but its precision and recall were limited. Run 2 improved moderately in QA-F1
(0.4612), indicating better overall ranking, while maintaining similar BLEU and ROUGE-L values.</p>
      <p>Compared with other teams, IReL’s Run 2 is highly competitive. Its QA-F1 of 0.4612 surpasses all
runs from CSE_SVNIT, MUCS, Namaste NLP, NLP Fusion, and IIIT Surat, while also outperforming
AiNauts (best QA-F1 of 0.4529). In summary, IReL demonstrated steady improvements across its two
runs, culminating in Run 2, which achieved competitive performance against the best systems in the
task.</p>
      <sec id="sec-5-1">
        <title>Team</title>
        <p>AiNauts
CSE_SVNIT
IIIT Surat</p>
      </sec>
      <sec id="sec-5-2">
        <title>IReL</title>
        <p>MUCS
Namaste NLP
NLP Fusion
Scalar
VA-BO-INTERN
Run
Run 1
Run 2
Run 1
Run 2
Run 3
Run 1
Run 2
Run 3
Run 1
Run 2
Run 1
Run 2
Run 3
Run 1
Run 2
Run 3
Run 1
Run 1
Run 2
Run 3
Run 1
Run 2
Run 3</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.1. Ablation: Zero-Shot Closed-Source Baseline</title>
        <p>While Runs 1 and 2 were submitted officially, an additional ablation using a closed-source ChatGPT
model (Run 3) yielded higher scores (QA-F1 of 0.5507, BLEU-1 of 61.5, BLEU-4 of 17.9 and ROUGE-L of
0.0824). These results are provided solely for diagnostic comparison and are excluded from task
evaluation due to the use of proprietary models. However, this study indicates the potential of large models
for low-resource Hindi QA, motivating exploration of open-source instruction-tuned counterparts in
the future.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>The VATIKA 2025 Shared Task showed the difficulty of Hindi question answering in the tourism
domain of Varanasi, where data scarcity and linguistic complexity limit system performance. Among the
official open-source submissions, Run 2 achieved the best performance. An additional ablation with
ChatGPT indicated the potential of large models for low-resource Hindi QA. These results confirm
that careful refinement leads to better balance across lexical fluency, semantic coverage, and retrieval
precision. Still, challenges remain. Systems struggle with domain-specific terms, long contexts, and
ambiguous user queries. Future work should focus on fine-tuning multilingual transformers on Hindi
tourism data, and using retrieval-augmented generation to improve context–answer alignment.
Post-processing can help make outputs more complete and fluent. Hybrid pipelines combining extractive
accuracy with generative flexibility may further improve results. Incorporating structured knowledge
of cultural sites can add robustness. Domain-adaptive evaluation and query expansion strategies may
also raise coverage. Together, these directions can push Hindi QA toward more accurate, fluent, and
user-friendly systems in specialized low-resource settings.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and
spelling checking and for paraphrasing and rewording. After using these tools, the authors reviewed
and edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lopyrev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          , Squad:
          <volume>100</volume>
          ,000+
          <article-title>questions for machine comprehension of text</article-title>
          ,
          <source>in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2383</fpage>
          -
          <lpage>2392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          , D. Cheng,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Investigating transferability of pre-trained language models for neural question answering</article-title>
          , arXiv preprint arXiv:
          <year>1908</year>
          .
          <volume>08962</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          , T. Wolf,
          <article-title>Transfer learning in natural language processing</article-title>
          ,
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials</source>
          (
          <year>2019</year>
          )
          <fpage>15</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Reading wikipedia to answer open-domain questions</article-title>
          ,
          <source>in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1870</fpage>
          -
          <lpage>1879</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gatla</surname>
          </string-name>
          , Anushka,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanwar</surname>
          </string-name>
          , G. Sahoo,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Mundotiya</surname>
          </string-name>
          ,
          <article-title>Tourism question answer system in indian language using domain-adapted foundation models, arXiv preprint (</article-title>
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Zero-shot question answering by prompting pre-trained language models</article-title>
          , arXiv preprint arXiv:
          <year>2009</year>
          .
          <volume>07118</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>8440</fpage>
          -
          <lpage>8451</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          . acl- main.747.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          , arXiv preprint arXiv:
          <year>2005</year>
          .
          <volume>14165</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , Gpt-4
          <source>technical report, arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tewari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Leveraging chatgpt and xlm-roberta for sarcasm detection in dravidian code-mixed languages</article-title>
          ,
          <source>in: Proceedings of FIRE (Working Notes)</source>
          ,
          <source>Forum for Information Retrieval Evaluation</source>
          ,
          <year>2024</year>
          , India,
          <year>2024</year>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>4054</volume>
          /
          <fpage>T4</fpage>
          -14.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] D. Kakwani, A. Ghosal, M. Shrivastava, S. Sitaram, V. Sastry, P. Talukdar, IndicNLP suite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages, in: Findings of the Association for Computational Linguistics (EMNLP), 2020, pp. 4947-4958.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] C. Clerwall, D. Y. Tang, A survey of question answering in low-resource languages, ACM Computing Surveys 54 (2021) 1-34.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] T. Pires, E. Schlinger, D. Garrette, How multilingual is multilingual BERT?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4996-5001. doi:10.18653/v1/P19-1493.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] J. Phang, X. Guo, K. Tran, K. Cho, English is enough! Leveraging English data in code-switching language modeling, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021, pp. 2421-2435.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234-1240.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 2898-2904. doi:10.18653/v1/2020.findings-emnlp.261.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615-3620. doi:10.18653/v1/D19-1371.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] S. Gupta, P. Malik, A. Jaiswal, S. Jha, R. Prasad, Neural-symbolic approaches in natural language processing: A survey, arXiv preprint arXiv:2105.06375 (2021).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] Z. Dai, Y. Sun, Y. Zhang, Q. Liu, A survey of knowledge-enhanced text generation, IEEE Transactions on Knowledge and Data Engineering 33 (2021) 3567-3584.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>