<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VATIKA-QA: A Hybrid BERT-IndicBART Approach for Hindi Question Answering in Tourism Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Prudvi Kumar Reddy Narreddi Subbannagari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Avanish S Velidi</string-name>
          <email>avanish.232it500@nitk.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar Madasamy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Technology, National Institute of Technology</institution>
          ,
          <addr-line>Surathkal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Mohan Babu University</institution>
          ,
          <addr-line>Tirupati, Andhra Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>This paper describes the development and evaluation of an innovative hybrid encoder-decoder model for Question Answering (QA) in Hindi, focusing on the tourism sector. The importance of this research is emphasized by the increasing demand for accessible information systems, especially for languages with limited digital resources like Hindi in culturally significant regions such as Varanasi. The proposed framework uses the l3cube-pune/hindi-bert-v2 model as an effective encoder to interpret Hindi text and the ai4bharat/IndicBART model as a decoder to generate answers in natural language. The core of this system is the VATIKA dataset, a comprehensive Hindi QA resource created for machine reading comprehension within the Varanasi tourism context, covering ten different domains. The methodology involves integrating BERT and IndicBART through a linear projection layer. An additional strategy explored was the use of Named Entity Recognition (NER) to explicitly tag entities. Experimental results demonstrate the high performance of the basic BERT-IndicBART hybrid model. However, an important finding was that adding NER-based tags to the input unexpectedly caused a consistent decrease in performance across all metrics, including Exact Match, F1 Score, BLEU, and ROUGE. This paper discusses the implications of this counterintuitive result, providing empirical data on the challenges of incorporating external knowledge into QA systems for low-resource languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering (QA)</kwd>
        <kwd>Natural Language Processing (NLP)</kwd>
        <kwd>Hindi</kwd>
        <kwd>Tourism</kwd>
        <kwd>Encoder-Decoder Models</kwd>
        <kwd>BERT</kwd>
        <kwd>IndicBART</kwd>
        <kwd>Multilingual QA</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Varanasi</kwd>
        <kwd>VATIKA Dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As a key contributor to the Indian economy, tourism is instrumental in generating revenue, creating
employment, and sustaining local enterprises. It also fosters cultural interchange, aids in the
preservation of heritage, and stimulates the development of infrastructure. Varanasi, one of the world’s oldest
continually inhabited cities and a prominent spiritual and cultural hub, is a standout among India’s
numerous tourist sites. The city attracts millions of visitors annually for spiritual and cultural pursuits,
presenting distinct information access challenges due to its specialized terminology, cultural richness,
and location-specific logistics (e.g., Ganga Aarti, Kunds, Ashrams). A standard QA system would find
it difficult to grasp these subtleties, which underscores the necessity for a solution that is both
culturally aware and specific to the domain. Creating such a system for Varanasi could establish a model for
other culturally rich and linguistically diverse areas.</p>
      <p>This research originates from the “VATIKA: Varanasi Tourism in Question Answer System (Indian
language)” shared task at FIRE 2025, which seeks to build a Hindi-based multilingual QA system
focused on tourism in Varanasi. In contrast to conventional Information Retrieval (IR) systems that
provide whole documents, contemporary QA systems use breakthroughs in Natural Language Processing
(NLP) to offer precise and brief responses to user inquiries. This paper introduces a hybrid
encoder-decoder model for Hindi QA that has been trained using the VATIKA dataset. A primary contribution
of this research is the integration of Named Entity Recognition (NER) to improve the model’s
comprehension of input, along with an empirical assessment of its effects. This work thereby contributes
to the development of domain-specific QA for languages with limited resources.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>Research on Question Answering has spanned multiple paradigms, from early rule-based systems to
modern deep learning frameworks. In this section, we review prior works relevant to our study,
focusing on general QA architectures, multilingual challenges, encoder-decoder approaches, and the role of
NER in enhancing QA models.</p>
      <sec id="sec-2-1">
        <title>2.1. Overview of Question Answering Systems</title>
        <p>
          Question Answering systems have undergone substantial evolution, progressing from methods based
on rules and information retrieval to advanced machine learning frameworks. Contemporary QA
systems are typically classified into several types, including extractive QA, where the answer is
identified as a segment of text; abstractive QA, which involves generating a new answer from the context;
knowledge-based QA, which queries organized knowledge bases; and systems designed for either open
or closed domains. The emergence of transformer-based models [8] like BERT [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], GPT-2 [11], and
BART [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] has transformed the discipline, empowering models to discern intricate linguistic structures
and produce answers that are logical and contextually appropriate. These models, having been
pretrained on immense volumes of text, exhibit extraordinary abilities in understanding natural language
and executing a variety of subsequent tasks, QA included.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multilingual and Low-Resource Language QA</title>
        <p>
          Creating effective QA systems for languages besides English poses distinct difficulties, particularly for
those deemed low-resource due to a scarcity of digital text data or computational resources [12]. Hindi,
despite its large number of speakers, frequently fits this description in the context of sophisticated NLP
applications. Although general-purpose pre-trained models such as mBART [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], mT5 [13], and GPT-2 [11]
can be applied to many languages, their effectiveness can be less than optimal for particular
low-resource languages or specialized domains. Initial explorations revealed that these general-purpose
models had considerable trouble understanding Hindi context and questions, often
producing text that lacked semantic coherence or relevance to the query. This suggested a basic
deficiency in their capacity to genuinely acquire and apply the subtleties of Hindi for intricate tasks like
question answering, rather than just generating sequences that resemble Hindi. This finding
highlighted the need for a tailored approach, specifically using models that were pre-trained or fine-tuned
on Hindi or other Indian languages, to obtain meaningful results. The difficulties faced with these
generic models provided the justification for the specialized hybrid strategy employed in this research,
which utilizes models created specifically for Indian languages like l3cube-pune/hindi-bert-v2 [14] and
IndicBART [16].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Encoder-Decoder Architectures</title>
        <p>
          Encoder-decoder frameworks are essential for numerous sequence-to-sequence operations in NLP [9],
such as abstractive QA, machine translation, and text summarization. In this model, an encoder takes
an input sequence (like a question and its context) to form a detailed contextual representation.
Subsequently, a decoder utilizes this representation to produce the output sequence, such as the answer.
Models like BERT [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] are particularly effective as encoders because of their bidirectional grasp of
context, which makes them highly proficient at understanding input text. In contrast, models such as
BART [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and T5 [10] are excellent choices for decoders, as they can generate fluent and logical text.
The pairing of a high-performance encoder for comprehension with a dependable decoder for
generation creates a powerful structure for abstractive QA.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Named Entity Recognition (NER) in QA</title>
        <p>NER is a subfield of information extraction that focuses on identifying and categorizing named entities
within text into established classes like names of people, organizations, places, times, amounts,
monetary figures, and percentages [7]. Within the scope of QA, NER can act as a significant enhancement
by explicitly marking key entities in the source text. This has the potential to direct the QA model’s
focus toward critical information, consequently enhancing its capability to either extract or formulate
precise answers. For example, by classifying “Varanasi” with a &lt;LOCATION&gt; tag, the model may
better grasp its significance in a query regarding tourist attractions. Recent work like Naamapadam [15]
has provided large-scale named entity annotated datasets for Indian languages, enabling better NER
capabilities for Hindi and other Indic languages.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. VATIKA Dataset</title>
      <p>To evaluate our proposed approach, we utilize the VATIKA dataset, a specialized resource curated for
Hindi Question Answering in the tourism domain. This section outlines the dataset’s composition,
thematic scope, and statistical properties, which collectively highlight its uniqueness and relevance
for low-resource NLP research.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset Overview</title>
        <p>The VATIKA dataset, introduced in the work by Gatla et al. [17], is a specially created Hindi-language
Question Answering (QA) resource developed to facilitate Machine Reading Comprehension (MRC)
and QA tasks related to tourism. It is centered on the culturally vibrant city of Varanasi, and its
content is carefully designed to mirror authentic queries that tourists and pilgrims might have about
locations, logistical information, available amenities, and spiritual sites. A key characteristic of the
VATIKA dataset is its wide-ranging scope, covering ten separate domains relevant to tourism. These
areas encompass Ganga Aarti, cruise services, food courts, public restrooms, Kunds, museums,
general information, Ashrams, temples, and travel-related queries. Every domain in the dataset contains
thorough paragraph-long contexts in Hindi, each followed by several question-answer pairs. This
arrangement is intended to replicate real-life information-seeking scenarios using natural language,
offering a comprehensive and diverse platform for testing QA systems. The questions vary widely,
from those seeking factual information to inquiries about navigation and experiences, thereby
covering a broad spectrum of common tourist questions. The entire VATIKA dataset is in Hindi, written in
the Devanagari script, which makes it an important linguistic tool for creating and assessing QA
systems designed for Indian languages. It is suitable for both open-domain and context-based MRC-style
question answering, rendering it adaptable for a range of research applications.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Statistics</title>
        <p>The VATIKA dataset is divided into separate subsets for the purposes of model training, validation,
and testing. Table 1 provides a breakdown of the distribution of contexts and QA pairs among these
divisions.</p>
        <p>For this research, the initial training and validation collections were merged to create a more
extensive training dataset. A segment of the official test data was subsequently used as a validation set
during the stages of model development and parameter adjustment. Although this method expands the
amount of data for training, it is crucial to recognize that using a portion of the test set for validation
could result in an overly positive assessment of the model’s ability to generalize to the complete,
unseen test set. This is due to the possibility that the model may indirectly adapt to certain features of the
final evaluation data. Consequently, the final test results presented should be viewed in consideration
of this particular data allocation method.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Our methodology integrates a hybrid encoder–decoder framework with additional NER-based
enhancements. This section describes the problem formulation, training configuration, architectural
choices, and the mechanisms used for answer generation. The design rationale behind each component
is also discussed.</p>
      <sec id="sec-4-1">
        <title>4.1. Problem Formulation</title>
        <p>The task of Question Answering is framed as a problem of sequence-to-sequence conversion. The
model receives an input that is a concatenated sequence of the user’s question and the associated
context document, with both elements presented in Hindi. The model’s goal is to produce a natural
language answer sequence in Hindi that provides a direct response to the question, based on the given
context.</p>
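        <p>One common way to write this formulation down is sketched below. It assumes the standard autoregressive factorization and a token-level cross-entropy objective; the symbols are introduced here for illustration and are not taken from the system description. Here q is the question, c the context, a = (a_1, ..., a_T) the answer token sequence, and θ the model parameters.</p>
        <disp-formula>
          <tex-math><![CDATA[
P_\theta(a \mid q, c) = \prod_{t=1}^{T} P_\theta\left(a_t \mid a_{<t},\, q,\, c\right),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta\left(a_t \mid a_{<t},\, q,\, c\right)
]]></tex-math>
        </disp-formula>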
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training Configuration</title>
        <p>
          The model’s training was carried out with the AdamW optimizer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], set up with a learning rate of
3 × 10<sup>−5</sup> and a weight decay of 0.01. To regulate the learning rate during training, a linear
schedule with a warmup phase was used. The total number of training steps was determined based on
running for 5 epochs over the consolidated training data. Training was run on
a GPU-equipped machine.
        </p>
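        <p>The schedule described above can be sketched in plain Python as follows. This is a minimal illustration, assuming linear warmup from zero followed by linear decay to zero. The number of warmup steps, the batch size, and both function names are assumptions, since the text does not report them.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def linear_warmup_lr(step: int, total_steps: int, warmup_steps: int,
                     base_lr: float = 3e-5) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def total_training_steps(num_examples: int, batch_size: int, epochs: int = 5) -> int:
    """Number of optimizer steps for `epochs` passes (one step per batch)."""
    steps_per_epoch = -(-num_examples // batch_size)  # ceiling division
    return steps_per_epoch * epochs
```
]]></preformat>
        <p>With these definitions, the rate rises during warmup, peaks at the base value of 3 × 10<sup>−5</sup>, and reaches zero at the final step of the fifth epoch.</p>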
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Proposed Hybrid Model (BERT-Encoder, IndicBART-Decoder)</title>
        <sec id="sec-4-3-1">
          <title>4.3.1. Data Preparation (QADataset)</title>
          <p>The QADataset class handles the preparation of input and target data for the hybrid model. The
encoder input is created by concatenating the question and context into a single string with the format
“question: [question text] context: [context text]”. This merged text is subsequently tokenized with the
l3cube-pune/hindi-bert-v2 [14] tokenizer. To maintain consistent input sizes, sequences were padded
or truncated to a maximum length of 512 tokens. For the decoder’s target, the correct answer is
tokenized via the ai4bharat/IndicBART [16] tokenizer, with a maximum length of 128 tokens. A key part
of this procedure is the adjustment of special tokens. The BERT tokenizer employs [CLS] and [SEP]
tokens, but IndicBART needs &lt;s&gt; (start of sentence) and &lt;/s&gt; (end of sentence) tokens for generation.
Consequently, the token ID for [CLS] from BERT’s vocabulary is substituted with IndicBART’s &lt;s&gt; ID,
and the [SEP] token ID is swapped with the &lt;/s&gt; ID in the target sequences. Additionally, padding
tokens within the target sequence are masked by changing their IDs to -100. This masking guarantees
that the loss function overlooks these padding tokens during backpropagation, concentrating only on
the actual content of the answer. These apparently minor adjustments are essential for the model’s
capacity to learn and produce logical answers, reflecting a thorough strategy for data preparation and
the alignment of different model parts.</p>
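          <p>The input template, length handling, and padding mask described above can be sketched as follows. This is a simplified illustration that works on plain lists of token IDs and does not call the real tokenizers. The helper names and the padding ID used in the example are hypothetical, and the special-token substitution step is omitted.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import List

# Label value that loss implementations such as PyTorch's CrossEntropyLoss ignore by default.
IGNORE_INDEX = -100


def build_encoder_text(question: str, context: str) -> str:
    """Join question and context using the "question: ... context: ..." template."""
    return f"question: {question} context: {context}"


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int) -> List[int]:
    """Force a token-ID list to exactly max_len entries."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


def mask_padding(label_ids: List[int], pad_id: int) -> List[int]:
    """Replace padding IDs in the target with IGNORE_INDEX so they add no loss."""
    return [IGNORE_INDEX if t == pad_id else t for t in label_ids]
```
]]></preformat>
          <p>For example, a three-token answer padded to length five with a hypothetical padding ID of 0 would, after masking, contribute loss only at its first three positions.</p>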
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Encoder-Decoder Integration</title>
          <p>A central feature of this hybrid design is the connection between the BERT encoder and the
IndicBART decoder. The l3cube-pune/hindi-bert-v2 [14] model produces hidden states with a
dimension of 768, whereas the ai4bharat/IndicBART [16] model requires an input dimension of 1024 for its
encoder outputs. To resolve this dimensional mismatch and let information flow between the two models,
a linear projection layer was integrated. This layer performs a learnable transformation, converting
the 768-dimensional hidden states from BERT into the 1024-dimensional format that IndicBART
expects. This design illustrates one way of combining separately pre-trained models into a custom
architecture. The effectiveness of this projection layer can heavily impact the model’s overall performance,
marking it as a candidate for additional refinement, such as investigating non-linear
transformations.</p>
          <p>The hybrid BERT-IndicBART architecture is illustrated in Figure 1.</p>
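          <p>As a worked illustration of the projection between the two models, let h_i denote the 768-dimensional BERT hidden state at input position i. The notation is introduced here for illustration, and the presence of a bias term is an assumption. Under that assumption the layer adds fewer than one million trainable parameters:</p>
          <disp-formula>
            <tex-math><![CDATA[
\tilde{h}_i = W h_i + b, \qquad W \in \mathbb{R}^{1024 \times 768}, \quad b \in \mathbb{R}^{1024},
\qquad
\#\text{params} = 768 \times 1024 + 1024 = 787{,}456
]]></tex-math>
          </disp-formula>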
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Answer Generation</title>
          <p>To produce answers, the generate method of the BertIndicBARTQA model was employed, which uses
a beam search with four beams. This approach aids in considering multiple possible answer sequences
to identify a better one, instead of just choosing the most likely token at every stage. A number of
parameters were fine-tuned empirically to improve the quality and smoothness of the generated answers.
These included setting a maximum of 128 new tokens to cap answer length, enforcing a minimum
length of five tokens to prevent overly short responses, and enabling early stopping to halt generation
once a complete sequence is found. To further improve quality, a repetition penalty of 1.2 was applied
to deter repetitive phrasing, a length penalty of 1.0 was used to avoid favoring overly long or short
answers, and recurring trigrams were prevented to enhance fluency. Moreover, specific bad word IDs
were set for the [CLS] and [SEP] tokens to stop them from being generated, making sure the output
conforms to natural language standards. The decoder start token ID was dynamically configured to the
hi_IN language code ID for IndicBART, guaranteeing the correct start for language-specific generation.
These settings reflect the view that, in abstractive QA, the quality of the generated text is as
important as the model’s underlying understanding.</p>
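          <p>The decoding settings listed above can be collected into a single configuration, sketched below. The key names follow the keyword arguments of the Hugging Face Transformers generate method; that the custom generate method of BertIndicBARTQA forwards them in this form is an assumption. The function name is hypothetical and the token IDs are left as parameters.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import Any, Dict


def build_generation_kwargs(cls_id: int, sep_id: int, start_id: int) -> Dict[str, Any]:
    """Collect the decoding settings described in the text into one dict."""
    return {
        "num_beams": 4,               # beam search with four beams
        "max_new_tokens": 128,        # cap on answer length
        "min_length": 5,              # avoid overly short answers
        "early_stopping": True,       # stop once enough finished beams exist
        "repetition_penalty": 1.2,    # discourage repeated phrasing
        "length_penalty": 1.0,        # neutral length preference
        "no_repeat_ngram_size": 3,    # forbid repeated trigrams
        "bad_words_ids": [[cls_id], [sep_id]],  # never emit [CLS] or [SEP]
        "decoder_start_token_id": start_id,     # language-code token ID
    }
```
]]></preformat>
          <p>Keeping the settings in one place makes it straightforward to vary a single parameter, such as the number of beams, while holding the others fixed.</p>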
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. NER-Enhanced Model</title>
        <sec id="sec-4-4-1">
          <title>4.4.1. NER Integration Strategy</title>
          <p>To explore the possible advantages of supplying direct semantic indicators, a second model version
was created that integrates NER into the input text. The goal of this method was to emphasize key
entities, like locations or organizations, within the input sequence itself, thereby directing the focus of
the QA model. The ai4bharat/IndicNER [15] tokenizer and model were utilized for this task.</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>4.4.2. NER Processing</title>
          <p>The integration was carried out using two primary functions. The first, get_ner, tokenizes an
input sentence, retrieves the predicted entity categories from the ai4bharat/IndicNER model, and then
matches these entity labels to the original words in the sentence as shown in Figure 2. This results in a
list of (word, entity_label) pairs. The second function, enhance_with_ner, then handles this
output. For any word recognized as a named entity (i.e., not having the “O” label for “Outside”), it embeds
distinct XML-style tags around the word to show its entity category. As an illustration, if “Varanasi”
is recognized as a location, it would be transformed to &lt;LOCATION&gt;Varanasi&lt;/LOCATION&gt;. This
technique of adding explicit tags directly into the input text is a frequently used method to give models
structural information.</p>
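          <p>The tagging step can be sketched as follows. This is a minimal reconstruction from the description above, not the original implementation: the list of (word, entity_label) pairs is written by hand in place of the output of get_ner, and stripping a BIO prefix such as "B-" is an assumption about the label format.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple


def enhance_with_ner(pairs: List[Tuple[str, str]]) -> str:
    """Wrap every non-"O" word in XML-style tags named after its entity label."""
    out = []
    for word, label in pairs:
        if label == "O":
            out.append(word)  # not an entity: keep the word unchanged
        else:
            # Drop a BIO prefix such as "B-" or "I-" if the tagger emits one.
            tag = label.split("-", 1)[-1]
            out.append(f"<{tag}>{word}</{tag}>")
    return " ".join(out)
```
]]></preformat>
          <p>Each tagged word gains an opening and a closing tag, so the character length of the input grows with the number of entities it contains.</p>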
        </sec>
        <sec id="sec-4-4-3">
          <title>4.4.3. Integration with QA Dataset</title>
          <p>This NER-based modification was implemented within the QADataset.__getitem__ method. The raw
combined question and context string (input_text_raw) was first processed by the enhance_with_ner
function. The modified string (input_text) was subsequently used as the input for tokenization by the
l3cube-pune/hindi-bert-v2 tokenizer. Although this approach seems logical, it can create difficulties,
as the new tags become unfamiliar tokens that the BERT tokenizer, pre-trained on plain text, might
not process semantically. This could potentially interfere with original tokenization patterns or result
in the tags being interpreted as random noise.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <p>We conducted a series of experiments to assess the effectiveness of our proposed hybrid model, as well
as its NER-augmented variant. This section details the experimental setup, evaluation metrics, and a
comparative performance analysis against baseline models.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>
          The experiments utilized the l3cube-pune/hindi-bert-v2 [14] tokenizer for processing inputs and the
ai4bharat/IndicBART [16] tokenizer for generating target outputs. Training was conducted for 5
epochs with the AdamW optimizer [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], which was set to a learning rate of 3 × 10<sup>−5</sup> and a weight decay
of 0.01. A linear learning rate schedule that included warmup steps was also implemented. To
guarantee swift training and inference, all computational tasks were executed within a GPU-accelerated
environment.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation Metrics</title>
        <p>A comprehensive set of established evaluation metrics was used to rigorously measure the performance
of the QA models:</p>
        <p>Exact Match (EM): This metric measures the percentage of predictions that precisely match the
ground truth answer. It is a strict measure, requiring character-level identity.</p>
        <p>F1 Score: The harmonic mean of precision and recall, the F1 score is particularly useful for
evaluating tasks where partial matches are acceptable. It provides a balanced measure of a model’s ability to
retrieve relevant information and avoid irrelevant information.</p>
        <p>
          BLEU (Bilingual Evaluation Understudy): BLEU [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] is a widely used metric for evaluating the
quality of text generated by a machine translation system, but it is also applicable to abstractive QA. It
quantifies the n-gram overlap between the generated answer and a set of reference answers, indicating
fluency and adequacy. BLEU-1 considers unigram overlap, while BLEU-2 considers bigram overlap.
        </p>
        <p>
          ROUGE (Recall-Oriented Understudy for Gisting Evaluation): ROUGE [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a set of metrics
commonly used for evaluating summarization and machine translation. It measures the overlap of
n-grams, word sequences, and word pairs between the generated answer and reference answers.
ROUGE-1 measures unigram overlap, ROUGE-2 measures bigram overlap, and ROUGE-L measures the longest
common subsequence, reflecting sentence-level structure and fluency.
        </p>
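        <p>The core computations behind three of these metrics can be sketched in a few lines. This is a simplified illustration, assuming whitespace tokenization and whitespace stripping as the only normalization; the official shared-task scoring scripts may normalize differently, and the function names are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import List


def exact_match(pred: str, gold: str) -> int:
    """1 if the stripped strings are identical, else 0."""
    return int(pred.strip() == gold.strip())


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall (whitespace tokens)."""
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, the quantity behind ROUGE-L."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]
```
]]></preformat>
        <p>A prediction that shares most of its tokens with the reference scores zero on Exact Match but can still score highly on F1, which is why the two metrics are reported together.</p>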
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Performance Analysis</title>
        <p>The performance of both the base BERT-IndicBART hybrid model and the NER-enhanced variant was
evaluated on the test set. The results are summarized in Table 2 and Table 3.</p>
        <p>The results reveal several important findings. The base BERT-IndicBART hybrid model achieved
strong performance on the validation set, with an exact match score of 57.6% and an F1 score of 0.9092.</p>
        <p>Notably, the addition of NER tags consistently decreased performance across all evaluation metrics.
This counterintuitive result warrants a deeper analysis. A critical factor was a practical limitation
imposed by the NER integration strategy. The process of inserting XML-style tags to denote named
entities nearly doubled the length of the input sequences. This substantial increase in data size
meant that, given the available computational resources and time constraints, the NER-enhanced model
could not be trained for the same number of epochs as the baseline model.</p>
        <p>Consequently, the primary reason for the diminished scores is most likely that the model was
severely undertrained. An undertrained model would not have had sufficient exposure to the data
to learn the complex patterns of the QA task, nor could it effectively learn to interpret the novel syntax
introduced by the NER tags. This issue was likely compounded by secondary factors, such as the
tokenizer splitting the unfamiliar tags into noisy subwords, creating an even more challenging learning
environment for a model that was already not receiving adequate training. Therefore, the combination
of a more complex input representation and an unavoidable reduction in training duration ultimately
led to the observed degradation in performance.</p>
        <p>Furthermore, it is noteworthy that the google/mt5-base model secured the highest performance
on the official FIRE test set, achieving the top rank among all participants across all four BLEU scores.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This research led to the successful creation and assessment of a hybrid BERT-IndicBART Question
Answering system tailored for the VATIKA dataset, proving its proficiency in delivering precise Hindi
answers to tourism-focused questions about Varanasi. The fundamental hybrid model, which paired
the l3cube-pune/hindi-bert-v2 model as an encoder with the ai4bharat/IndicBART model as a decoder,
yielded robust results across crucial metrics like Exact Match, F1 Score, BLEU, and ROUGE,
demonstrating its competence in managing domain-specific Hindi QA.</p>
      <p>A surprising and significant discovery from this work was the decline in performance that occurred
when Named Entity Recognition (NER) tags were inserted directly into the input text. Although the
NER modification was designed to offer clear semantic pointers to the model, it consistently resulted
in lower scores on every evaluation metric. This result points to the inherent difficulties in feature
interaction and representation when merging external knowledge with pre-trained language models,
especially within the framework of low-resource languages such as Hindi. It emphasizes that the
method of presenting information to a model can be just as crucial as the information itself.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to thank the organizers of the FIRE 2025 VATIKA shared task for providing the
dataset and evaluation framework. We also acknowledge the computational resources provided by our
institutions that made this research possible.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools in the preparation of this work.</p>
      <p>[7] D. Nadeau and S. Sekine, “A Survey of Named Entity Recognition and Classification,” Lingvisticae Investigationes, vol. 30, no. 1, pp. 3–26, 2007.</p>
      <p>[8] A. Vaswani et al., “Attention Is All You Need,” Proc. NIPS, pp. 5998–6008, 2017.</p>
      <p>[9] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural Networks,” Proc. NIPS, pp. 3104–3112, 2014.</p>
      <p>[10] C. Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020.</p>
      <p>[11] A. Radford et al., “Language Models are Unsupervised Multitask Learners,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019.</p>
      <p>[12] N. Arivazhagan et al., “Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges,” arXiv preprint arXiv:1907.05019, 2019.</p>
      <p>[13] L. Xue et al., “mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer,” Proc. NAACL-HLT, pp. 483–498, 2021.</p>
      <p>[14] R. Joshi, “L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages,” arXiv preprint arXiv:2211.11418, 2022.</p>
      <p>[15] A. Mhaske, H. Kedia, S. Doddapaneni, et al., “Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages,” arXiv preprint arXiv:2212.10168, 2022.</p>
      <p>[16] R. Dabre et al., “IndicBART: A Pre-trained Model for Natural Language Generation in 11 Indian Languages,” Proc. EMNLP, pp. 7479–7494, 2022.</p>
      <p>[17] P. Gatla, Anushka, N. Kanwar, G. Sahoo, and R. K. Mundotiya, “Tourism Question Answer System in Indian Language using Domain-Adapted Foundation Models,” arXiv preprint, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , “
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,”
          <source>Proc. NAACL-HLT</source>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          , et al., “
          <article-title>BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension</article-title>
          ,”
          <source>Proc. ACL</source>
          , pp.
          <fpage>7871</fpage>
          -
          <lpage>7880</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          , et al.,
          “
          <article-title>Multilingual Denoising Pre-training for Neural Machine Translation</article-title>
          ,”
          <source>Trans. ACL</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>726</fpage>
          -
          <lpage>742</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          , “
          <article-title>Decoupled Weight Decay Regularization</article-title>
          ,”
          <source>Proc. ICLR</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , and W. Zhu, “
          <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation</article-title>
          ,”
          <source>Proc. ACL</source>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , “
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>
          ,”
          <source>Text Summarization Branches Out: Proc. ACL Workshop</source>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>