<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SVNIT_CSE: Building a question answering system for Hindi using word-embedding</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ankur Jariwala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siba Sankar Sahu</string-name>
          <email>sibasankar@coed.svnit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Sardar Vallabhbhai National Institute of Technology</institution>
          ,
          <addr-line>Surat</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>Question answering (QA) is an important application in the text analysis domain that enables machines to provide precise and factual answers. The rich morphology and diverse syntactic structures of Indian languages present both challenges and opportunities for the development of effective QA systems. As part of the FIRE 2025 VATIKA shared task, our SVNIT_CSE team explored different word embedding models for a Hindi QA system. From the evaluation results, we found that the FastText embeddings with cosine similarity approach outperforms other methods, providing a BLEU score of 89.5 and an F1 score of 0.9180 on test data I. Similarly, it provides a BLEU score of 43.7 and an F1 score of 0.433 on test data II. The word embedding model contributes to building scalable, robust, and inclusive QA systems that can support access to information for millions of Indian language speakers.</p>
      </abstract>
      <kwd-group>
        <kwd>Low resource languages</kwd>
        <kwd>Natural language processing</kwd>
        <kwd>Question answering system</kwd>
        <kwd>Word embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Question answering (QA) is an application of information retrieval (IR) and natural language processing (NLP) that answers user-generated natural language queries. In general, QA systems are divided into two types based on the source of knowledge. The first type is extractive, or closed domain; it uses span extraction from the available context to answer questions. The second type is generative, or open domain, which utilizes multiple large-scale data sources, such as the Web, documents, and journals, to answer questions. Because people want quick and relevant information, QA systems find application in different fields such as healthcare, education, business, and daily life through digital assistants.</p>
<p>In a traditional search engine, the user gets a list of documents, websites, or references based on simple keyword matching; a QA system, however, provides concise, accurate, and contextually relevant answers by understanding the user’s query. The most important feature of a QA system is its ability to understand natural language queries and provide relevant information from structured or unstructured sources. QA systems also differ substantially from generative-AI-based conversational tools: large language models like ChatGPT operate in a much broader scope, keep track of context across multiple turns, and perform in a more natural, open-ended manner, whereas a QA system provides answers within a specific domain.</p>
      <p>
        Advanced search engines, virtual assistants, and conversational AI systems are built for high-resource
languages such as English and other European languages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, such AI systems are far less available for
low-resource languages. Developing AI systems for low-resource languages closes the
digital divide and makes it easier for people to access digital services, healthcare, and education. The
development of QA systems for low-resource languages is therefore of immense importance in making information
access more inclusive and equitable. In this study, we explore a QA system for a low-resource Indian
language.
      </p>
      <p>CEUR Workshop Proceedings (ISSN 1613-0073)</p>
      <sec id="sec-1-1">
        <title>1.1. Task Description</title>
        <p>
          FIRE (Forum for Information Retrieval Evaluation, https://fire.irsi.org.in/) organized a shared task on a QA system for
Varanasi tourism (VATIKA, https://sites.google.com/view/vatika-2025/). The VATIKA [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] dataset comprises data from ten different tourism
domains: Ganga aarti, cruise, food court, public toilet, kund, museum,
general, ashram, temple, and travel. Each domain includes detailed paragraph-level Hindi contexts
followed by multiple QA pairs. The dataset is divided into four main parts, i.e., train, validation, test
data I, and test data II. Each QA pair in the training data comprises a unique ID, a question, and an answer.
Examples of Hindi QA pairs from the training data and the test data are shown in Fig. 1 and Fig. 2, respectively. For test data II,
participants generate the predicted answer for each query. The statistics of the VATIKA dataset are
shown in Table 1.
        </p>
        <p>[Fig. 1: a training example in SQuAD-like JSON, with a Hindi context about Bhagirath Kund and QA pairs (ids kund_636 and kund_637) whose answer fields are filled in.]</p>
        <p>[Fig. 2: a test example in the same JSON format, with a Hindi context about Adikeshav Kund and QA pairs (ids kund_001–kund_003) whose answer fields are empty.]</p>
        <p>The paper is organized as follows. Section 2 presents previous research conducted in the QA domain.
The preprocessing steps and the model architecture are presented in Section 3. The experimental results
and their analysis are described in Section 4. Finally, we conclude with directions for future work in
Section 5.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Early research work on QA systems relied heavily on human effort, both in data curation and annotation.
Researchers often used Wikipedia articles, newswire text, and encyclopedias as sources of factual
knowledge because few large-scale machine-readable datasets were available. In recent
years, several methods have been explored, such as machine learning, deep learning, and
transformer-based models, to build QA systems. In this section, we look at some existing studies on QA
systems.</p>
      <p>
        QA systems for English and other high-resource languages have come a long way [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However,
little research has been conducted on low-resource Indian languages. Nanda et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] implemented a
machine learning approach for a Hindi QA system. The approach includes accepting natural language
queries, feature extraction, and classification. They trained a Naive Bayes classifier to identify the
correct label class for a given query. They evaluated the performance on two test sets and achieved
accuracies of 92% and 88%. Hermjakob [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] combined the Penn Treebank training corpus with a question
treebank for question classification. The experimental results show that adding question
sentences to the training data increases accuracy from 65.5% to 97.3%. Zhang and Lee [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] evaluated five
machine learning algorithms: nearest neighbors, Naive Bayes, decision tree, sparse network of windows,
and support vector machine. They trained the learning algorithms on datasets of different sizes and
tested them on the TREC dataset. They found that the SVM algorithm outperformed the other four methods
in question classification.
      </p>
      <p>
        Abdel-Nabi et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] report a survey of different deep learning models used to build QA systems. They
found that models such as convolutional neural networks, recurrent neural
networks, attention-based models, hybrid models, graph-based models, generative models, and reinforcement
learning-based models were used for different QA systems, and they evaluated the effectiveness of these
systems on a wide range of evaluation metrics. Tan et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] used a BiLSTM model to generate
embeddings of the questions and answers. They used two datasets, TREC-QA and InsuranceQA,
to evaluate the QA system. QA-LSTM/CNN with an attention mechanism provides the best performance
on TREC-QA; similarly, QA-LSTM with attention offers the best performance on InsuranceQA. Tay et
al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] proposed a holographic dual LSTM (HD-LSTM) for QA systems. They found that extensive feature
engineering was not required to build a deep learning-based QA system. The MAP
score for HD-LSTM is 0.7042, outperforming the baseline one-layer LSTM with a MAP score of
0.6280; similarly, the HD-LSTM MRR score of 0.7733 outperforms the baseline one-layer LSTM
MRR score of 0.6960. Peng and Liu [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] presented an attention-based convolutional neural network
model for the QA system. The experimental results show that the attention-based convolutional neural
network performs better than the CNN and LSTM models.
      </p>
      <p>
        Several benchmark datasets have been created to test cross-lingual abilities, such as XQuAD (Artetxe
et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), MLQA (Lewis et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]), and TyDi QA (Clark et al. [15]). These corpora use a translation-based
approach or parallel corpora to cover many languages. These resources enable a systematic
evaluation of cross-lingual transfer, where models are trained in high-resource languages and tested in
low-resource settings. The datasets also help advance research on multilingual QA. For Indian languages,
few annotated QA datasets are available on the Web. Hence, researchers have explored zero-shot
transfer [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], few-shot fine-tuning [16], and synthetic data generation [17] to build their own customized
datasets. These strategies are helpful, but the more complex morphology of Indian languages often
limits their effectiveness.
      </p>
      <p>To address these limitations, several Indian language datasets have been developed. The IndicSQuAD [18]
dataset was developed by translating the English SQuAD dataset into ten major Indian languages. The
resource is fine-tuned with both monolingual BERT and multilingual models such as MuRIL-BERT
(Khanuja et al. [19]). They found that language-specific BERT models outperform MuRIL-BERT across different Indian
languages. IndicQA is another resource that presents a standard for both extractive and abstractive QA in
different Indian languages [20]. They explored different LLMs for QA under two inference settings,
the translate test and the direct test, in different Indian languages. In the translate test, the input is
translated into English, and the output is translated back into the Indian language. In the direct test, both the
input and the output are in the native language. They found that the translate test provides better
performance than the direct test in all Indian languages.</p>
      <p>TREC: https://huggingface.co/datasets/CogComp/trec; TREC-QA: https://huggingface.co/datasets/lucadiliello/trecqa; InsuranceQA: https://huggingface.co/datasets/deccan-ai/insuranceQA-v2</p>
      <p>IndicBERT (Doddapaneni et al. [21]) is a lightweight transformer model built on the ALBERT
architecture and trained on IndicCorp, a large collection of monolingual Indian language corpora. IndicBERT
supports zero-shot transfer and cross-lingual generalization, allowing a model trained on one Indian
language to transfer knowledge to others. Compared with multilingual models such as mBERT [22] or
XLM-RoBERTa [23], IndicBERT requires fewer parameters while maintaining competitive performance,
making it more suitable for low-resource languages. Doddapaneni et al. [21] show that the
combination of IndicBERT and Samanantar provides a better average F1 score than MuRIL and mBERT across
different Indian languages. Samanantar [24] is the largest publicly available parallel corpora collection
for eleven Indian languages. Together, these datasets and models are the building blocks for furthering
QA research in Indian languages.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
      <p>We develop the QA system in three steps: data preprocessing, model design, and evaluation. In the
preprocessing step, the dataset is tokenized, segmented, and presented in a structured format. The QA
model is designed by combining traditional similarity-based methods with modern embedding-based approaches.
Finally, we evaluate the QA system using standard evaluation metrics.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Preprocessing</title>
        <p>The dataset is presented in a SQuAD-like [25] JSON structure that contains a context, questions, and answers.
We implemented different preprocessing steps and presented the data in a structured tabular format.
Each QA pair is then represented in the following way.</p>
        <p>Let Q denote the question, C = {c1, c2, …, cn} the set of candidate sentences in the context, and
A_text the ground-truth answer text. The goal is to identify the sentence in context C that best answers the question
Q. For every method, the question and the context sentences are converted into a vector space, and their
similarity is measured.</p>
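        <p>The flattening step described above can be sketched as follows. The field names ("context", "qas", "id", "question", "answer") follow the VATIKA JSON examples in Fig. 1; the helper name and the sample data are illustrative, not the authors' actual code.</p>

```python
def flatten_squad_like(raw):
    """Flatten a SQuAD-like structure into tabular (id, question, answer, context) rows.

    `raw` is a list of entries, each holding a "context" string and a "qas" list
    of {"id", "question", "answer"} dicts, mirroring the VATIKA JSON layout.
    """
    rows = []
    for entry in raw:
        context = entry["context"]
        for qa in entry["qas"]:
            rows.append({
                "id": qa["id"],
                "question": qa["question"],
                "answer": qa.get("answer", ""),  # empty for test data II
                "context": context,
            })
    return rows

# Illustrative sample in the same shape as the dataset
sample = [{
    "context": "Sentence one. Sentence two.",
    "qas": [{"id": "kund_636", "question": "Q?", "answer": "Sentence one."}],
}]
rows = flatten_squad_like(sample)
```

Each row then pairs one question with its full context, ready for sentence segmentation and vectorization.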
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model Architecture</title>
        <p>Word embeddings are compact, low-dimensional vector representations of words that capture the syntactic
and semantic characteristics of words within a continuous space. Embeddings place words with similar
meanings near each other in the vector space. This property makes them very effective for
NLP tasks such as named entity recognition [26] and sentiment analysis [27]. In this study, we
explore different word embedding models to capture the semantic similarity between question and answer
and develop an efficient QA system.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. FastText embeddings with cosine similarity approach</title>
          <p>We use pre-trained FastText [28] embeddings (https://fasttext.cc/docs/en/crawl-vectors.html) to represent words in a dense vector space. We tokenize
the input text, and the embedding method maps each word to a 300-dimensional embedding vector.
We average the embeddings of all words in a sentence to obtain sentence-level representations for both
the context and the question. The averaging produces fixed-size sentence vectors that capture the
semantic meaning of the text. For each question–answer pair, we represent the context and the question
as separate vectors and compute the cosine similarity between the question vector and the sentence
vectors of the context. The sentence with the highest cosine similarity score is selected as the predicted
answer. The approach assumes that sentences semantically similar to the question in the embedding space are
more likely to contain the correct answer span.</p>
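          <p>A minimal sketch of this pipeline, using tiny 4-dimensional toy vectors in place of the 300-dimensional pre-trained FastText embeddings; the vocabulary, dimensions, and sentences below are all illustrative.</p>

```python
import numpy as np

def sentence_vector(tokens, emb, dim=4):
    """Mean of the word vectors of the known tokens; zeros if none are known."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

def best_sentence(question, sentences, emb, dim=4):
    """Return the context sentence most cosine-similar to the question."""
    q_vec = sentence_vector(question.split(), emb, dim)
    scores = [cosine(q_vec, sentence_vector(s.split(), emb, dim)) for s in sentences]
    return sentences[int(np.argmax(scores))]

# Toy 4-dim embeddings standing in for the 300-dim FastText vectors
emb = {
    "river":    np.array([1.0, 0.0, 0.0, 0.0]),
    "distance": np.array([0.0, 1.0, 0.0, 0.0]),
    "km":       np.array([0.0, 0.9, 0.1, 0.0]),
    "temple":   np.array([0.0, 0.0, 1.0, 0.0]),
}
context = ["the temple is old", "the distance is 14 km"]
print(best_sentence("distance km", context, emb))  # → "the distance is 14 km"
```

With real FastText vectors the flow is identical; only the lookup table and the dimensionality change.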
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. FastText embeddings with machine learning approach</title>
          <p>We also explore pre-trained FastText embeddings to represent text in a continuous vector space. Each
word is mapped to a 300-dimensional embedding. Sentence representations for both the context and the
question are obtained by averaging the embeddings of their constituent words. For every
question–answer pair, we calculate a context vector and a question vector, which are then concatenated to construct the final
feature representation. The annotated answer positions in the dataset are used to create labels for
the training samples. These concatenated vectors are used to train a ridge regression model, and the
system learns to map the input representation to the correct answer spans. The performance of the
model is evaluated using the predicted and actual answers.</p>
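          <p>One plausible reading of this setup can be sketched with a closed-form ridge regression over concatenated feature vectors. The real system uses 300-dimensional FastText vectors and labels derived from the annotated answer positions; the synthetic features and labels below are purely illustrative.</p>

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X^T X + alpha*I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def ridge_predict(X, w):
    return X @ w

# Toy features: each row stands for a concatenated [question_vec ; candidate_vec]
# pair, labelled 1.0 if the candidate sentence contains the answer, else 0.0.
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 0.1, size=(20, 8))   # matching question/sentence pairs
X_neg = rng.normal(-1.0, 0.1, size=(20, 8))  # non-matching pairs
X = np.vstack([X_pos, X_neg])
y = np.array([1.0] * 20 + [0.0] * 20)

w = ridge_fit(X, y, alpha=0.1)
scores = ridge_predict(X, w)
# At test time, the candidate sentence with the highest score is returned.
```

The regularization term alpha keeps the solution stable when the 600-dimensional concatenated features are correlated, which averaged embeddings typically are.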
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Word2Vec Embeddings</title>
          <p>We investigate pre-trained Word2Vec [29] embeddings to obtain distributed representations of words.
A dense vector of fixed dimension represents each token in the dataset, regardless of whether it
originates from the context or the question. Word2Vec captures semantic similarity by placing words
with similar meanings closer together in the embedding space, which is especially beneficial for handling
the diverse Hindi vocabulary. The system uses these embeddings to turn both context passages and
questions into a machine-readable form that preserves the meaning of the words. To build sentence-level
representations, we average the Word2Vec embeddings of all the tokens in a given
text span. This gives us vectors of the same size for both the question and each sentence in the context.
The idea is that the aggregated embedding captures the main meaning of the text span, allowing the
question and context to be compared in the same vector space. We then use cosine similarity to measure the
similarity between the sentence and question vectors in the embedding space. The parameters used
for the Word2Vec model are presented in Table 2.</p>
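          <p>The neighborhood property described above can be illustrated with toy Word2Vec-style vectors; the actual model uses higher-dimensional embeddings trained with the hyper-parameters of Table 2, so both the vocabulary and the values below are made up for illustration.</p>

```python
import numpy as np

# Toy Word2Vec-style vectors: semantically related words get nearby vectors.
vocab = {
    "ghat":   np.array([0.9, 0.1, 0.0]),
    "river":  np.array([0.8, 0.2, 0.1]),
    "ticket": np.array([0.0, 0.1, 0.9]),
}

def nearest(word, vocab):
    """Return the other vocabulary word whose vector is cosine-closest to `word`."""
    v = vocab[word]
    best, best_sim = None, -2.0
    for w, u in vocab.items():
        if w == word:
            continue
        sim = float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(nearest("ghat", vocab))  # → "river": the semantically closer word
```

The same nearest-neighbor comparison, applied between a question vector and averaged sentence vectors, is what drives answer selection in the Word2Vec pipeline.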
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>In the QA system, we explore different embedding-based approaches to extract answers from the
context. To evaluate the effectiveness of the system, we use evaluation metrics such as the BLEU score,
the ROUGE score, and the F1 score. The evaluation scores of the different embedding models are presented in
Tables 3-8, and the effectiveness of the different word embedding models is shown graphically in
Figures 3a-5b. From the evaluation results, we found that the FastText method with cosine similarity
(CS) outperforms the other methods on both test data I and II in terms of BLEU and F1 score. Higher BLEU
and F1 scores indicate that FastText embeddings are better at producing exact matches and subword-level
accuracy. The Word2Vec model provides the best ROUGE scores on test data I, as shown in Table 5, whereas
FastText with machine learning offers the best ROUGE scores on test data II, as shown in Table 6. A better
ROUGE score indicates that FastText with the ML model captures broader contextual and structural
similarity. In Table 4, for FastText with cosine similarity (CS), BLEU scores start relatively high at the
unigram level but decrease for higher n-grams, reaching very low values at BLEU-4. This demonstrates
that the ground-truth answer is present in the system-generated output but not in the same sequential order. From
the analysis of the results, we found that FastText with CS provides optimal performance across different
evaluation metrics and achieves competitive performance on the shared-task leaderboard, making it the
most suitable of the evaluated models for the Hindi QA system.</p>
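      <p>A sketch of a SQuAD-style token-level F1, one common way such scores are computed for extractive QA; the official shared-task scorer may normalise text differently, so this mirrors the reported metric only in spirit.</p>

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-overlap F1 between a predicted and a gold answer string.

    Precision = overlap / |prediction tokens|, recall = overlap / |gold tokens|,
    counting repeated tokens via multiset intersection.
    """
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the distance is 14 km", "distance is 14 km"))  # ≈ 0.889
```

Unlike BLEU-4, which penalizes out-of-order n-grams, this F1 rewards any token overlap, which is why a system can score low on BLEU-4 yet high on F1 when the right words appear in a different order.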
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Question answering is an important downstream task in the text analysis domain. In this study,
we explored various embedding-based methodologies for developing a QA system for Hindi. From the
evaluation results, we found that FastText with cosine similarity outperformed the other methods, achieving
the highest BLEU and F1 scores on the different test data. In general, the results show that simpler
similarity-based methods provide better performance for Hindi QA than the traditional machine learning
method. In the future, we can explore transformer-based models to build the Hindi QA system and
improve its robustness in real-world applications.</p>
      <p>Declaration on Generative AI: During the preparation of this work, the author(s) used ChatGPT and QuillBot in order to: spelling check, paraphrase and reword. After using these tools/services, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      <p>58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 7315–7330.</p>
      <p>[15] J. H. Clark, J. Palomaki, V. Nikolaev, E. Choi, D. Garrette, M. Collins, T. Kwiatkowski, TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages, Trans. Assoc. Comput. Linguistics 8 (2020) 454–470.</p>
      <p>[16] J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, M. Johnson, XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalisation, in: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 4411–4421.</p>
      <p>[17] R. Puri, R. Spring, M. Shoeybi, M. Patwary, B. Catanzaro, Training question answering models from synthetic data, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Association for Computational Linguistics, 2020, pp. 5811–5826.</p>
      <p>[18] S. Endait, R. Ghatage, A. Kulkarni, R. Patil, R. Joshi, IndicSQuAD: A comprehensive multilingual question answering dataset for Indic languages, CoRR abs/2505.03688 (2025).</p>
      <p>[19] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Aggarwal, R. T. Nagipogu, S. Dave, S. Gupta, S. C. B. Gali, V. Subramanian, P. P. Talukdar, MuRIL: Multilingual representations for Indian languages, CoRR abs/2103.10730 (2021).</p>
      <p>[20] A. K. Singh, V. Kumar, R. Murthy, J. Sen, A. R. Mittal, G. Ramakrishnan, INDIC QA BENCHMARK: A multilingual benchmark to evaluate question answering capability of LLMs for Indic languages, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Findings of the Association for Computational Linguistics: NAACL 2025, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, Association for Computational Linguistics, 2025, pp. 2607–2626.</p>
      <p>[21] S. Doddapaneni, R. Aralikatte, G. Ramesh, S. Goyal, M. M. Khapra, A. Kunchukuttan, P. Kumar, Towards leaving no Indic language behind: Building monolingual corpora, benchmark and models for Indic languages, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 12402–12426.</p>
      <p>[22] S. Wu, M. Dredze, Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 833–844.</p>
      <p>[23] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, Association for Computational Linguistics, 2020, pp. 8440–8451.</p>
      <p>[24] G. Ramesh, S. Doddapaneni, A. Bheemaraj, M. Jobanputra, R. AK, A. Sharma, S. Sahoo, H. Diddee, M. J, D. Kakwani, N. Kumar, A. Pradeep, S. Nagaraj, D. Kumar, V. Raghavan, A. Kunchukuttan, P. Kumar, M. S. Khapra, Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages, Trans. Assoc. Comput. Linguistics 10 (2022) 145–162.</p>
      <p>[25] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: J. Su, X. Carreras, K. Duh (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, The Association for Computational Linguistics, 2016, pp. 2383–2392.</p>
      <p>[26] A. Das, D. Ganguly, U. Garain, Named entity recognition with word embeddings and Wikipedia categories for a low-resource language, ACM Trans. Asian Low Resour. Lang. Inf. Process. 16 (2017) 18:1–18:19.</p>
      <p>[27] T. P. Adewumi, F. Liwicki, M. Liwicki, Word2Vec: Optimal hyper-parameters and their impact on NLP downstream tasks, CoRR abs/2003.11645 (2020).</p>
      <p>[28] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword information, Trans. Assoc. Comput. Linguistics 5 (2017) 135–146.</p>
      <p>[29] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Y. Bengio, Y. LeCun (Eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>An investigation of user attitudes toward search engines as an information retrieval tool</article-title>
          ,
          <source>Comput. Hum. Behav</source>
          .
          <volume>19</volume>
          (
          <year>2003</year>
          )
          <fpage>751</fpage>
          -
          <lpage>765</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Gysel</surname>
          </string-name>
          ,
          <article-title>Modeling spoken information queries for virtual assistants: Open problems, challenges and opportunities</article-title>
          , in: H.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>W. E.</given-names>
          </string-name>
          <string-name>
            <surname>Duh</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>M. P.</given-names>
          </string-name>
          <string-name>
            <surname>Kato</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          Poblete (Eds.),
          <source>Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <string-name>
            <surname>SIGIR</surname>
          </string-name>
          <year>2023</year>
          , Taipei, Taiwan,
          <source>July 23-27</source>
          ,
          <year>2023</year>
          , ACM,
          <year>2023</year>
          , pp.
          <fpage>3335</fpage>
          -
          <lpage>3338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gatla</surname>
          </string-name>
          , Anushka,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanwar</surname>
          </string-name>
          , G. Sahoo,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Mundotiya</surname>
          </string-name>
          ,
          <article-title>Tourism question answer system in Indian language using domain-adapted foundation models</article-title>
          , arXiv preprint (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Simmons</surname>
          </string-name>
          ,
          <article-title>Answering english questions by computer: a survey</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>8</volume>
          (
          <year>1965</year>
          )
          <fpage>53</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kolomiyets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <article-title>A survey on question answering technology from an information retrieval perspective</article-title>
          ,
          <source>Inf. Sci.</source>
          <volume>181</volume>
          (
          <year>2011</year>
          )
          <fpage>5412</fpage>
          -
          <lpage>5434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Nanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <article-title>A Hindi question answering system using machine learning approach</article-title>
          , in:
          <source>2016 International Conference on Computational Techniques in Information and Communication Technologies (ICCTICT)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>U.</given-names>
            <surname>Hermjakob</surname>
          </string-name>
          ,
          <article-title>Parsing and question classification for question answering</article-title>
          , in:
          <source>Proceedings of the ACL 2001 Workshop on Open-Domain Question Answering</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Question classification using support vector machines</article-title>
          , in:
          <source>Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '03, Association for Computing Machinery, New York, NY, USA,
          <year>2003</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdel-Nabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Awajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Z.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <article-title>Deep learning-based question answering: a survey</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>65</volume>
          (
          <year>2023</year>
          )
          <fpage>1399</fpage>
          -
          <lpage>1485</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>dos Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>LSTM-based deep learning models for non-factoid answer selection</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Phan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Tuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <article-title>Learning to rank question answer pairs with holographic dual LSTM architecture</article-title>
          , in:
          <source>Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '17, ACM,
          <year>2017</year>
          , pp.
          <fpage>695</fpage>
          -
          <lpage>704</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Attention-based neural network for short-text question answering</article-title>
          , in:
          <source>Proceedings of the 2018 2nd International Conference on Deep Learning Technologies</source>
          , ICDLT '18, Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ruder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <article-title>On the cross-lingual transferability of monolingual representations</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020</source>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>4623</fpage>
          -
          <lpage>4637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rinott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <article-title>MLQA: evaluating cross-lingual extractive question answering</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>