<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Tool for Semantic Search in Bangla</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arup Das</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bibekananda Kundu</string-name>
          <email>bibekananda.kundu@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lokasis Ghorai</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjun Kumar Gupta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sutanu Chakraborti</string-name>
          <email>sutanuc@cse.iitm.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <kwd-group>
          <kwd>Bangla Information Retrieval</kwd>
          <kwd>Query Expansion</kwd>
          <kwd>tf-idf</kwd>
          <kwd>Latent Semantic Analysis</kwd>
          <kwd>Explicit Semantic Analysis</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Development of Advanced Computing</institution>
          ,
          <addr-line>Kolkata - 700091</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science &amp; Engineering, Indian Institute of Technology Madras</institution>
          ,
          <addr-line>Chennai - 600096</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Language Processing</institution>
          ,
          <addr-line>ALTNLP</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Bangla is a low-resource language that is highly agglutinative, and designing effective search and information retrieval systems over Bangla is quite challenging. This paper presents our explorations toward building অন্বেষা (Anwesha), a prototype for a search engine in Bangla. To the best of our knowledge, this search system is the first such initiative in Bangla that facilitates retrieval of semantically related documents by use of diverse knowledge sources like WordNet, statistical co-occurrences (by way of Latent Semantic Analysis (LSA)) and external knowledge sources like Wikipedia (by way of Explicit Semantic Analysis (ESA)). We also present our efforts to overcome the limitations of existing spell-check and lemmatization approaches in Bangla and integrate them into Anwesha. In addition, we also present methods to explain search results by highlighting keywords that LSA or ESA reckons to be semantically related to the query. Since there is no gold standard dataset available to evaluate the effectiveness of Bangla information retrieval systems, we have created a dataset containing query-document relevance pairs in two distinct domains. We analyze the system's performance on queries having different difficulty levels. Our technique could be adapted to facilitate effective semantic search in other low-resource, highly inflected languages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Advancements in search engine technology over English are yet to translate to search over
documents in Indic languages, which are relatively low-resource. One such language, Bengali
(also called Bangla) is a highly agglutinative Indo-Aryan language with more than 160 inflected
forms for verbs, 36 forms for nouns, and 24 other forms for pronouns[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Bangla has two
prominent dialect variations: Sadhu bhasa1 and Chalit bhasa2. Being the fifth most-spoken
native language with 300 million speakers globally, Bangla has witnessed the fastest growth
in internet users among the other Indic languages[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Given the ongoing efforts to digitise
Bengali literary works, there is a pressing need for tools that can facilitate semantic search
over these documents. This can also inspire research in Information Retrieval (IR) over other
low-resource, highly inflected languages similar to Bangla, such as Assamese, Maithili, Oriya,
Hindi, and Manipuri[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This paper presents our efforts toward building Anwesha, a Bangla
search engine. Though there exist Bangla search engines like Anwesan, Sandhan and Pipilika,
we attempt to address their shortcomings through Anwesha.
      </p>
      <p>Owing to Bangla’s agglutinative nature, scarce data, the shortage of benchmark corpora, the absence
of gold standard datasets for IR evaluation, and the lack of state-of-the-art Bangla language
processing tools such as stemmers, lemmatizers, part-of-speech taggers, Named Entity
Recognisers (NER) and sense disambiguation tools, search over Bangla poses several challenges.
Our primary contributions are as follows. We show how Anwesha integrates diverse knowledge
sources like WordNet, statistical co-occurrences (by way of Latent Semantic Analysis (LSA))
and external knowledge sources like Wikipedia (by way of Explicit Semantic Analysis (ESA))
for facilitating effective retrieval. We present tools for Bangla spell-checking and lemmatization
that overcome the limitations of past approaches; these have been integrated into Anwesha. In
addition, we also present methods to explain search results by highlighting keywords that LSA
or ESA reckons to be semantically related to the query. Finally, we created a Gold Standard
dataset containing human relevance judgements over queries of varying complexity, for
evaluation purposes.</p>
      <p>In the remainder of the paper, we first discuss the knowledge required to understand the vector
space approaches used in Anwesha and the related work done in Bangla Search in section 2.
The methodology used to build Anwesha is described in section 3. In section 4, we describe
the creation of a gold standard dataset to evaluate Bangla IR. We present our analysis of
Anwesha’s performance across different query complexities in section 5. In section 6, we
present plans for further improvement of Anwesha.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Literature Survey</title>
      <p>
        Knowledge sources: Anwesha makes use of three vector space approaches for retrieval. The
first is a naïve approach where the strength of association of a term to a document is captured
using term frequency and inverse document frequency (tf-idf)3. The second approach uses
LSA[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to perform dimensionality reduction. Both terms and documents are represented as
linear combinations of underlying concepts. This facilitates retrieval of documents that do
not explicitly contain the words in the query but share higher-order co-occurrences with the
query words. The third approach is ESA[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] which exploits knowledge of Wikipedia. Terms and
documents are expressed in terms of underlying concepts as in LSA, but in ESA these concepts
are interpretable. Each concept corresponds to a Wikipedia article name. Treating Wikipedia
as a corpus, the strength of association of a term to a Wikipedia article is estimated using its
tf-idf score with respect to that article.
      </p>
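      <p>To make the corpus-based approaches concrete, the following is a minimal sketch of the tf-idf and LSA retrieval paths using scikit-learn. The toy English corpus and query are placeholders for the preprocessed Bangla documents; this is an illustration of the technique, not the Anwesha implementation.</p>

```python
# Sketch of tf-idf and LSA retrieval via cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "football team wins the league final",
    "the coach praised the midfielder after the goal",
    "stock markets fell amid economic worries",
]
query = ["football league news"]

# tf-idf: documents and the query as sparse term vectors.
vec = TfidfVectorizer()
D = vec.fit_transform(docs)
q = vec.transform(query)
tfidf_scores = cosine_similarity(q, D).ravel()

# LSA: project both into a low-rank concept space (the paper uses
# 600 dimensions on the full corpus; 2 suffices for this toy data).
svd = TruncatedSVD(n_components=2, random_state=0)
D_lsa = svd.fit_transform(D)
q_lsa = svd.transform(q)
lsa_scores = cosine_similarity(q_lsa, D_lsa).ravel()

print(tfidf_scores.argmax(), lsa_scores)
```

      <p>ESA follows the same vector-space pattern, except that each concept dimension corresponds to a Wikipedia article, with term-article weights taken from tf-idf over Wikipedia.</p>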
      <p>
        3https://www.uio.no/studier/emner/matnat/ifi/IN4080/h18/lectures/vector1-%281%29.pdf
In addition to these approaches, we also integrated a WordNet-based query expansion that
helps users articulate their queries better by adding words related to the query words. The Lesk
algorithm, a knowledge-based approach, uses a thesaurus or a dictionary as external knowledge
for Word Sense Disambiguation [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We have used the IndoWordNet4 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a WordNet for Indic
languages, to assign the context-appropriate meanings to words in the query. There are 36346
synsets and 45497 unique words covered for Bangla in the IndoWordNet as of 13 April 2022. A
Python-based API called “pyiwn”5 was used to access the IndoWordNet6. We have used the
adapted Lesk algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as it overcomes the limitations of the original, simple Lesk algorithm
by further adding lemma names from hypernyms, hyponyms, holonyms and meronyms.
Spell-checker: We devised our own Bangla spell-checker that was integrated into Anwesha.
There are a few spell-checkers in Bangla. The one devised by Rakib Naushad[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] uses a Unicode
dictionary to detect non-word errors and lists all the words with minimum Levenshtein distance
as candidate solutions. However, it was unable to perform spell corrections over
similar-sounding characters like ন/ Na and ণ/ Ṇa and words having different grapheme representations
and similar phonetic utterances like সহজ/ Sahaja(EN: Easy) and শহজ/ Śahaja(EN: Easy), the
former being the correct spelling. Our spell-checker handles typographic and orthographic
mistakes to perform context-insensitive spelling correction and uses a double Metaphone
algorithm to handle phonetic errors. We present the details of our implementation in Section 3.
Lemmatization: Bengali is a highly inflectional language having 70% inflected words [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Reducing words to their roots tends to improve the effectiveness of retrieval. We have surveyed
two existing Bengali lemmatizers: BNLTK7 uses a valid suffix list for nominal inflections and a
mapping table for verbal inflections, whereas BIRS[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] uses a Trie data structure for longest
prefix matching; it simultaneously chops characters sequentially from the beginning and end
of the input word and attempts to match it with a valid root word from the corpus and finally
considers the output with minimum Edit Distance from the input word. However, both of these
implementations were inaccurate in several cases; for instance, in the case of BIRS, we got output
as উত্তরপদ/ Uttarapada(EN: the answer) instead of উত্তরপ্রদেশ/ Uttarapradesa(EN: Uttar
Pradesh) for the input word উত্তরপ্রদেশের/ Uttarapradesera(EN: of Uttar Pradesh) as it failed
to detect proper nouns. Similarly, it mapped a noun input word বীমার/ Bimara(EN: of Insurance)
to a verb মার/ Māra(EN: Beat) instead of mapping it to the correct root word বীমা/ Bimā(EN:
Insurance) as it failed to detect the parts of speech. So we developed a new lemmatizer for
Anwesha that uses a combination of valid suffix stripping and an edit distance mechanism,
and achieved more accurate results.
      </p>
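      <p>To make the spell-check idea concrete, the following is a minimal sketch of non-word detection and candidate ranking by Levenshtein distance. The tiny dictionary here is a placeholder, and the transliteration plus double Metaphone step used for phonetic errors is omitted; names are illustrative, not from the Anwesha codebase.</p>

```python
# Detect out-of-dictionary tokens and rank in-dictionary candidates
# by edit distance (classic dynamic-programming Levenshtein).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

DICTIONARY = {"সহজ", "খবর", "ফুটবল"}  # placeholder for IndicCorp + IndoWordNet vocabulary

def suggest(word: str, k: int = 3):
    if word in DICTIONARY:
        return []  # not a non-word error
    return sorted(DICTIONARY, key=lambda w: levenshtein(word, w))[:k]

print(suggest("খেবর"))  # the misspelling corrected in Figure 2
```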
      <p>
        Existing Bangla search engines: Anwesan8 is a digital library and a search engine for the
Rabindra Rachanabali collection[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. It uses Lucene Search Engine Library and the DSpace
framework for searching and indexing purposes. Presently the website is inaccessible for
exploration. Sandhan9 is a monolingual search engine restricted to tourism and health-related
4The official website and the web interface of IndoWordNet: https://www.cfilt.iitb.ac.in/indowordnet/
5pyiwn GitHub repository: https://github.com/riteshpanjwani/pyiwn
6Python notebook demonstrating usage of pyiwn in accessing IndoWordNet: https://github.com/cfiltnlp/pyiwn/
blob/master/examples/example.ipynb
7Bengali Lemmatizer by Anirudh Adhikary: https://github.com/banglakit/lemmatizer
8http://anwesan.iitkgp.ernet.in
9http://sandhan.tdil-dc.gov.in/Search
domains based on the Bag of Words model. It is focused more on enhancing recall compared
to precision, and hence the top results do not always appear to be very relevant to the query.
Also, Sandhan does not seem to understand user intent appropriately, even for tourism-related
queries. For example, a query related to the Taj Mahal not having the Taj Mahal explicitly
specified in the query cannot fetch any relevant result. The query "আগ্রায় অবস্থিত এক
বিখ্যাত স্মৃতিসৌধ"/ āgrāẏa abasthita ēka bikhyāta smr̥tisaudha(EN: A Famous Memorial in
Agra) does not return anything relevant to the Taj Mahal. Also, there is no spell-check mechanism
in Sandhan. Pipilika10 is a search engine launched in Bangladesh on April 13, 2013, that
primarily crawls data from Bangla news, Bangla blogs, and Bangla Wikipedia. It reports data of
interest to the residents of Bangladesh. Pipilika performs query expansion[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] using a pseudo
relevance feedback mechanism. However, unlike Anwesha, it falls short in terms of explicitly
incorporating knowledge of statistical co-occurrences, background and linguistic knowledge
(as in Wikipedia and IndoWordNet respectively).
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Methodology</title>
      <p>
        At a high level, Anwesha has the following three components:
1. Query and document preprocessing: The spell-check program detects the non-words in the
query and suggests the candidate words to the user. It first transliterates the valid Bangla words
obtained from IndicCorp11 and IndoWordNet into English and ranks the candidate solutions in
non-increasing order of their scores calculated using a Bayesian spell-check mechanism[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. It
also uses a double Metaphone algorithm to handle phonetic errors. If the user settings allow for
it, the query expansion module adds related terms to the query based on IndoWordNet similarity.
The matching algorithm assigns these additional terms lower weights compared to terms in
the query. The query and the documents in the test collection are preprocessed in five steps:
text normalisation[15][16], elimination of punctuation symbols, word tokenization, stopword
removal12 and lemmatization. The lemmatization process identifies the parts of speech of the
words. Then, it removes the nominal suffixes from noun words and finds the verbal inflections
for the verbs using a dictionary. If the resulting word is present in the root word corpus, it
considers the input word as a lemma; else, it finds the possible candidate keys and outputs the
word with minimum edit distance.
2. Search algorithm and relevance estimation: Cosine similarity is used over all three
vector space approaches discussed before (tf-idf, LSA and ESA) for retrieval and ranking. We
observed that a document is often ranked high even if it contains only a few words from the
query if those words have a high presence in the document. In order to prefer documents that
have more query words over those that have only a few, we have defined the relevance score
as the harmonic mean of the cosine similarity and a normalized score based on the number of
words from the query that are present in the document. For LSA, 600 concept dimensions were
used, as they yielded the best results.
3. Displaying the top ten relevant results: We rank the documents in non-increasing order
10https://pipilika.com/
11https://storage.googleapis.com/ai4bharat-public-indic-nlp-corpora/indiccorp/bn.tar.xz
12https://github.com/stopwords-iso/stopwords-iso
of their relevance scores and display the top ten documents. We explain the search results by
highlighting the keywords in the ranked documents. This is straightforward in the case of tf-idf:
the words in the query are highlighted. In the case of LSA, we first represent the concepts both
in the query and document as linear combinations of words. Then, we calculate the Hadamard
product between these representations and highlight words in the document with the highest
coefficients. In ESA, we highlight the words in a document whose representations in the concept
space have the highest cosine similarity with the concept representation of the query vector.
      </p>
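      <p>The relevance score described in step 2 can be sketched directly: the harmonic mean of the cosine similarity and a normalized score for query-term coverage. Variable names here are illustrative, not from the Anwesha codebase.</p>

```python
# Harmonic mean of cosine similarity and query-term coverage.
def relevance(cosine_sim: float, query_terms: set, doc_terms: set) -> float:
    coverage = len(query_terms.intersection(doc_terms)) / len(query_terms)
    if cosine_sim == 0 or coverage == 0:
        return 0.0
    return 2 * cosine_sim * coverage / (cosine_sim + coverage)

# A document matching all query words with moderate similarity beats
# one matching a single word with high similarity:
full = relevance(0.5, {"ফুটবল", "খবর"}, {"ফুটবল", "খবর", "দল"})
partial = relevance(0.9, {"ফুটবল", "খবর"}, {"ফুটবল", "গোল"})
print(full, partial)
```

      <p>The harmonic mean penalizes documents that score highly on one factor but poorly on the other, which is exactly the behaviour motivated above.</p>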
    </sec>
    <sec id="sec-4">
      <title>4. Gold Standard Dataset Preparation</title>
      <p>There are several IR test collections available in English13. Unfortunately, there is no Gold
standard dataset available to test the effectiveness of Bangla IR. So, we have created a document
collection containing 182 short stories, novels and essays written by Rabindranath Tagore14
and 1000 newspaper articles published in 2013 crawled from the Bangla newspaper Prothom
Alo15. The collection contains 100 newspaper articles from each of the ten categories:
বাংলাদেশ/ Bānlādēśa(EN: ‘Bangladesh’), খেলা/ khēlā(EN: ‘sports’), বিজ্ঞান ও প্রযুক্তি/ bijñāna
ō prayukti(EN: ‘technology’), বিনোদন/ binōdana(EN: ‘entertainment’), আন্তর্জাতিক/
āntarjātika(EN: ‘international’), অর্থনীতি/ arthanīti(EN: ‘economy’), জীবনযাপন/ jībanayāpana(EN:
‘life-style’), মতামত/ matāmata(EN: ‘opinion’), শিক্ষা/ śikṣā(EN: ‘education’) and আমরা/ āmarā(
EN: ‘we-are’). Using a restricted number of documents helped examine the results and focus on
precision-oriented measures. Rabindranath Tagore’s work has two diferent dialect variations:
Sadhu bhasa(101 documents) and Chalit bhasa(81 documents). Compared to the news articles,
the literary documents are very lengthy and constitute 69.32% of the unique words in the
vocabulary. We designed 94 queries: 26 queries each in complexity levels 1 and 2, 19
queries in complexity level 3 and 23 queries in complexity level 4. The definition of complexity
levels is shown in Table 1. We obtained graded relevance judgement from the Bangla users on
each of the top ten retrieved documents our search algorithms considered relevant to a query.
The users rated every document as either highly relevant (by assigning a score of 3), reasonably
or partially relevant (by assigning a score of 2) or irrelevant (by assigning a score of 1). We
collected at least five user responses for every query document pair and determined the mean
of the user ratings to calculate the final relevance of a document to a query.</p>
      <p>13http://ir.dcs.gla.ac.uk/resources/test_collections/
14https://rabindra-rachanabali.nltr.org/
15https://www.prothomalo.com/</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analyses</title>
      <p>There are ten queries from Rabindranath Tagore’s work in complexity levels 1 and 4 and seven
queries in complexity levels 2 and 3. Further, three queries in complexity level 1 belong to
Sadhu bhasa. We take eighteen queries from every complexity level to evaluate Anwesha’s
performance with respect to mean average precision(MAP), normalized discounted cumulative
gain(nDCG) and mean precision. We present our results in Table 2. Lemmatization boosted
the MAP scores by 5.56% and 8.31%, and nDCG scores by 3.33% and 4.82% on an average when
using tf-idf and LSA respectively. We observe that LSA outperforms tf-idf over MAP and mean
precision with the increase in the complexity level of the queries.</p>
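      <p>For reference, the nDCG metric reported in Table 2 can be computed over our graded relevance judgements (3 = highly relevant, 2 = partially relevant, 1 = irrelevant, as defined in section 4). The sketch below uses the common linear-gain formulation; the toy ranking is illustrative data, not a result from our evaluation.</p>

```python
import math

def dcg(rels):
    # Discounted cumulative gain: rank-1 gets log2(2)=1, rank-2 log2(3), ...
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(ranked_rels):
    # Normalize by the DCG of the ideal (descending) ordering.
    ideal = dcg(sorted(ranked_rels, reverse=True))
    return dcg(ranked_rels) / ideal if ideal > 0 else 0.0

# Relevance grades of ten retrieved documents, in ranked order (toy data).
print(round(ndcg([3, 2, 3, 1, 2, 1, 1, 2, 1, 1]), 3))
```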
      <p>In Figure 1(a), we present results showing the effectiveness of query expansion on a
representative set of fourteen queries, seven from each of the two categories. The queries from both
categories had the same query intent. However, in the second category, some words in the query
were substituted with synonymous terms. IndoWordNet augments queries with synonymous
terms after lemmatization. This case study helped confirm that IndoWordNet enhances the
user query when it does not contain terms that precisely match the content of the relevant
documents.</p>
      <p>In Figure 1(b), we compare the effectiveness of tf-idf and LSA on seven direct queries and seven
indirect queries on related themes. We observe that tf-idf outperforms LSA when the queries
are precise. In contrast, LSA consistently outperforms tf-idf when the queries and the retrieved
relevant documents do not share many words; instead, they share a common theme.
Figure 2 shows a snapshot of the user interface of Anwesha. The user can choose one of the
four options (tf-idf, IndoWordNet based query expansion, LSA and ESA) for retrieval. The figure
shows how a query "ফুটবল সংক্রান্ত খেবর"/ Phuṭabala saṅkrānta khabēra(EN: Football related
news) undergoes spelling correction (খেবর/ khabēra → খবর/ khabara) and the search results are
explained. This is achieved by highlighting the keywords in a top retrieved document that LSA
or ESA reckon to be semantically related to the query. In Figure 2, the keywords highlighted
by LSA are দল/ Dala(EN: team), কোচ/ kōca(EN: coach) and সাফ/ sāpha(EN: SAFF, a famous
football tournament). Similarly, keywords like গোলে/ gōlē(EN: goal), লিগে/ ligē(EN:
league), ক্যাম্পে/ kyāmpē(EN: camp), মিডফিল্ডার/ miḍaphilḍāra(EN: midfielder), পেলেগ্রিনি/
pēlēgrini(EN: Pellegrini, a Chilean professional football manager) and ক্যাথলিক/ kyāthalika(EN:
catholic) were identified in the remaining top retrieved documents. Interestingly, many of these
words were not explicitly present in the query but are easily seen to be relevant to the query.
Techniques like ESA help in integrating background knowledge from sources like Wikipedia
to facilitate more effective retrieval. Table 3 illustrates the effectiveness of ESA on five queries
in complexity level 4 that benefit from such background knowledge. We observe that ESA
outperforms tf-idf, query expansion using IndoWordNet and LSA by a wide margin.
There is no silver bullet that works the best across all types of queries and user requirements. An
approach like tf-idf works the best on a precise query like “পন্টিংয়ের ফার্স্টক্লাস ক্রিকেটের
শেষ দিন”/ Panṭinẏēra phārsṭaklāsa krikēṭēra śēṣa dina(EN: The last day of Ponting’s first-class
cricket). This is a query in complexity level 1 where only one relevant document(2213) has
to be retrieved from the corpus. For the same query articulated as a complexity level 2 query
such as "রিকি পন্টিংয়ের ক্রিকেট মাঠে শেষ দিন"/ Riki panṭinẏēra krikēṭa māṭhē śēṣa dina(EN:
Ricky Ponting’s last day on the cricket field), we find that both tf-idf and LSA are successful
in retrieving the relevant document. When the query is reformulated as a complexity level 3
query as "রিকি পন্টিংয়ের খেলার মাঠে শেষ দিন"/ Riki panṭinẏēra khēlāra māṭhē śēṣa dina(EN:
Last day of Ricky Ponting on the playground) we find tf-idf, LSA and ESA retrieving the
relevant document. However, when the query is represented as a complexity level 4 query, for
instance, "অস্ট্রেলিয়ার বিশ্বকাপজয়ী ব্যাটসম্যান অধিনায়কের অবসর"/ Asṭrēliẏāra
biśbakāpajaẏī byāṭasamyāna adhināẏakēra abasara(EN: Retirement of Australia’s World Cup-winning
batsman captain), it requires the background knowledge that Ricky Ponting was an
Australian World Cup-winning captain and a renowned batsman. Hence only ESA was able to
retrieve the relevant document; the other three approaches failed here. It may be noted that
due to the absence of relevant Wikipedia articles related to the literary works of Rabindranath
Tagore, we have used ESA only on the 1000 news articles.</p>
      <p>
        We independently evaluated the success of spell-check and lemmatization. Compared to Rakib
Naushad’s approach[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] our spell-check algorithm was much faster and it boosted the mean
reciprocal rank scores by 17.31% on a list of 2019 misspelt words. Our lemmatizer produced an
accuracy of 88% as opposed to 73% by BIRS[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>To the best of our knowledge, ours is the first effort to incorporate knowledge of IndoWordNet,
Wikipedia and statistical co-occurrences to facilitate semantic search in Bangla and allow for
an explanation of retrieved results by highlighting terms reckoned to be relevant to the query
by various approaches. We have also compiled relevance judgements over queries at diverse
complexity levels to create a Gold Standard dataset for evaluation and used this for systematically
analysing our results. Our technique could be adapted to facilitate effective semantic search in
other low-resource, highly inflected languages. As part of future work, we intend to use NER,
which should help contain the indiscriminate IndoWordNet-based query expansion that has
adversely affected query expansion effectiveness in select cases in our current implementation.
We also intend to handle idiomatic or multiword expressions and integrate relevance feedback
to further improve the efectiveness of our IR system. We also need to conduct a more thorough
evaluation of ESA over a wider class of queries. We have addressed the limitations of existing
lemmatization and spellcheck algorithms in our current work. Context-sensitive spellcheck
may be explored in future. The dataset16, code and a detailed analysis of our work are present
here: https://github.com/ArupDas15/Anwesha.</p>
      <p>16Dataset available in Zenodo: https://doi.org/10.5281/zenodo.6583149
[15] A. Kunchukuttan, The IndicNLP Library, https://github.com/anoopkunchukuttan/indic_
nlp_library/blob/master/docs/indicnlp.pdf, 2020.
[16] S. Alam, T. Reasat, A. S. Sushmit, S. M. Siddique, F. Rahman, M. Hasan, A. I. Humayun, A
large multi-target dataset of common bengali handwritten graphemes, in: International
Conference on Document Analysis and Recognition, Springer, 2021, pp. 383–398.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <article-title>Inflectional morphology synthesis for bengali noun, pronoun and verb systems</article-title>
          ,
          <source>in: In Proceedings of the national conference on computer processing of Bangla</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>34</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>KPMG</surname>
          </string-name>
          ,
          <article-title>Indian languages- defining india's internet</article-title>
          , https://assets.kpmg/content/dam/ kpmg/in/pdf/2017/04/Indian-languages
          <article-title>-</article-title>
          <string-name>
            <surname>Defining-</surname>
          </string-name>
          Indias-Internet.pdf, Accessed:
          <fpage>2021</fpage>
          -10- 01,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hai</surname>
          </string-name>
          , L. Ray, Bengali Language Handbook, Center for Applied Linguistics,
          <year>1966</year>
          . URL: https://files.eric.ed.gov/fulltext/ED012914.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          ,
          <article-title>Indexing by latent semantic analysis</article-title>
          ,
          <source>Journal of the American Society for Information Science</source>
          <volume>41</volume>
          (
          <year>1990</year>
          )
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Markovitch</surname>
          </string-name>
          ,
          <article-title>Computing semantic relatedness using wikipedia-based explicit semantic analysis</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lesk</surname>
          </string-name>
          ,
          <article-title>Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone</article-title>
          ,
          <source>in: Proceedings of the 5th Annual International Conference on Systems Documentation, SIGDOC '86</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>1986</year>
          , p.
          <fpage>24</fpage>
          -
          <lpage>26</lpage>
          . URL: https://doi.org/10.1145/318723.318728. doi:10.1145/318723.318728
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharyya</surname>
          </string-name>
          , Indowordnet, in
          <source>: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>3785</fpage>
          -
          <lpage>3792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <article-title>An adapted lesk algorithm for word sense disambiguation using wordnet</article-title>
          ,
          <source>in: Proceedings of CICLing 2002</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Noushad</surname>
          </string-name>
          ,
          <article-title>Bangla spell checker</article-title>
          , https://github.com/RakibNoushad/Bangla-Spell-Checker,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chakrabarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Garain</surname>
          </string-name>
          ,
          <article-title>Benlem (a bengali lemmatizer) and its role in wsd</article-title>
          ,
          <source>ACM Trans. Asian Low-Resour. Lang. Inf. Process</source>
          .
          <volume>15</volume>
          (
          <year>2016</year>
          ). URL: https://doi.org/10.1145/2835494.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kowsher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Hossen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>Bengali information retrieval system (birs)</article-title>
          (
          <year>2019</year>
          ). doi:10.5121/ijnlc.2019.8501.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>Anwesan: A search engine for bengali literary works</article-title>
          ,
          <source>World Digital Libraries</source>
          <volume>5</volume>
          (
          <year>2012</year>
          )
          <fpage>11</fpage>
          -
          <lpage>18</lpage>
          . doi:10.3233/WDL-120003.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Talha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <article-title>Query expansion for bangla search engine pipilika</article-title>
          ,
          <source>in: 2020 IEEE Region 10 Symposium (TENSYMP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1367</fpage>
          -
          <lpage>1370</lpage>
          . doi:10.1109/TENSYMP50017.2020.9231043.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Kernighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Church</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Gale</surname>
          </string-name>
          ,
          <article-title>A spelling correction program based on a noisy channel model</article-title>
          ,
          <source>in: COLING 1990 Volume 2: Papers presented to the 13th International Conference on Computational Linguistics</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>