<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>InfoTextCM: Addressing Code-Mixed Data Retrieval Challenges via Text Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aditya Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Birla Institute of Technology and Science</institution>
          ,
          <addr-line>Pilani</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>2</fpage>
      <lpage>15</lpage>
      <abstract>
<p>Code-mixed IR in Bengali and English requires specialized methodologies, as conventional IR systems often face challenges due to the complexity of code-mixing, including language ambiguity, syntactic diversity, and vocabulary overlap. This paper evaluates several approaches for code-mixed retrieval, focusing on advanced techniques such as Sentence-BERT. Specifically, we tackle the task of retrieving relevant documents for code-mixed Bengali-English queries from a large corpus. We explore various models, including traditional retrieval techniques like BM25 and PL2, and compare them against transformer-based approaches like Sentence-BERT. Additionally, we investigate the effectiveness of graph neural network-based models, leveraging semantic relationships to enhance retrieval performance. Our findings demonstrate that transformer-based models are able to handle the intricacies of code-mixed language, providing more accurate and context-aware retrieval results.</p>
      </abstract>
      <kwd-group>
<kwd>Code-mixed</kwd>
        <kwd>Information-retrieval</kwd>
        <kwd>SBERT</kwd>
        <kwd>GNN</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, multilingual and code-mixed material has become much more common on blogs, social
media, and other digital platforms. In multilingual cultures, code-mixing—the practice of speakers
switching between two or more languages during a single text or conversation—is particularly common.
Bengali-English is one of the most common code-mixed language pairings in Bangladesh and India. It
offers a distinct set of possibilities and problems for information retrieval (IR) tasks in natural language
processing (NLP). Conventional IR systems are primarily designed for monolingual datasets; however, as
code-mixed language becomes more common, new methods that can handle multilingual and mixed-lingual
situations must be developed.</p>
      <p>The work of IR becomes challenging in the case of Bengali-English code-mixing because of the
morphological complexity of Bengali and the grammatical structural disparities between Bengali and
English. English has a more rigid word order and uses less inflection than Bengali, which is an
inflectional language where the meaning of words frequently alters depending on their place in a
sentence. When users mix these languages, the retrieval system must efficiently manage tokenization,
parsing, and comprehension of both languages in real time. Transliteration, in which Bengali words
are transcribed using the English alphabet, presents additional difficulties since it makes it challenging
to distinguish between the languages during processing. For instance, "tin" means "three" in Bengali,
but "metal container" or "chemical element" in English. Furthermore, the Bengali word "tin" may also
be transliterated as "teen," which in English connotes a youngster.</p>
      <p>While much previous research has concentrated on creating language models and retrieval systems for
specific languages, the code-mixed scenario calls for a blend of linguistics, machine learning, and natural
language processing (NLP) methods. Techniques such as language identification, transliteration
normalization, and multilingual embeddings have shown potential for dealing with code-mixed data.
Still, little is known about how these methods may be used in an IR system, particularly
for low-resource languages like Bengali.</p>
      <p>This paper aims to explore and propose solutions for the task of IR in code-mixed Bengali-English
data. We investigate using a fine-tuned large language model, particularly BERT, to improve retrieval
performance in this challenging context. By tackling the unique linguistic challenges posed by
code-mixed data, we hope to contribute to developing more inclusive and effective IR systems for
multilingual and code-mixed communities.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>The task is to create an information retrieval (IR) model capable of handling previously unseen queries
effectively. To mathematically formulate the problem of IR, we can define the components involved in
the retrieval process. Let us denote:
• D: a set of documents, where D = {d_1, d_2, . . . , d_N} and N is the total number of documents in the
corpus.
• Q: a set of queries, where Q = {q_1, q_2, . . . , q_M} and M is the total number of queries.
• R(d, q): a relevance function that scores the relevance of document d with respect to query q.
The objective is to retrieve the most relevant documents for each query in Q from the corpus D. For
each query q ∈ Q, we want to rank the documents in D based on their relevance scores. This can be
formulated as:</p>
      <p>S(d_i) = R(d_i, q)   (1)</p>
      <p>The documents are then ranked based on the computed relevance scores:</p>
      <p>Rank(q) = sort(S(d_1), S(d_2), . . . , S(d_N))   (2)</p>
      <p>where the sorting is performed in descending order of relevance scores. For a given threshold τ, we
can define a retrieval function that selects documents above this threshold:</p>
      <p>Retrieve(q) = {d_i ∈ D : S(d_i) > τ}   (3)</p>
      <p>This configuration mimics real-world IR problems where models have to find the most pertinent articles
from sizable corpora in response to user queries that the model has not seen in training.</p>
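      <p>For illustration, the following minimal Python sketch instantiates equations (1)-(3) on a toy corpus; the overlap-count relevance function is a hypothetical stand-in for R(d, q), not the scoring used by our models:</p>
      <preformat>
# Minimal sketch of the ranking formulation in equations (1)-(3).
# The overlap-count relevance function is a hypothetical stand-in for R(d, q).

def relevance(doc: str, query: str) -> float:
    """R(d, q): count of query tokens appearing in the document."""
    doc_tokens = set(doc.lower().split())
    return sum(1.0 for tok in query.lower().split() if tok in doc_tokens)

def retrieve(docs: list[str], query: str, tau: float = 0.0) -> list[tuple[int, float]]:
    """Score every document (eq. 1), sort descending (eq. 2),
    and keep those above the threshold tau (eq. 3)."""
    scores = [(i, relevance(d, query)) for i, d in enumerate(docs)]
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked if s > tau]

corpus = ["ami cricket khelte bhalobashi", "the weather is sunny today"]
print(retrieve(corpus, "cricket khelte"))  # [(0, 2.0)]
      </preformat>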
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
        Code-mixed language processing has attracted a lot of attention since multilingual interactions are
becoming more and more common in casual communication, especially on social media. Transformers,
including BERT[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], have shown promise in various tasks involving code-mixed languages. For example,
a study by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] focused on transformer-based language identification for Malayalam-English code-mixed
text, demonstrating the model’s efficacy in code-mixed IR tasks. [3] examines the performance and
behaviour of BERT in ranking tasks, specifically focusing on MS MARCO passage reranking and TREC
Web Track ad hoc document ranking. BERT distributes attention between query-document tokens and
favours semantic matches between paraphrase tokens, distinguishing itself from other neural rankers.
The authors concluded that BERT demonstrates strong effectiveness in question-answering focused
passage ranking tasks (MS MARCO) but is less effective in ad hoc document ranking (TREC).
      </p>
      <p>BERT and its variants have been extensively used for handling code-mixed text. For instance, [4]
utilized a multilingual BERT for sentiment analysis of COVID-19 vaccination-related code-mixed Bangla
text from social media. Another study introduced Tri-Distil-BERT [5], a multilingual model pre-trained
on Bangla, English, and Hindi, specifically fine-tuned for code-mixed text. Handling code-mixed text
poses unique challenges, such as language identification, transliteration, and context switching. [6]
introduced a character-level inclusive transformer architecture to improve IR in low-resource code-mixed
languages, demonstrating the need for specialized models. SetBERT [7] is a specialized version of the
BERT model, fine-tuned to improve query embeddings for set operations and Boolean logic queries like
Intersection (AND), Difference (NOT), and Union (OR). Traditional and neural retrieval methods often
struggle with these types of queries, but SetBERT addresses this gap by using an innovative inversed
contrastive loss to identify negative sentences. Combined with fine-tuning on a dataset generated
via GPT prompting, this approach significantly enhances retrieval performance. [8] proposes a Graph
Embedding-based Ranking Model for Product Search (GEPS) that uses graph embedding techniques
to incorporate graph-structured data into neural retrieval models. The approach helps overcome the
long-tail problem of click-through data and incorporates external heterogeneous information to improve
search results. [9] addresses the high computational complexity of re-ranking in image retrieval and
suggests a Graph Neural Network (GNN) method to increase effectiveness. The re-ranking process
consists of retrieving high-quality samples and upgrading features: creating a k-nearest-neighbour
graph is the first step, and message dissemination inside the graph is the second.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Background</title>
      <sec id="sec-4-1">
        <title>4.1. Traditional Information Retrieval models</title>
        <p>Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure that assesses the
significance of a term in a document relative to a corpus. It consists of two components: Term Frequency
(TF), which quantifies how often a term appears in a document, and Inverse Document Frequency
(IDF), which reflects the term’s rarity across the corpus. A higher TF-IDF score indicates greater
relevance of the term to the document. BM25 (Best Matching 25) [10] is a probabilistic retrieval
model that enhances TF-IDF by incorporating term saturation and document length normalization.
It utilizes parameters to adjust the influence of term frequency and document length, improving the
effectiveness of document ranking. PL2 (Poisson model) [11] is a probabilistic retrieval model based
on the Poisson distribution, which assesses the likelihood of observing a term in a document given
its frequency across the collection. This approach provides a statistical basis for scoring document
relevance. InL2 [12] is a Divergence From Randomness model that scores terms by their inverse
document frequency, combined with a Laplace after-effect and document-length normalization.
Hiemstra_LM [13] applies language modelling to IR, focusing on the relationship between query terms
and document relevance within a probabilistic framework. It emphasizes query likelihood and smooths
the document’s language model using overall collection statistics.</p>
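        <p>To make the term-weighting intuition concrete, the following is a minimal Python sketch of BM25 scoring; k1 and b are the model's free parameters, and the values shown are common defaults rather than the settings tuned in our experiments:</p>
        <preformat>
import math

# Minimal BM25 scorer over a tokenized corpus; k1 and b are the standard
# free parameters (the defaults here are conventional choices).
def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)         # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed IDF
        tf = doc.count(term)                             # term frequency
        norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        score += idf * norm
    return score

corpus = [["ami", "cricket", "khelte", "bhalobashi"], ["cricket", "news", "today"]]
print(bm25_score(["cricket", "khelte"], corpus[0], corpus))
        </preformat>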
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Sentence-BERT</title>
        <sec id="sec-4-2-1">
          <title>SBERT [14] uses a Siamese neural network made up of two BERT [1]</title>
          <p>sub-models with identical configurations, parameters, and weights
to evaluate semantic similarity on a large scale eficiently. Also, the
updating of parameters occurs simultaneously in both sub-models.</p>
        <p>SBERT, like BERT, also trains pairwise, but it does not compare
every pair. Instead, it applies a triplet loss function to compare an
anchor baseline input with a positive (true) and a negative (false)
input. The objective is to increase the spatial distance between the
anchor and the negative input as much as possible, while reducing
the spatial distance between the anchor and the positive input. The
triplet loss function can be written as:
L(a, p, n) = max(‖f(a) − f(p)‖_2 − ‖f(a) − f(n)‖_2 + ε, 0)   (4)
where ε indicates the buffer (margin) used to increase the gap between similar and different triplets,
and f(a), f(p), f(n) are the embeddings of the anchor, positive, and negative sentences.</p>
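        <p>A minimal sketch of this objective, assuming pre-computed embeddings in place of the shared SBERT encoder f:</p>
        <preformat>
import torch
import torch.nn.functional as F

# Minimal sketch of the triplet objective in equation (4); random tensors
# stand in for the outputs of the shared SBERT encoder f.
def triplet_loss(anchor, positive, negative, margin=1.0):
    d_pos = torch.norm(anchor - positive, dim=-1)  # ||f(a) - f(p)||_2
    d_neg = torch.norm(anchor - negative, dim=-1)  # ||f(a) - f(n)||_2
    return F.relu(d_pos - d_neg + margin).mean()   # max(..., 0), batch mean

a, p, n = (torch.randn(8, 384) for _ in range(3))  # batch of 384-d embeddings
print(triplet_loss(a, p, n))
        </preformat>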
        <p>SBERT applies a pooling operation to the output of the BERT models, typically computing
the mean of all the output vectors. This process derives fixed-size embeddings that are
semantically meaningful for each sentence. The embeddings produced exhibit the property that
sentences with similar meanings are situated in close proximity to one another within the vector space.
Following pooling, we obtain two embeddings corresponding to the two sentences. These embeddings
are then concatenated and trained using the softmax loss function. After training, when the model
makes a prediction, it compares both embeddings using cosine similarity, resulting in a similarity score
for this pair of sentences.</p>
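        <p>For reference, a minimal inference sketch with the sentence-transformers library; the multilingual checkpoint named here is an illustrative choice, not necessarily one of the models evaluated in this paper:</p>
        <preformat>
from sentence_transformers import SentenceTransformer, util

# Minimal SBERT inference sketch: encoding (BERT + mean pooling happens
# inside the model) followed by cosine similarity. The checkpoint name is
# one multilingual example; any SBERT checkpoint would work the same way.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

emb = model.encode(["ami cricket khelte bhalobashi",
                    "I love playing cricket"])
print(util.cos_sim(emb[0], emb[1]))  # similarity score for the pair
        </preformat>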
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Graph Neural Networks</title>
        <p>Graph Neural Networks (GNNs) are a class of neural networks designed to operate on graph-structured
data, effectively capturing the relationships and dependencies among nodes. The core of a GNN involves
two key processes: message passing and node aggregation. For a given node v in a graph G = (V, E),
the propagation of information can be expressed as:</p>
        <p>h_v^(l) = σ( W^(l) Σ_{u ∈ N(v)} h_u^(l−1) + b^(l) )   (5)</p>
        <p>Here, h_v^(l) represents the feature vector of node v at layer l, W^(l) is the learnable weight matrix,
b^(l) is a bias term, and N(v) denotes the set of neighbouring nodes of v. The aggregation function
typically sums the features of neighbouring nodes, allowing each node to update its representation
based on its local neighbourhood. The final node representations can then be transformed through a
readout function, enabling downstream tasks such as node classification, link prediction, or graph
classification. This framework allows GNNs to learn complex patterns and relationships within graph
data, making them powerful tools in various applications like social networks, molecular chemistry,
and recommendation systems.</p>
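        <p>A minimal sketch of one such message-passing layer, using a dense adjacency matrix so that a single matrix product performs the neighbourhood sum in equation (5):</p>
        <preformat>
import torch
import torch.nn as nn

# Minimal sketch of the message-passing layer in equation (5), with a dense
# adjacency matrix A so that A @ H sums each node's neighbour features.
class SimpleGNNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)  # holds W^(l) and b^(l)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # A[i, j] = 1 if j is a neighbour of i; A @ H aggregates neighbours,
        # the linear layer applies W^(l) and b^(l), relu is the nonlinearity.
        return torch.relu(self.linear(A @ H))

A = torch.tensor([[0., 1., 1.], [1., 0., 0.], [1., 0., 0.]])  # 3-node graph
H = torch.randn(3, 16)                                        # node features
print(SimpleGNNLayer(16, 8)(H, A).shape)  # torch.Size([3, 8])
        </preformat>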
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Proposed approaches</title>
      <sec id="sec-5-1">
        <title>5.1. Dataset</title>
        <p>The dataset comprises 107,900 multilingual documents in TREC format, structured akin to HTML tags,
in both Bengali and English [15]. Every document has a distinct identifier assigned to it. A train set
of 20 queries/topics, also in code-mixed English and Bengali, is provided to help train the models.
Additionally, a document with human annotations providing relevance feedback is provided, which
indicates the relevance or non-relevance of documents related to a particular query.</p>
        <p>[Table 1: Corpus statistics: number of documents, average document length, number of tokens, number of unique terms, number of postings, and number of queries.]</p>
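        <p>For orientation, a minimal sketch of parsing such TREC-format documents; the &lt;DOC&gt;, &lt;DOCNO&gt;, and &lt;TEXT&gt; tag names follow the conventional TREC layout and are assumed here for illustration rather than taken from the shared-task specification:</p>
        <preformat>
import re

# Hypothetical TREC-format parser; &lt;DOC&gt;, &lt;DOCNO&gt;, and &lt;TEXT&gt; are the
# conventional TREC tags, assumed here for illustration.
DOC_RE = re.compile(
    r"&lt;DOC&gt;.*?&lt;DOCNO&gt;\s*(.*?)\s*&lt;/DOCNO&gt;.*?&lt;TEXT&gt;(.*?)&lt;/TEXT&gt;.*?&lt;/DOC&gt;",
    re.DOTALL,
)

def parse_trec(raw: str) -> dict[str, str]:
    """Map each document identifier to its text body."""
    return {docno: text.strip() for docno, text in DOC_RE.findall(raw)}

sample = "&lt;DOC&gt;&lt;DOCNO&gt;doc-1&lt;/DOCNO&gt;&lt;TEXT&gt;ami cricket khelte bhalobashi&lt;/TEXT&gt;&lt;/DOC&gt;"
print(parse_trec(sample))  # {'doc-1': 'ami cricket khelte bhalobashi'}
        </preformat>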
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Efect of diferent preprocessing steps</title>
        <p>5.2.1. Stopword removal
Taking motivation from [16], we explored the efect of stopword removal in relevant document extraction
from a corpus. [16] classifies stopword removal methods into corpus-based and non-corpus-based.
While the corpus-based techniques are applied to the current corpus to extract a set of stopwords
specific to the current corpus, the non-corpus-based techniques may be part of some pre-generated
libraries or sources and are not dependent on the document collection at hand. Authors proposed that
corpus-based stopword removal techniques outperform non-corpus-based stopword removal techniques.
On a similar note, we performed experiments with stopword removal corpus made available by NLTK
library (both English and Bengali words), terrier platform (English only), SMART library (English only),
and Custom stopwords library prepared by [16] using tf-idf approach.
5.2.2. Stemming
After adding stemmer to the best-performing stopword removal method in the previous section, we used
"pt.TerrierStemmer.porter" English stemmer along with best-performing "custom-corpus" to perform
stemming on English tokens (words) only. We further evaluated the performance of base models by
preprocessing both the query and documents, which marginally improved the performance of these
models. The results are compiled in Table 9.</p>
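        <p>A minimal sketch of the combined preprocessing pipeline, assuming the NLTK stopword lists (including its Bengali list) and using Python's PorterStemmer as a stand-in for the Terrier Porter stemmer named above:</p>
        <preformat>
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

# Combined English + Bengali stopword list from NLTK; the corpus-based
# custom list of [16] would be swapped in here.
STOP = set(stopwords.words("english")) | set(stopwords.words("bengali"))
stemmer = PorterStemmer()  # stand-in for pt.TerrierStemmer.porter

def preprocess(text: str) -> list[str]:
    """Remove stopwords, then Porter-stem ASCII tokens only (a rough
    approximation of 'English only'; romanized Bengali is also ASCII)."""
    tokens = [t for t in text.lower().split() if t not in STOP]
    return [stemmer.stem(t) if t.isascii() else t for t in tokens]

print(preprocess("the players ami cricket khelte bhalobashi"))
        </preformat>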
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Reranking traditional models with pre-trained models</title>
        <p>In the second phase of experiments, we aimed to enhance the rankings by leveraging
transformer-based pre-trained models, specifically focusing on Sentence-BERT (SBERT) [14]. Among the
models we compared are Mixed-DistilBERT [17], IndicSBERT [18], LaBSE [19], and hkunlp-Instructor [20].</p>
        <p>Initially, we captured the document rankings from the top-performing base model and conducted
further re-ranking using different SBERT-based models. For our subsequent re-ranking experiments, we
chose the rankings obtained from Hiemstra_LM as the foundation. We compared various SBERT models
to assess their performance in code-mixed IR. The outcomes of these comparisons are summarized in
Table 10.</p>
        <p>[Tables: NDCG scores for TF_IDF, BM25, PL2, InL2, and Hiemstra_LM under the different preprocessing configurations.]</p>
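        <p>A minimal sketch of this re-ranking step, assuming an illustrative multilingual SBERT checkpoint and a candidate list produced by the base model:</p>
        <preformat>
from sentence_transformers import SentenceTransformer, util

# Minimal sketch of re-ranking a base model's top documents with SBERT;
# the checkpoint name is illustrative, not the exact model used here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def rerank(query: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Re-order base-model candidates by SBERT cosine similarity."""
    q_emb = model.encode(query, convert_to_tensor=True)
    d_emb = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, d_emb)[0]
    order = scores.argsort(descending=True)
    return [(candidates[i], float(scores[i])) for i in order.tolist()]

top_docs = ["ami cricket khelte bhalobashi", "weather report for today"]
print(rerank("cricket khela", top_docs))
        </preformat>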
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Ranking using fine-tuned SBERT models</title>
        <p>In addition to reordering the results generated by a base IR model, we also experimented with
conducting retrieval using SBERT models exclusively. To do this, we fine-tuned the SBERT models on
the code-mixed corpus. We utilized the InformationRetrievalEvaluator from the SBERT library to
enable the model to learn suitable weights for enhancing the retrieval results. The outcomes of this
process have been consolidated in Table 11. We noticed that the performance of some models declined
after fine-tuning. This could be due to the initial language in which the model was pre-trained:
fine-tuning a model pre-trained on a different language with a code-mixed Bengali-English corpus caused
further confusion, leading to a decline in performance.</p>
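        <p>A minimal fine-tuning sketch with the sentence-transformers training API; the checkpoint name and the toy query/corpus/qrels here are illustrative stand-ins for the shared-task data:</p>
        <preformat>
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Minimal fine-tuning sketch; the checkpoint and the toy data are
# illustrative stand-ins for the shared-task corpus and relevance judgments.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

train = [InputExample(texts=["cricket khela", "ami cricket khelte bhalobashi"])]
loader = DataLoader(train, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)  # (query, positive) pairs

evaluator = InformationRetrievalEvaluator(
    queries={"q1": "cricket khela"},
    corpus={"d1": "ami cricket khelte bhalobashi", "d2": "weather today"},
    relevant_docs={"q1": {"d1"}},
)

model.fit(train_objectives=[(loader, loss)], evaluator=evaluator, epochs=1)
        </preformat>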
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Rerank results from fine-tuned SBERT using GNN</title>
        <p>To further improve the ranking results obtained from fine-tuned Sentence-BERT (SBERT) models, we
explored an approach that integrates graph neural networks (GNNs). Initially, we selected the
most promising results from our best-performing SBERT model as a baseline. These selected results
were then fed into a GNN model, aiming to leverage its capability to understand and process data in
graph-structured form. This approach was motivated by the potential of GNNs to capture complex
relationships and interactions within the data, which we believed could significantly enhance
ranking accuracy. An overview of the proposed approach is compiled in Figure 2.</p>
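        <p>A minimal sketch of the idea: build a k-nearest-neighbour graph over the SBERT embeddings of the candidate documents and apply one message-passing step before rescoring. The fixed neighbour-averaging step here is illustrative; the GNN in our experiments used learnable weights:</p>
        <preformat>
import torch
import torch.nn.functional as F

# Minimal sketch of GNN-based re-ranking: k-NN graph over candidate-document
# embeddings, one untrained neighbour-averaging propagation step, then
# rescoring against the query embedding.
def gnn_rerank(q_emb: torch.Tensor, d_emb: torch.Tensor, k: int = 2):
    d_emb = F.normalize(d_emb, dim=1)                 # unit-length embeddings
    sim = d_emb @ d_emb.T                             # cosine doc-doc similarity
    knn = sim.topk(k + 1, dim=1).indices[:, 1:]       # k neighbours, self excluded
    A = torch.zeros_like(sim).scatter_(1, knn, 1.0)   # adjacency matrix
    A = A / A.sum(dim=1, keepdim=True)                # row-normalize
    h = d_emb + A @ d_emb                             # one propagation step
    return (h @ F.normalize(q_emb, dim=0)).argsort(descending=True)

q = torch.randn(16)
docs = torch.randn(5, 16)
print(gnn_rerank(q, docs))  # candidate indices, best first
        </preformat>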
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiments and results</title>
      <p>The findings on the impact of different preprocessing and stemming techniques are summarized
in Tables 2 through 9. Tables 10 and 11 display the outcomes of re-ranking the base model results
using SBERT and of conducting independent ranking using an SBERT-based IR model. The performance of
the GNN-based model in re-ranking SBERT results was notably poor. Despite our high expectations for
this methodology, the outcomes were disappointing. The integration of the GNN model did not yield the
anticipated improvement in the ranking results. Instead, the performance was suboptimal, falling short
of our initial projections. This experience has led us to reflect on potential mismatches between the
problem at hand and the model’s suitability, or possibly the need for further tuning and optimization
of the GNN parameters. Moving forward, we intend to conduct a thorough analysis to understand
the underlying reasons for this performance gap and explore alternative strategies or adjustments to
the GNN model that could potentially rectify these issues and enhance the overall effectiveness of our
approach.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Code-mixed IR in Bengali and English entails extracting helpful information from a dataset whose
content combines both languages. This is especially prevalent in multilingual cultures like India and
Bangladesh, where people often transition between languages, sometimes within the same phrase.</p>
      <p>This paper explored the performance of various techniques for code-mixed Bengali-English IR,
including traditional models, the pre-trained transformer-based Sentence-BERT (SBERT), and graph
neural networks (GNNs). While transformer-based models showed strong retrieval performance, the GNN
models did not meet expectations. This opens up avenues for future research, where we aim to
investigate and address the limitations GNN models face in handling the complexities of code-mixed
languages, intending to improve their effectiveness in IR tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <sec id="sec-8-1">
        <title>The author(s) have not employed any Generative AI tools.</title>
      <p>[3] Y. Qiao, C. Xiong, Z. Liu, Z. Liu, Understanding the behaviors of BERT in ranking, 2019. URL: https://arxiv.org/abs/1904.07531. arXiv:1904.07531.</p>
      <p>[4] M. R. Khan, S. Nawsad Rahmatullah, M. F. Islam, A. R. M. Kamal, M. A. Hossain, Sentiment analysis of COVID-19 vaccination in Bangla language with code-mixed text from social media, in: 2022 12th International Conference on Electrical and Computer Engineering (ICECE), 2022, pp. 76–79. doi:10.1109/ICECE57408.2022.10088478.</p>
      <p>[5] M. N. Raihan, D. Goswami, A. Mahmud, Mixed-Distil-BERT: Code-mixed language modeling for Bangla, English, and Hindi, 2024. URL: https://arxiv.org/abs/2309.10272. arXiv:2309.10272.</p>
      <p>[6] R. S. Bhowmick, I. Ganguli, J. Sil, Character-level inclusive transformer architecture for information gain in low resource code-mixed language, Neural Computing and Applications (2022). URL: https://api.semanticscholar.org/CorpusID:247376379.</p>
      <p>[7] Q. Mai, S. Gauch, D. Adams, SetBERT: Enhancing retrieval performance for Boolean logic and set operation queries, 2024. URL: https://arxiv.org/abs/2406.17282. arXiv:2406.17282.</p>
      <p>[8] Y. Zhang, D. Wang, Y. Zhang, Neural IR meets graph embedding: A ranking model for product search, 2019. URL: https://arxiv.org/abs/1901.08286. arXiv:1901.08286.</p>
      <p>[9] X. Zhang, M. Jiang, Z. Zheng, X. Tan, E. Ding, Y. Yang, Understanding image retrieval re-ranking: A graph neural network perspective, 2020. URL: https://arxiv.org/abs/2012.07620. arXiv:2012.07620.</p>
      <p>[10] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Found. Trends Inf. Retr. 3 (2009) 333–389. URL: https://doi.org/10.1561/1500000019. doi:10.1561/1500000019.</p>
      <p>[11] A. Singhal, C. Buckley, M. Mitra, Pivoted document length normalization, in: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '96, Association for Computing Machinery, New York, NY, USA, 1996, pp. 21–29. URL: https://doi.org/10.1145/243199.243206. doi:10.1145/243199.243206.</p>
      <p>[12] G. Amati, C. J. Van Rijsbergen, Probabilistic models of information retrieval based on measuring the divergence from randomness, ACM Transactions on Information Systems (TOIS) 20 (2002) 357–389.</p>
      <p>[13] D. Hiemstra, Using language models for information retrieval (2001).</p>
      <p>[14] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, CoRR abs/1908.10084 (2019). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.</p>
      <p>[15] S. Chanda, S. Pal, Overview of the shared task on code-mixed information retrieval from social media data, in: Forum of Information Retrieval and Evaluation FIRE-2024, 2024.</p>
      <p>[16] S. Chanda, S. Pal, The effect of stopword removal on information retrieval for code-mixed data obtained via social media, SN Comput. Sci. 4 (2023) 494. URL: https://doi.org/10.1007/s42979-023-01942-7. doi:10.1007/S42979-023-01942-7.</p>
      <p>[17] M. N. Raihan, D. Goswami, A. Mahmud, Mixed-Distil-BERT: Code-mixed language modeling for Bangla, English, and Hindi, arXiv preprint arXiv:2309.10272 (2023).</p>
      <p>[18] S. Deode, J. Gadre, A. Kajale, A. Joshi, R. Joshi, L3Cube-IndicSBERT: A simple approach for learning cross-lingual sentence representations using multilingual BERT, 2023. arXiv:2304.11434.</p>
      <p>[19] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic BERT sentence embedding, 2020. arXiv:2007.01852.</p>
      <p>[20] H. Su, W. Shi, J. Kasai, Y. Wang, Y. Hu, M. Ostendorf, W. tau Yih, N. A. Smith, L. Zettlemoyer, T. Yu, One embedder, any task: Instruction-finetuned text embeddings, 2023. URL: https://arxiv.org/abs/2212.09741. arXiv:2212.09741.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          , CoRR abs/1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Poornachandran</surname>
          </string-name>
          ,
          <article-title>Transformer based language identification for malayalam-english code-mixed text</article-title>
          ,
          <source>IEEE Access 9</source>
          (
          <year>2021</year>
          )
          <fpage>118837</fpage>
          -
          <lpage>118850</lpage>
          . doi:10.1109/ACCESS.2021.3104106.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>