<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the shared task on code-mixed information retrieval from social media data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Supriya Chanda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukomal Pal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Institute of Technology (BHU) Varanasi</institution>
          ,
          <addr-line>Uttar Pradesh</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The rise of multilingual communication on social media platforms such as Facebook, Twitter, and WhatsApp presents a compelling challenge for information retrieval in code-mixed contexts within natural language processing. This paper provides an overview of the Code-Mixed Information Retrieval Shared Task, which was part of the FIRE-2024 conference. The task focused on evaluating how relevant code-mixed documents, drawn from a corpus of Bengali-English comments, were to a set of code-mixed queries. Six teams registered for the shared task, and two of them submitted runs. This article describes the models used by the participating teams and their performance as measured by Mean Average Precision (MAP), a widely used metric for information retrieval tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Code-Mixed</kwd>
        <kwd>Bengali</kwd>
        <kwd>English</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Social Media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The proliferation of multilingual and code-mixed content on digital platforms, especially in multilingual
societies like India, brings challenging problems for Natural Language Processing (NLP) and Information
Retrieval (IR). Code-mixing is the act of mixing two or more languages in a single discourse, a common
linguistic phenomenon. Bengali-English and Hindi-English are typical examples in India. Traditional IR
systems, mainly designed for monolingual datasets, face challenges when dealing with the complexities
of code-mixed data. This calls for new approaches tailored to these hybrid linguistic environments. As
online social networks continue to grow, many of their users communicate in native languages using
foreign scripts. This is the norm in India, where people use the Roman script on social networks. The trend
is mostly noticeable among migrants who form an online community to share relevant information and
experiences.</p>
      <p>These discussions usually contain code-mixed text, in which users write informal, colloquial language
that is often transliterated into Roman script. This lack of standardization makes it challenging to identify
and surface relevant answers from these discussions, especially when others look for the
same information later. Our task is to create a means of identifying the most relevant answers in
these code-mixed discussions. The task focuses on Roman-transliterated Bengali mixed with
English.</p>
      <p>Bengali-English code-mixing poses unique challenges for IR due to the inherent linguistic
differences between the two languages. Bengali, being an inflectional language, has rich morphological
variation, whereas English is a more rigidly structured language. These differences make standard IR
tasks, such as tokenization, parsing, and language comprehension, challenging. Further complicating
the task is the frequent use of Roman script for Bengali, which introduces transliteration issues:
non-standardized spellings and ambiguous language boundaries create additional hurdles for IR systems.</p>
      <p>
        Despite numerous advances in multilingual NLP, IR for code-mixed languages
remains under-explored. Much of the existing work has addressed language identification, sentiment
analysis, hate speech identification, and transliteration normalization. However, the application of these
techniques to IR in resource-scarce languages like Bengali still needs improvement. To bridge such gaps,
linguistic insights can be integrated with machine learning approaches to handle the nuances of code-mixed
data. In recent years, we have explored various text processing tasks on code-mixed data, such as word-level
language identification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], sentiment analysis [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ], hate speech identification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and sarcasm
detection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>This paper presents an overview of the CMIR-2024: Code-Mixed Information Retrieval from
Social Media Data<sup>1</sup> shared task, which focuses on developing IR systems for Bengali-English code-mixed
data. By addressing the linguistic complexities of code-mixed text, the task aims to contribute to more
robust and inclusive IR systems that better serve multilingual digital communities.</p>
      <p>Participants were provided with training and test datasets. The task is an information retrieval
task: given a query (Q), systems need to identify the most relevant answers among the code-mixed
documents. To our knowledge, this is the first shared task on information retrieval for Bengali-English
code-mixed text.</p>
      <p>This work discusses the models submitted to the shared task and the results of the participating
teams. The rest of the article is organized as follows: Section 2 describes the shared task. Section
3 describes the dataset. Section 4 summarizes the systems and methodologies used by each
participating team and highlights the features of each model. The analysis of the
results and findings of the methodologies submitted by the participants is presented in Section 5.
Concluding remarks are presented in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>The task<sup>2</sup> deals with automatically determining the relevance of a document to a query within
code-mixed data, mainly focusing on English and Roman-transliterated Bengali. The idea is to classify
whether a given document is relevant or not relevant to a query and to rank the documents accordingly. This
involves handling the complexities of code-mixed text, namely the coexistence of elements from two
languages and the informal, non-standardized nature of the language. At the same time, a
system should capture the correct semantic relationship between the query and the document.</p>
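      <p>We define code-mixed IR (CMIR) as the setting in which query terms and documents belong to different
languages, which may be written in their native scripts or in non-native ones. Here, both queries and documents
can contain multiple languages and scripts. Formally, a query satisfies
<inline-formula><tex-math>q \in \langle L(n), S(m) \rangle</tex-math></inline-formula>, where
<inline-formula><tex-math>n \geq 2</tex-math></inline-formula> and
<inline-formula><tex-math>m \geq 1</tex-math></inline-formula>, with
<inline-formula><tex-math>L</tex-math></inline-formula> the union of
<inline-formula><tex-math>n</tex-math></inline-formula> languages and
<inline-formula><tex-math>S</tex-math></inline-formula> the union of
<inline-formula><tex-math>m</tex-math></inline-formula> scripts. Similarly, the document pool becomes
<disp-formula><tex-math>D = \bigcup d_{l(i),s(j)}</tex-math></disp-formula>
where <inline-formula><tex-math>l(i) = \{l_1, l_2, \ldots, l_n\}</tex-math></inline-formula>,
<inline-formula><tex-math>s(j) = \{s_1, s_2, \ldots, s_m\}</tex-math></inline-formula>, and
<inline-formula><tex-math>d_{l(i),s(j)}</tex-math></inline-formula> is the set of documents in a language from
<inline-formula><tex-math>l(i)</tex-math></inline-formula> written in a script from
<inline-formula><tex-math>s(j)</tex-math></inline-formula>.</p>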
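      <p>As a rough illustration of this definition, the sketch below builds a toy document pool as the union of per-(language, script) subsets. The language and script codes, the variable and function names, and the example texts are all invented for illustration and are not drawn from the task corpus.</p>
      <preformat><![CDATA[
```python
# Hypothetical toy data: each key is a (language, script) pair and each
# value is the set of documents written in that language and script.
docs_by_lang_script = {
    ("bn", "Latn"): {"ami kolkata theke berlin jabo"},
    ("en", "Latn"): {"which visa office is closest"},
    ("bn", "Beng"): {"আমি ভালো আছি"},
}


def document_pool(subsets):
    """Return the union of all per-(language, script) document sets."""
    pool = set()
    for documents in subsets.values():
        pool |= documents
    return pool


languages = {lang for lang, _ in docs_by_lang_script}
scripts = {script for _, script in docs_by_lang_script}
pool = document_pool(docs_by_lang_script)

# The code-mixed setting needs at least two languages and at least one script.
assert len(languages) >= 2 and len(scripts) >= 1
print(len(pool))
```
]]></preformat>
      <p>A real code-mixed document usually combines several languages within a single text, so the per-language subsets in this toy example are a simplification.</p>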
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        It was challenging to find an appropriate code-mixed dataset on the web that matches our research
objectives. Therefore, we created our own dataset by gathering data from social media platforms, namely
Facebook [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We targeted groups and public pages with high engagement from Bengali-speaking users
to ensure the inclusion of code-mixed Bengali-English language data. Bengali is the native language of
people in both Bangladesh and the Indian state of West Bengal.
      </p>
      <p>During data collection, we noticed that the majority of users post their questions in
Facebook groups, where replies are made through comments. In our dataset, the original posts serve as
queries, while the comments are the documents from which information needs to be extracted. This
approach adapts the traditional information retrieval setting by treating posts as
queries and retrieving the responses from comments.</p>
      <p>The final dataset consists of 50 queries and 107,900 documents. We also tried different approaches to
identify stopwords and measure their influence on information retrieval performance. Statistics of the
dataset are as follows:</p>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Attributes</th><th>Values</th></tr>
          </thead>
          <tbody>
            <tr><td>Document and query format</td><td></td></tr>
            <tr><td>Total number of documents in the corpora</td><td>107,900</td></tr>
            <tr><td>Total number of words</td><td></td></tr>
            <tr><td>Total number of unique words</td><td></td></tr>
            <tr><td>Total number of Bengali (BN) words</td><td></td></tr>
            <tr><td>Total number of unique Bengali (BN) words</td><td></td></tr>
            <tr><td>Total number of English (EN) words</td><td></td></tr>
            <tr><td>Total number of unique English (EN) words</td><td></td></tr>
            <tr><td>Total number of queries (Q)</td><td>50</td></tr>
            <tr><td>Total number of relevant documents (QRels)</td><td></td></tr>
            <tr><td>Mean value of relevant documents per query</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p><sup>1</sup> https://cmir-iitbhu.github.io/cmir/results.html</p>
      <p><sup>2</sup> https://cmir-iitbhu.github.io/cmir/</p>
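      <p>The post-as-query, comment-as-document arrangement described in this section can be sketched roughly as follows. The thread structure, field names, identifier formats, and example texts are hypothetical and chosen only for illustration; the actual corpus format may differ.</p>
      <preformat><![CDATA[
```python
# Hypothetical thread records: one original post plus its comments.
threads = [
    {
        "post": "kolkata theke delhi train e kotokkhon lage?",
        "comments": ["rajdhani te 17 ghonta moto", "flight e jao, taratari hobe"],
    },
    {
        "post": "best place for biryani in kolkata?",
        "comments": ["arsalan try koro"],
    },
]


def build_collection(threads):
    """Turn each post into a query and each comment into a document."""
    queries, documents = {}, {}
    for t_index, thread in enumerate(threads, start=1):
        queries[f"Q{t_index}"] = thread["post"]
        for c_index, comment in enumerate(thread["comments"], start=1):
            documents[f"D{t_index}_{c_index}"] = comment
    return queries, documents


queries, documents = build_collection(threads)
print(len(queries), len(documents))
```
]]></preformat>
      <p>In this sketch all comments end up in one shared document pool, so a query can in principle retrieve comments from any thread.</p>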
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>In total, six teams registered for the CMIR-2024: Code-Mixed Information Retrieval shared task. However,
only two of them, Team BITS and TextTitans, submitted system outputs.</p>
      <p>Team BITS examined numerous techniques, ranging from classic machine learning
models to more advanced architectures built on pre-trained transformers.
Sentence-BERT was central to semantic representation, with Graph Neural Networks added
to capture relational information from the data. The team then combined these methods with the
aim of improving the retrieval of relevant information from the code-mixed text.</p>
      <p>In contrast, the TextTitans team developed a novel methodology centered on the GPT-3.5 Turbo
model. Their approach used a sequential engineering strategy to leverage the generative power of
GPT-3.5 Turbo to handle code-mixed queries and improve retrieval accuracy. By fine-tuning this
model and integrating engineering steps tailored to the specific challenges of code-mixed IR,
the team aimed to address the linguistic complexities inherent in the task.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>The evaluation of the systems submitted by Team BITS and Team TextTitans offers insight into their
performance in terms of various metrics and approaches.</p>
      <p>Team BITS tested several pre-processing and stemming techniques and reported the results. They also tried
re-ranking the base model results with SBERT and independently applied an SBERT-based information
retrieval model. Despite significant effort, integrating a GNN-based model to re-rank the SBERT
results was disappointing: the GNN model performed well below initial expectations.
This suggests either a mismatch between the task and the architecture or a need for further
tuning. The team holds that further investigation is
necessary to determine what exactly causes the GNN-based approach to underperform.
Alternative strategies and further fine-tuning of the GNN parameters will be explored in future work to
improve its ranking effectiveness.</p>
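      <p>Re-ranking a first-stage result list by embedding similarity, in the spirit of the SBERT re-ranking step mentioned above, can be sketched roughly as follows. The vectors are hand-made stand-ins rather than Sentence-BERT embeddings, and the function names and document identifiers are hypothetical; this is an illustration under those assumptions, not Team BITS's implementation.</p>
      <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank(query_vec, candidates, doc_vecs):
    """Reorder first-stage candidates by similarity to the query vector."""
    return sorted(
        candidates,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )


# Toy stand-in embeddings (assumption: a real system would encode the text).
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.6, 0.8], "d3": [0.0, 1.0]}
query_vec = [0.0, 1.0]
first_stage = ["d1", "d2", "d3"]  # e.g. the order returned by a lexical base model

print(rerank(query_vec, first_stage, doc_vecs))
```
]]></preformat>
      <p>Only the candidates returned by the first stage are re-scored, which keeps the more expensive similarity computation limited to a short list.</p>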
      <p>Team TextTitans evaluated their system’s performance using a set of standard information retrieval
metrics: Mean Average Precision (MAP), normalized Discounted Cumulative Gain (NDCG), Precision at
5 (P@5), and Precision at 10 (P@10). The results across all their submissions were very consistent, with
only minor differences. For MAP, the first four submissions all returned the same score of 0.701, while
the fifth submission scored slightly higher at 0.703. The NDCG scores for the first four submissions
were identical at 0.797 and rose slightly to 0.799 in the fifth submission. P@5 scores for all
submissions were 0.793, meaning that all runs were equally precise over the top five ranked
documents. P@10 scores were identical across all submissions at 0.766. Although the fifth submission
showed a slight gain in MAP and NDCG, the precision metrics (P@5 and P@10) remained
unchanged, which implies stable performance in retrieving relevant documents among the top-ranked
results.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Team</th><th>Submission</th><th>nDCG Score</th></tr>
          </thead>
          <tbody>
            <tr><td>TextTitans</td><td>submit_cmir</td><td></td></tr>
            <tr><td>TextTitans</td><td>submit_cmir_1</td><td></td></tr>
            <tr><td>TextTitans</td><td>submit_cmir_2</td><td></td></tr>
            <tr><td>TextTitans</td><td>submit_cmir_3</td><td></td></tr>
            <tr><td>TextTitans</td><td>submit_cmir_4</td><td></td></tr>
            <tr><td>Team BITS</td><td>submission_1</td><td></td></tr>
            <tr><td>Team BITS</td><td>submission_2</td><td></td></tr>
            <tr><td>Team BITS</td><td>submission_3</td><td></td></tr>
            <tr><td>Team BITS</td><td>submission_4</td><td></td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Comparing the two teams, the system of Team TextTitans showed more consistent performance,
with a minute improvement in ranking quality in their fifth submission (see Table 2). Their
MAP, NDCG, and precision scores suggest that the retrieval system of Team TextTitans was
stable, ranking most of the relevant documents near the top for the queries used. Meanwhile, the GNN-based
re-ranking approach of Team BITS underperformed and leaves scope for improvement.
Experiments with SBERT re-ranking by Team BITS indicated some possible improvement,
but the addition of the GNN model did not improve performance and needs further investigation.</p>
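      <p>For readers less familiar with the metrics discussed in this section, the sketch below computes P@k, average precision, MAP, and a binary-gain nDCG on a toy run. It follows common textbook definitions; the function names, the toy run, and the relevance judgments are invented for illustration, and the evaluation tooling and gain settings actually used in the task may differ.</p>
      <preformat><![CDATA[
```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at ranks where relevant documents occur."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(ranked, relevant, k):
    """Binary-gain nDCG at cutoff k."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, d in enumerate(ranked[:k], start=1)
        if d in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


def mean_average_precision(runs, qrels):
    """Average of per-query average precision over all judged queries."""
    return sum(
        average_precision(runs.get(q, []), rel) for q, rel in qrels.items()
    ) / len(qrels)


# Toy run and judgments, invented for illustration.
runs = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d6"}}
print(mean_average_precision(runs, qrels))
```
]]></preformat>
      <p>MAP rewards placing every relevant document early in the ranking, whereas P@5 and P@10 look only at the top of the list. This is one way a run can change slightly in MAP or nDCG while its P@k values stay the same.</p>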
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In conclusion, the Code-Mixed Information Retrieval Shared Task at FIRE-2024 showcased core
challenges and opportunities in retrieving relevant documents in a code-mixed scenario,
especially for Bengali-English text. The task highlighted the complexities of
informal language use and of handling multiple scripts in code-mixed data. Only
two teams provided system predictions, and the results give useful insight into how different models
might work on this task. The MAP scores indicate that, although there is some progress in this area,
much remains to be researched and modeled in order to capture the semantic subtleties of code-mixed
languages. This shared task forms a foundation for further work on code-mixed information
retrieval and encourages more advanced techniques and broader participation in future editions.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>We would like to express our sincere gratitude to Prof. Kripabandhu Ghosh (IISER Kolkata, India)
and Prof. Thomas Mandl (Universität Hildesheim, Germany) for giving us the opportunity to
organize this task as part of FIRE 2024. We deeply appreciate their trust and collaboration, which have
significantly contributed to the growth and recognition of our work.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Advancing language identification in code-mixed Tulu texts: Harnessing deep learning techniques</article-title>
          ,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>IRLab@IITBHU@Dravidian-CodeMix-FIRE2020: Sentiment analysis for Dravidian languages in code-mixed text</article-title>
          ,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>535</fpage>
          -
          <lpage>540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis and homophobia detection of code-mixed Dravidian languages leveraging pre-trained model and word-level language tag</article-title>
          , in:
          <source>Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation (Hybrid)</source>
          , CEUR,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis of code-mixed Dravidian languages leveraging pretrained model and word-level language tag</article-title>
          ,
          <source>Natural Language Processing</source>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1017/nlp.2024.30</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Coarse and fine-grained conversational hate speech and offensive content identification in code-mixed languages using fine-tuned multilingual embedding</article-title>
          , in:
          <source>Forum for Information Retrieval Evaluation (Working Notes) (FIRE)</source>
          , CEUR-WS.org
          ,
          <year>2022</year>
          , pp.
          <fpage>502</fpage>
          -
          <lpage>512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>Sarcasm detection in Tamil and Malayalam Dravidian code-mixed text</article-title>
          ,
          <source>in: FIRE (Working Notes)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>The effect of stopword removal on information retrieval for code-mixed data obtained via social media</article-title>
          ,
          <source>SN Comput. Sci.</source>
          <volume>4</volume>
          (
          <year>2023</year>
          )
          <elocation-id>494</elocation-id>
          . URL: https://doi.org/10.1007/s42979-023-01942-7. doi:
          <pub-id pub-id-type="doi">10.1007/S42979-023-01942-7</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>