<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aniket Deroy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Subhankar Maity</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IIT Kharagpur</institution>
          ,
          <addr-line>Kharagpur</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Code-mixing, the integration of lexical and grammatical elements from multiple languages within a single sentence, is a widespread linguistic phenomenon, particularly prevalent in multilingual societies. In India, social media users frequently engage in code-mixed conversations using the Roman script, especially among migrant communities who form online groups to share relevant local information. This paper focuses on the challenges of extracting relevant information from code-mixed conversations, specifically within Roman transliterated Bengali mixed with English. This study presents a novel approach to address these challenges by developing a mechanism to automatically identify the most relevant answers from code-mixed conversations. We have experimented with a dataset comprising queries and documents from Facebook, and Query Relevance files (QRels) to aid in this task. Our results demonstrate the effectiveness of our approach in extracting pertinent information from complex, code-mixed digital conversations, contributing to the broader field of natural language processing in multilingual and informal text environments. We use GPT-3.5 Turbo via prompting, along with the sequential nature of relevant documents, to frame a mathematical model that helps detect documents relevant to a query.</p>
      </abstract>
      <kwd-group>
        <kwd>GPT</kwd>
        <kwd>Relevance</kwd>
        <kwd>Code Mixing</kwd>
        <kwd>Probability</kwd>
        <kwd>Prompt Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Code-mixing, where elements from multiple languages are blended within a single sentence, is a natural
and widespread phenomenon in multilingual societies [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Code-mixing is a global phenomenon
where speakers often switch between languages depending on context, audience, and medium of
communication [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. With the rapid rise of online social networking, this practice has become increasingly
common in digital conversations, where users frequently combine their native languages with others,
often using foreign scripts [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        One notable trend in India is the use of the Roman script to communicate in native languages on social
media platforms [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This practice is especially common among migrant communities who form online
groups to share information and experiences relevant to their unique circumstances [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For instance,
Bengali speakers from West Bengal who have migrated to urban centers like Delhi or Bangalore often
establish groups such as "Bengali in Delhi" on platforms like Facebook and WhatsApp. These groups
serve as vital hubs for exchanging advice on a wide range of local issues, from housing and employment
to navigating new social environments.
      </p>
      <p>
        The COVID-19 pandemic highlighted the importance of these online communities as critical sources
of information [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. During this period, these groups became essential for sharing experiences, seeking
support, and keeping up with the frequently changing government guidelines. However, the
informal and often colloquial nature of the language used in these code-mixed conversations, typically
transliterated into Roman script, presents significant challenges for information retrieval. The lack of
standardization, combined with the blending of languages, makes it difficult to identify and extract
relevant answers, especially for those who might seek similar information at a later time [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        This paper addresses the challenge of extracting relevant information from code-mixed digital
conversations, with a specific focus on Roman transliterated Bengali mixed with English. While
code-mixing is a well-recognized phenomenon in natural language processing (NLP), the unique
characteristics of transliterated text—such as variations in spelling, grammar, and syntax—complicate
the task of effective information retrieval [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To tackle this issue, we have developed a mechanism that
identifies the most relevant answers from these complex, multilingual discussions.
      </p>
      <p>We begin experimenting with a dataset of code-mixed conversations collected from Facebook, which
has been carefully annotated to reflect query relevance (QRels). This dataset forms the basis of our
study and is crucial for evaluating the effectiveness of our approach.</p>
      <p>
        We leverage GPT-3.5 Turbo [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] by employing carefully designed prompts that guide the model
to evaluate the relevance of documents with respect to a given query. This involves not only the
semantic understanding capabilities of GPT-3.5 Turbo but also the strategic use of the sequential nature
of documents. Often, documents are part of a series or a conversation where the relevance to a query
can be influenced by preceding or succeeding documents. By acknowledging this sequence, we can
better capture contextual relationships that might be missed if documents were considered in isolation.
      </p>
      <p>To formalize this process, we integrate GPT-3.5 Turbo’s outputs into a mathematical model. This
model takes into account the sequential dependencies among documents, treating the task of relevance
detection as a problem of finding the optimal path or chain of relevance across the sequence.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Code-mixing and transliteration have gained increasing attention in the field of natural language
processing (NLP), especially as global communication becomes more digital and multilingual [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ].
This section reviews key studies related to code-mixing, information retrieval from code-mixed text,
and the challenges of processing Roman transliterated languages, particularly in the context of Indian
languages. Code-mixing, where speakers blend elements from multiple languages within a single
utterance, is a common linguistic phenomenon in multilingual societies [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Early studies on
code-mixing focused primarily on sociolinguistic aspects, examining how and why speakers switch languages
within conversations [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. However, with the advent of digital communication, researchers
have increasingly turned their attention to computational methods for processing and understanding
code-mixed text [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        Several studies have explored various NLP tasks, such as part-of-speech tagging, language
identification, and sentiment analysis, in code-mixed settings [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] provided one of the earliest comprehensive
analyses of code-mixed text, highlighting the unique challenges it poses for traditional NLP pipelines,
such as non-standard spelling, syntax variations, and the blending of multiple languages within a single
text. More recent work by [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] introduced a code-mixed dataset, spanning multiple Indian languages,
which has become a benchmark for evaluating NLP models in this domain.
      </p>
      <p>Information retrieval (IR) in code-mixed settings is relatively underexplored compared to other
NLP tasks [18]. However, the need for effective IR systems that can handle multilingual and
code-mixed queries has become increasingly important, particularly in the context of digital information
exchange on social media platforms. [19] investigated the problem of query-focused summarization
in code-mixed social media data, emphasizing the complexity of extracting relevant information from
noisy, informal text. Work by [20] addressed code-mixed question answering, where the goal is to
identify correct responses from a mixed-language corpus. Their approach involved using translation
models to standardize the text before applying traditional IR techniques, demonstrating that even simple
translation-based methods can significantly improve performance. However, these methods often fail
to capture the nuances of code-mixed language, such as cultural context and colloquial expressions.</p>
      <p>Roman script transliteration of Indian languages, commonly referred to as "Romanagari" [21] for
languages like Hindi, is a widespread practice in digital communication. Transliteration introduces
additional challenges for NLP, as it often involves non-standard spellings and inconsistent usage. For
instance, multiple transliterations may exist for the same word, depending on the speaker’s regional
accent, literacy in the original script, or personal preference.</p>
      <p>Notable eforts in this area include the work by [ 22], which explored transliteration normalization
for Hindi-English code-mixed text. They developed algorithms to map Romanized text back to its
original script, enabling more accurate processing by traditional NLP models. However, normalization
remains a challenging task due to the inherent variability in transliterated text. In the context of
Bengali, the Roman script transliteration is less standardized than for Hindi, leading to even greater
variability in spelling and grammar. [23] addressed this issue by creating a Roman Bengali dataset and
proposed methods for transliteration normalization and language identification. Their work highlights
the difficulties of processing Roman Bengali and the need for specialized approaches tailored to the
characteristics of the language.</p>
      <p>While these studies provide valuable insights into code-mixing, transliteration, and information
retrieval, there is a noticeable gap in addressing the specific challenges of extracting relevant information
from code-mixed conversations in Roman transliterated Bengali. Our work builds on the foundations
laid by previous research but focuses on the unique intersection of these challenges in a real-world
context. By developing a mechanism to identify relevant answers in code-mixed discussions, we aim
to contribute to the growing body of research on multilingual NLP and enhance the accessibility of
information in linguistically diverse online communities.</p>
      <p>Large Language Models (LLMs) [24, 25, 26] like GPT-3 have shown promise in various NLP tasks,
including language identification (LI). Previous works have demonstrated the capability of GPT-3 in performing zero-shot
and few-shot learning, making it a potentially powerful tool for LI in resource-constrained settings.
However, the application of LLMs [27, 28, 29, 30] to code-mixed and morphologically rich languages
remains underexplored. Recent studies have started to explore the use of transformers and pre-trained
models for multilingual LI, but the effectiveness of these models for Bengali requires further
investigation.</p>
      <p>This section places our work within the context of existing research, highlighting the contributions
of prior studies while identifying gaps that our research aims to fill.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>This shared task consists of a single dataset [31] for code-mixed information retrieval. The corpus
consists of 107,900 documents and 20 queries in the training set; there are 30 queries
in the test set. The dataset is in Roman transliterated Bengali mixed with English.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Task Definition</title>
      <p>The task is to automatically determine the relevance of a query to a document within code-mixed
data, specifically focusing on English and Roman transliterated Bengali.</p>
      <p>Given a query and a document, the goal is to classify whether the query is relevant or not relevant
to the document. Based on this relevance, the documents are then ranked. This involves handling
the complexities of code-mixing, where elements from both languages are used within the same
text, and dealing with the informal and non-standardized nature of the language. The system must
accurately capture the semantic relationship between the query and the document despite these linguistic
challenges.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <sec id="sec-5-1">
        <title>5.1. Why Prompting?</title>
        <p>Prompting [32] for information retrieval is a burgeoning approach that leverages large language models
(LLMs) to enhance the retrieval of relevant information from complex, unstructured data, such as
code-mixed text or informal online conversations [32]. Below are several reasons why prompting is
becoming an effective strategy in information retrieval (IR):
- Handling Ambiguity and Contextual Nuances: Traditional IR systems often struggle with
understanding the nuanced language, ambiguity, and context found in unstructured or informal
text, such as code-mixed conversations. Prompting LLMs allows these models to interpret context
more effectively by guiding them to generate or rank responses that are contextually appropriate,
even when dealing with code-mixing or informal language structures [33]. By crafting specific
prompts, users can elicit more relevant and accurate results that account for the complexities of
the input text.
- Enhanced Language Understanding: Large language models like GPT-3.5 are pre-trained on
vast datasets that include a variety of languages and dialects [34]. This extensive training enables
them to understand and generate text across different languages and contexts [34]. By using
prompting, these models can be directed to focus on the most relevant aspects of a query or
document, improving the retrieval process even in multilingual and code-mixed scenarios. For
example, when retrieving information from Roman transliterated Bengali mixed with English, an
LLM can be prompted to recognize and process the code-mixed language more effectively than
traditional IR systems.
- Adaptability to Informal and Unstructured Text: Prompting allows LLMs to adapt to the
informal and often unstructured nature of social media text [35], which is common in online
communities. This flexibility is particularly beneficial when dealing with code-mixed or transliterated
text, where the lack of standardization poses a challenge to conventional IR techniques. Prompted
language models can generate or filter responses that align more closely with the informal tone
and style of the original text, thereby improving the relevance of the retrieved information.
- Reduction of Noise and Irrelevance: One of the major challenges in IR is filtering out irrelevant
or noisy data, especially in informal online conversations where off-topic or redundant information
is common. By using targeted prompts, LLMs can be instructed to prioritize certain types of
information, such as direct answers to specific questions, while de-emphasizing or ignoring
irrelevant content [36]. This leads to a more efficient and effective retrieval process, particularly
in environments where users are seeking specific answers within a sea of mixed and informal
language.
- Scalability and Customization: Prompting for information retrieval offers scalability and
customization that traditional IR systems might lack. By designing prompts tailored to specific
contexts or types of queries, LLMs can be dynamically adjusted to meet the needs of different
retrieval tasks [36]. This customization is particularly useful in handling domain-specific language
or code-mixed scenarios, where standard IR systems might require extensive re-training or
reconfiguration.
- Real-Time Processing and Interaction: In real-time communication platforms, the ability
to quickly retrieve relevant information based on ongoing conversations is crucial. Prompting
enables LLMs to process and respond to queries in real time, enhancing the interactivity and
responsiveness of the IR system [36]. This is especially beneficial in scenarios where users are
engaged in active discussions and require immediate, contextually relevant information.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Merging Prompt and Mathematical Model-Based Approaches</title>
        <p>We used the GPT-3.5 Turbo model via prompting through the OpenAI API
(https://platform.openai.com/docs/models/gpt-3-5-turbo) to solve the document
retrieval task. The process begins by first converting all the code-mixed sentences to English for both the
queries and the documents. To then determine the relevance scores, we used the following
prompt:</p>
        <p>"Given the query &lt;query&gt; and the document &lt;document&gt;, find how relevant is the query to the document
based on semantic similarity. Provide a relevance score between 0 and 1. Only state the score."</p>
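        <p>A minimal sketch of this prompting step is shown below, assuming the OpenAI Python client; the exact client interface varies across library versions, and parsing the reply directly as a float is our simplification.</p>
        <preformat>
```python
# Sketch: querying GPT-3.5 Turbo with the relevance prompt above.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.

PROMPT = ('Given the query "{query}" and the document "{document}", '
          "find how relevant is the query to the document based on "
          "semantic similarity. Provide a relevance score between 0 "
          "and 1. Only state the score.")

def relevance_score(query, document, temperature=0.7):
    from openai import OpenAI  # imported here; install `openai` first
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=temperature,
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query,
                                            document=document)}],
    )
    # The prompt asks the model to state only the score.
    return float(resp.choices[0].message.content.strip())
```
        </preformat>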
        <p>After the prompt is provided to the LLM, the following steps happen internal to the LLM while
generating the output. The following outlines the steps that occur internally within the LLM,
summarizing the prompting approach using GPT-3.5 Turbo:</p>
        <sec id="sec-5-2-1">
          <title>Steps 1 and 2: Tokenization and Embedding</title>
          <p>• The input text (prompt) is first tokenized into smaller units called tokens. These tokens are often
subwords or characters, depending on the model’s design.
• Prompt: P = [w_1, w_2, . . . , w_n]
• Tokenized Input: T = [t_1, t_2, . . . , t_n]
• Each token is converted into a high-dimensional vector (embedding) using an embedding matrix E.
• Embedding Matrix: E ∈ R^{|V| × d}, where |V| is the size of the vocabulary and d is the embedding
dimension.</p>
          <p>• Embedded Tokens: X_emb = [E(t_1), E(t_2), . . . , E(t_n)]</p>
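          <p>A toy illustration of these two steps (with a made-up vocabulary and embedding dimension) can be written as:</p>
          <preformat>
```python
# Steps 1-2 sketch: token IDs looked up in an embedding matrix
# E of shape (|V|, d). Vocabulary, d, and values are illustrative.
import numpy as np

vocab = {"kivabe": 0, "bhalo": 1, "ingreji": 2, "shikhbo": 3}
V, d = len(vocab), 8                   # vocabulary size, embedding dim
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))            # embedding matrix

tokens = ["kivabe", "bhalo", "ingreji", "shikhbo"]  # Step 1: tokenize
ids = [vocab[t] for t in tokens]
X_emb = E[ids]                         # Step 2: embedding lookup
```
          </preformat>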
        </sec>
        <sec id="sec-5-2-2">
          <title>Step 3: Positional Encoding</title>
          <p>• Since the model processes sequences, it adds positional information to the embeddings to capture
the order of tokens.
• Positional Encoding: PE(i)
• Input to the Model: X = X_emb + PE</p>
        </sec>
        <sec id="sec-5-2-3">
          <title>Step 4: Attention Mechanism (Transformer Architecture)</title>
          <p>• Attention Score Calculation: The model computes attention scores to determine the importance
of each token relative to others in the sequence.
• Attention Formula:</p>
          <p>Attention(Q, K, V) = softmax(QK^T / √d_k) V (1)
• where Q (query), K (key), and V (value) are linear transformations of the input X.
• This attention mechanism is applied multiple times through multi-head attention, allowing the
model to focus on different parts of the sequence simultaneously.</p>
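          <p>Equation (1) can be checked with a small NumPy sketch, where random Q, K, V stand in for the learned projections:</p>
          <preformat>
```python
# Step 4 sketch: scaled dot-product attention, Eq. (1).
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (n, n) attention scores
    return softmax(scores) @ V        # weighted sum of values

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out = attention(Q, K, V)
```
          </preformat>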
        </sec>
        <sec id="sec-5-2-4">
          <title>Step 5: Feedforward Neural Networks</title>
          <p>• The output of the attention mechanism is passed through feedforward neural networks, which
apply non-linear transformations.
• Feedforward Layer:</p>
          <p>FFN(x) = max(0, xW_1 + b_1) W_2 + b_2 (2)
• where W_1, W_2 are weight matrices and b_1, b_2 are biases.</p>
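          <p>Equation (2) corresponds to a two-layer network with a ReLU in between; a small sketch with made-up shapes:</p>
          <preformat>
```python
# Step 5 sketch: position-wise feedforward layer, Eq. (2).
import numpy as np

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU, then linear

rng = np.random.default_rng(0)
d, d_ff = 8, 32                       # model dim, hidden dim (made up)
x = rng.normal(size=(4, d))
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
y = ffn(x, W1, b1, W2, b2)            # same shape as x
```
          </preformat>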
          <p>Step 6: Stacking Layers
• Multiple layers of attention and feedforward networks are stacked, each with its own set of
parameters. This forms the "deep" in deep learning.
• Layer Output:
z^(l) = LayerNorm(x^(l) + Attention(Q^(l), K^(l), V^(l))) (3)
x^(l+1) = LayerNorm(z^(l) + FFN(z^(l))) (4)</p>
          <p>Step 7: Output Generation
• The final output of the stacked layers is a sequence of vectors.
• These vectors are projected back into the token space using a softmax layer to predict the next
token or word in the sequence.
• Softmax Function:
P(t_i | x) = exp(z_i) / Σ_{j=1}^{|V|} exp(z_j) (5)
• where z_i is the logit corresponding to token t_i in the vocabulary.
• The model generates the next token in the sequence based on the probability distribution, and
the process repeats until the end of the output sequence is reached.</p>
        </sec>
        <sec id="sec-5-2-5">
          <title>Step 8: Decoding</title>
          <p>• The predicted tokens are then decoded back into text, forming the final output.</p>
          <p>• Output Text: O = [o_1, o_2, . . . , o_m]</p>
          <p>After obtaining the relevance score, we used the following mathematical formulation to account for
the sequential presence of relevant documents. This can be written as follows:
P(d_{i+1} | d_i) =
  Score(d_{i+1}), if Score(d_{i+1}) &lt; 0.3 and d_i is relevant
  Score(d_{i+1}), if i = 0 (the first document)
  0.2 + Score(d_{i+1}), if Score(d_{i+1}) ≥ 0.3 and d_i is relevant
  Score(d_{i+1}), otherwise</p>
          <p>This equation reflects that if the score of the current document d_{i+1} is less than 0.3 and the
previous document is relevant, the probability of the current document being relevant is simply equal
to the relevance score of the current document.</p>
          <p>If the previous document is relevant and the score of the current document d_{i+1} is greater than or equal
to 0.3, then the probability that the current document is relevant is 0.2 plus the score of the current document.
For the first document, the probability is equal to its relevance score. In all other
situations, the probability is equal to the relevance score of the current document. If the probability score
of a particular document is greater than 0.5, we consider the document to be relevant to the query. In
this way, we identify all documents that are relevant to a query.</p>
          <p>An example of the mathematical formulation and how it helps to detect relevant documents is shown
in Table 1. The table shows a range of documents for a code-mixed query. The relevance scores help
identify how closely each document addresses the query, while the probability scores provide insight
into the potential usefulness of the documents based on the provided content. Overall, Documents 1-4
stand out as particularly relevant based on probability scores.</p>
          <p>For the five results reported, we ran the GPT model at different temperature values, namely 0.5, 0.6,
0.7, 0.8, and 0.9. The diagram for GPT-3.5 Turbo is shown in Figure 1. The figure representing the
methodology is shown in Figure 2.</p>
          <p>At lower temperatures, the model’s responses are more deterministic and focused. It generates
outputs that are likely to be relevant and closely aligned with the input, making it useful for tasks
requiring precision, such as retrieving specific information or handling queries with clear intent.
Higher temperatures result in highly diverse and less predictable outputs, which can be useful in exploratory
tasks where creativity and variation are needed, but may also risk generating less coherent or relevant
responses. In code-mixed scenarios, this could capture the full spectrum of linguistic creativity but
might require careful handling to ensure relevance. So we used a temperature range of 0.5 to 0.9.</p>
          <p>[Table 1: Candidate documents for the code-mixed query "Kivabe bhalo bhabe ingreji shikhbo?"
("How do I learn English well?"), e.g., "Ingreji songs shunle shikha sohoj hoy.",
"Ingreji film dekhle vocabulary bere jaye.", and "Conversation practice korao khub helpful.",
together with their relevance and probability scores.]</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In conclusion, this study addresses the critical challenges of extracting relevant information from
code-mixed conversations, specifically within Roman transliterated Bengali mixed with English. This
linguistic phenomenon is prevalent among migrant communities in India, who often rely on social media
platforms to share and seek vital information, especially during crises like the COVID-19 pandemic.
The informal and non-standardized nature of these conversations presents unique difficulties for
information retrieval. To tackle these challenges, we developed a novel approach that leverages the
GPT-3.5 Turbo model in conjunction with a sequential engineering approach, achieving notable success
in retrieving pertinent answers from complex, code-mixed digital conversations. The effectiveness of
our method is demonstrated through the results on the test set documents and queries, which provides
a valuable resource for future research in natural language processing within multilingual and informal
text environments. This work contributes to enhancing information accessibility for marginalized
communities, underscoring the potential of advanced AI models in bridging communication gaps in
diverse linguistic landscapes. We observe that the GPT-3.5 model, combined with the mathematical formulation
approach, performs well for the task of code-mixed information retrieval, though there is scope for
improvement.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: drafting content, grammar
and spelling check, etc. After using this tool/service, the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
      <p>code-mixing: The role of linguistic theory based synthetic data, in: Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp.
1543–1553.
[18] U. Barman, Automatic processing of code-mixed social media content, Ph.D. thesis, Dublin City</p>
      <p>University, 2019.
[19] D. Gupta, A. Ekbal, P. Bhattacharyya, A semi-supervised approach to generate the code-mixed
text using pre-trained encoder and transfer learning, in: T. Cohn, Y. He, Y. Liu (Eds.), Findings
of the Association for Computational Linguistics: EMNLP 2020, Association for Computational
Linguistics, Online, 2020, pp. 2267–2280. URL: https://aclanthology.org/2020.findings-emnlp.206.
doi:10.18653/v1/2020.findings-emnlp.206.
[20] K. R. Chandu, A. W. Black, Style variation as a vantage point for code-switching, arXiv preprint
arXiv:2005.00458 (2020).
[21] R. Mhaiskar, Romanagari an alternative for modern media writings, Bulletin of the Deccan College</p>
      <p>Research Institute 75 (2015) 195–202.
[22] K. Bali, J. Sharma, M. Choudhury, Y. Vyas, “i am borrowing ya mixing?" an analysis of english-hindi
code mixing in facebook, in: Proceedings of the first workshop on computational approaches to
code switching, 2014, pp. 116–126.
[23] B. Sarkar, N. Sinhababu, M. Roy, P. K. D. Pramanik, P. Choudhury, Mining multilingual and
multiscript twitter data: unleashing the language and script barrier, International Journal of
Business Intelligence and Data Mining 16 (2020) 107–127.
[24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised
multitask learners, in: OpenAI Blog, volume 1, 2019.
[25] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning
Research 21 (2020) 1–67.
[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, in: arXiv preprint arXiv:1907.11692,
2019.
[27] W. X. Zhao, K. Zhou, J. Li, X. Tang, J. J. Wang, J. Liu, T. Wang, Y. Bao, J.-R. Wen, A survey of large
language models, in: arXiv preprint arXiv:2303.18223, 2023.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin,</p>
      <p>Attention is all you need, Advances in neural information processing systems 30 (2017) 5998–6008.
[29] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Fine-tuning gpt-2 for human-like
text generation, in: arXiv preprint arXiv:1907.11692, 2019.
[30] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, Y. Choi, Defending against
neural fake news, in: Advances in Neural Information Processing Systems, volume 32, 2019, pp.
9054–9065.
[31] S. Chanda, S. Pal, The effect of stopword removal on information retrieval for code-mixed
data obtained via social media, SN Comput. Sci. 4 (2023) 494. URL: https://doi.org/10.1007/
s42979-023-01942-7. doi:10.1007/S42979-023-01942-7.
[32] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic
survey of prompting methods in natural language processing, ACM Computing Surveys 55 (2023)
1–35.
[33] P. Singh, M. Patidar, L. Vig, Translating across cultures: Llms for intralingual cultural adaptation,
arXiv preprint arXiv:2406.14504 (2024).
[34] G. Yenduri, M. Ramalingam, G. C. Selvi, Y. Supriya, G. Srivastava, P. K. R. Maddikunta, G. D.</p>
      <p>Raj, R. H. Jhaveri, B. Prabadevi, W. Wang, et al., Gpt (generative pre-trained transformer)–a
comprehensive review on enabling technologies, potential applications, emerging challenges, and
future directions, IEEE Access (2024).
[35] G. E. Zgheib, N. Dabbagh, Social media learning activities (smla): Implications for design., Online</p>
      <p>Learning 24 (2020) 50–66.
[36] J. Kaddour, J. Harris, M. Mozes, H. Bradley, R. Raileanu, R. McHardy, Challenges and applications
of large language models, arXiv preprint arXiv:2307.10169 (2023).
[37] S. Chanda, S. Pal, Overview of the shared task on code-mixed information retrieval from social
media data, in: Forum of Information Retrieval and Evaluation FIRE-2024, 2024.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Sippola</surname>
          </string-name>
          ,
          <article-title>Multilingualism and the structure of code-mixing, in: The Routledge handbook of Pidgin and Creole languages</article-title>
          ,
          <source>Routledge</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>474</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. O.</given-names>
            <surname>Aboh</surname>
          </string-name>
          ,
          <article-title>Lessons from neuro-(a)-typical brains: universal multilingualism, code-mixing, recombination, and executive functions</article-title>
          ,
          <source>Frontiers in psychology 11</source>
          (
          <year>2020</year>
          )
          <fpage>488</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>De Swaan</surname>
          </string-name>
          ,
          <article-title>Words of the world: The global language system</article-title>
          , John Wiley &amp; Sons,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Multilingual resources and practices in digital communication, in: The Routledge handbook of language and digital communication</article-title>
          ,
          <source>Routledge</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shivani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <article-title>Hatred and trolling detection transliteration framework using hierarchical lstm in code-mixed social media text</article-title>
          ,
          <source>Complex &amp; Intelligent Systems</source>
          <volume>9</volume>
          (
          <year>2023</year>
          )
          <fpage>2813</fpage>
          -
          <lpage>2826</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Komito</surname>
          </string-name>
          ,
          <article-title>Social media and migration: Virtual community 2.0</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>62</volume>
          (
          <year>2011</year>
          )
          <fpage>1075</fpage>
          -
          <lpage>1086</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Meurer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Waldkirch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Schou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Bucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Burmeister-Lamp</surname>
          </string-name>
          ,
          <article-title>Digital affordances: How entrepreneurs access support in online communities during the covid-19 pandemic</article-title>
          ,
          <source>Small Business Economics</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <article-title>Natural language processing for information retrieval</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>39</volume>
          (
          <year>1996</year>
          )
          <fpage>92</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Janse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vassalou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papazachariou</surname>
          </string-name>
          ,
          <article-title>Variation in the vowel system of mišótika cappadocian: Findings from two refugee villages in greece</article-title>
          ,
          <source>in: 13th International Conference on Greek Linguistics</source>
          , University of Westminster,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , arXiv preprint arXiv:2005.14165 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jauhiainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jauhiainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Linden</surname>
          </string-name>
          ,
          <article-title>A survey on automatic language identification in written texts</article-title>
          ,
          <source>in: Journal of Artificial Intelligence Research</source>
          , volume
          <volume>65</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>675</fpage>
          -
          <lpage>782</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Muthusamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Cole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Oshika</surname>
          </string-name>
          ,
          <article-title>Automatic language identification: A review/tutorial</article-title>
          ,
          <source>in: IEEE Signal Processing Magazine</source>
          , volume
          <volume>11</volume>
          ,
          <year>1994</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <article-title>Sentiment analysis of code-mixed social media text (sa-cmsmt) in indian languages</article-title>
          ,
          <source>in: 2021 International Conference on Computing Sciences (ICCS)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Hidayatullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Qazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T. C.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Apong</surname>
          </string-name>
          ,
          <article-title>A systematic review on language identification of code-mixed text: techniques, data availability, challenges, and framework development</article-title>
          ,
          <source>IEEE access 10</source>
          (
          <year>2022</year>
          )
          <fpage>122812</fpage>
          -
          <lpage>122831</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G. I.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Singla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Reshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Salameh</surname>
          </string-name>
          ,
          <article-title>Machine learning techniques for sentiment analysis of code-mixed and switched indian social media text corpus: A comprehensive review</article-title>
          ,
          <source>International Journal of Advanced Computer Science and Applications</source>
          <volume>13</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Hidayatullah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Qazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T. C.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Apong</surname>
          </string-name>
          ,
          <article-title>A systematic review on language identification of code-mixed text: techniques, data availability, challenges, and framework development</article-title>
          ,
          <source>IEEE access 10</source>
          (
          <year>2022</year>
          )
          <fpage>122812</fpage>
          -
          <lpage>122831</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pratapa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bhat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sitaram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dandapat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          , Language modeling for
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>