<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Article Summarization using Pre-Trained Models on Tamil, English, Gujarati and Bengali</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tanisha Sriram</string-name>
          <email>tanisha2310538@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ananya Raman</string-name>
          <email>ananya2310278@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sowmya Anand</string-name>
          <email>sowmya2310543@ssn.edu.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Durairaj Thenmozhi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Indian Languages, Automatic Text Summarization</institution>
          ,
          <addr-line>Article Summarization, Bengali, English, Gujarati, Tamil</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Kalavakkam, Chennai-603110</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>2</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper explores machine learning models for the Indian Language Summarization (ILSUM 2024) shared task, with a specific focus on generating summaries from news articles in four languages: Bengali, English, Gujarati, and Tamil. Representing team "SynopSizers" in this task, we addressed the underrepresentation of Indian languages in NLP, particularly in text summarization. While large-scale datasets are abundantly available for languages like English and French, Indian languages remain severely underrepresented in NLP modelling, specifically in the field of text summarization. The central aim is to address and narrow this gap. A key challenge of this process was the presence of code-mixing and script-mixing, where English phrases and Latin scripts were embedded in articles written in Indian languages. Popular English-trained models struggled with these challenges, which necessitated the use of multilingual models. Several models were tested and trained during the process and evaluated using standard ROUGE metrics. Among the models tested, an extractive frequency-based model demonstrated the most consistent performance across all languages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Natural Language Processing (NLP) has seen huge leaps, transforming how we
interact and understand text-based data. It has integrated itself into the way we learn and process, from
basic tokenization to more complex processes like detecting hate speech, retrieving and summarizing
legal documents, analyzing sentiment, and identifying fake news [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], to name a few. The accuracy
of machines imitating humans has reached a scarily stunning level [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. And, with the sheer volume
of digital content, be it social media, magazines, or even newspapers, NLP plays an important role in
language comprehension as well. Thus, NLP models play an important role in text summarization,
which focuses on distilling large amounts of information into summaries [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This allows human readers
to grasp concepts briefly and concisely.
      </p>
      <p>
        Extensive research and development have gone into languages like English, Chinese, German, French,
and Spanish, having large-scale datasets and advanced models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Unfortunately, the same cannot
be said for Indian languages — very little attention has been given to these languages. Despite the
millions who speak these languages, efforts in creating effective NLP tools for them, particularly for
Automatic Text Summarization (ATS) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], remain scarce. Most available datasets are either too small or
inaccessible to the public, limiting their utility for meaningful research and development [
        <xref ref-type="bibr" rid="ref1 ref6">1, 6</xref>
        ].
      </p>
      <p>
        In an attempt to narrow this chasm, the Indian Language Summarization (ILSUM) shared task
was initiated. For the ILSUM 2024 edition, the dataset (publicly available corpora specifically for
summarization) has been compiled from leading national newspapers and features more than 15,000
article-headline pairs for each language, including Bengali, English, Gujarati, and Tamil. However, these
datasets contain the presence of code-mixing and script-mixing, where English phrases and Latin scripts
are interwoven with Indian-language content [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Tackling these challenges requires a nuanced
approach to the rich linguistic diversity that Indian languages represent.
      </p>
      <p>This research aims to foster the development of NLP tools that can handle the complexities of
multilingual and code-mixed content, thus making an attempt to pave the way for more inclusive and
wide-reaching innovations in the field of natural language processing.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        The following are some of the research papers that were referred to while working on this task.
Text summarization for Indian languages [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] paper by Aishwarya Krishnakumar et al. explores the
evolution of text summarization, from ancient uses to modern NLP models. While summarization is
advanced for English, Indian languages are underrepresented. The authors, participating in the FIRE
2022 ILSUM task, address this gap by comparing models like mT5_m2m_CrossSum, XL-Sum, and BERT
for code-mixed text summarization in English, Gujarati, and Hindi. They found that
mT5_m2m_CrossSum produced the most accurate summaries, earning a top-ten validation set ranking for each language.
This work highlights the effectiveness of mT5-based models for multilingual summarization in Indian
languages.
      </p>
      <p>
        A paper on text summarization techniques by Allahyari et al. (2017) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] provides a comprehensive
review of automatic text summarization techniques, addressing the growing need for concise
representations of vast text data from the Internet and other digital sources. The authors examine a
range of summarization methods, particularly focusing on extractive approaches for both single- and
multi-document summarization. These methods include topic modeling, frequency-based strategies,
graph-based approaches, and machine learning techniques, each evaluated for their effectiveness and
limitations in different contexts. The paper emphasizes the challenges in automatic summarization due
to the lack of human-like language understanding in machines and highlights significant advancements
and trends in the field, offering a valuable state-of-the-art overview of summarization technology.
      </p>
      <p>
        Hahn and Mani (2000) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] explored the complexities of creating coherent summaries from diverse
sources, given the explosion of online information in their paper. Existing extraction-based tools
like Microsoft’s AutoSummarize are limited in coherence and scope. The authors discuss
knowledge-poor and knowledge-rich methods—basic rules versus extensive background knowledge—to enhance
summary quality. Summaries are classified as extracts or abstracts, with functions such as indicative,
informative, or critical, and a growing focus on user-specific needs. They highlight key challenges,
including summarizing non-textual media, multiple sources, and achieving high compression rates,
essential for advancing summarization tools.
      </p>
      <p>
        Awasthi et al. (2021) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] provided an overview of extractive and abstractive methods in automatic text
summarization in their paper on natural language processing. They emphasized unsupervised extractive
approaches, including K-Means clustering for sentence selection and the SummCoder framework, which
ranks sentences based on relevance and novelty. The study also discusses EdgeSumm, a graph-based
method using nouns as nodes for text representation. This work highlights the need for effective
summarization techniques to manage the growing volume of online information and the critical role of
NLP in advancing these methods.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Exploration on Summarization</title>
      <p>Understanding the types of text summarization is crucial before delving into Natural Language
Processing (NLP) for several reasons. Different summarization types (extractive vs. abstractive) require distinct
approaches and algorithms. By understanding these differences, we can choose the most suitable models
and techniques for our specific needs, leading to more effective and efficient NLP solutions. The
different types of text summarization are represented in Figure 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Based on Output</title>
        <p>Based on the output produced, summarization is broadly classified as extractive, which selects the
most important sentences directly from the source text, and abstractive, which generates new sentences
that paraphrase the source content.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Task Description and Dataset</title>
      <p>
        The aim of the task is to generate a meaningful fixed-length summary, either extractive or abstractive,
for each article. The dataset for this task is built using article and headline pairs from several leading
newspapers of the country. Table 1 presents the distribution of data across training, validation, and
test sets for four languages—Bengali, English, Gujarati, and Tamil. It highlights the number of records
allocated to each phase for each language, providing insight into the dataset’s structure for model
training and performance evaluation in a multilingual context. The train and validation datasets contained id,
Heading, Summary, and Article fields for each language, whereas the test dataset contained only id, Heading,
and Article. More details about the dataset are presented in Table 1. The overview of the task can be
found in Findings of the First Shared Task on Indian Language Summarization (ILSUM): Approaches,
Challenges and the Path Ahead [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and FIRE 2022 ILSUM Track: Indian Language Summarization [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
More details on the dataset and additional documents [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] were also referred.
      </p>
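      <p>The split structure described above can be sketched as a small loader; the example rows, file contents, and helper name below are purely hypothetical illustrations of the id/Heading/Summary/Article layout, not the actual ILSUM files:</p>

```python
import csv
import io

# Hypothetical example content; the real ILSUM CSV files are not reproduced here.
TRAIN_CSV = "id,Heading,Summary,Article\n1,Sample heading,Short summary.,Full article text.\n"

def load_split(text, is_test=False):
    """Parse one dataset split; test files lack the Summary column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    expected = ['id', 'Heading', 'Article'] if is_test else ['id', 'Heading', 'Summary', 'Article']
    for row in rows:
        missing = [key for key in expected if key not in row]
        if missing:
            raise ValueError('missing columns: ' + ', '.join(missing))
    return rows

rows = load_split(TRAIN_CSV)
```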
    </sec>
    <sec id="sec-5">
      <title>5. Methodology</title>
      <p>The articles are split into individual sentences. Each sentence is transformed into a vector representation,
and a similarity matrix is created by comparing the vectors. This matrix forms the basis of a graph where
sentences are nodes and edges represent sentence similarity. A ranking algorithm, such as PageRank,
is applied to the graph to rank the sentences based on their importance. Finally, the highest-ranked
sentences are selected to create a concise summary of the original text. The basic flow of Summarization
is given in Figure 2.</p>
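      <p>The pipeline above can be sketched in a few lines; this is a minimal illustration and not the exact code used, assuming simple bag-of-words sentence vectors, cosine similarity, and a basic PageRank power iteration:</p>

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in set(a).intersection(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def summarize(text, k=2, iters=30, d=0.85):
    """Rank sentences with a PageRank-style power iteration over a similarity graph."""
    sentences = [s.strip() for s in re.split(r'[.!?]\s+', text) if s.strip()]
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Similarity matrix: sentences are nodes, similarities are edge weights.
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(sim[j])
                if sim[j][i] > 0 and out > 0:
                    rank += d * scores[j] * sim[j][i] / out
            new.append((1 - d) / n + rank)
        scores = new
    # Keep the top-k ranked sentences, restored to document order.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return ' '.join(sentences[i] for i in sorted(top))
```

<p>In a real system, the bag-of-words vectors would typically be replaced with TF-IDF or embedding vectors, but the graph construction and ranking steps remain the same.</p>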
      <sec id="sec-5-1">
        <title>5.1. Pre-processing</title>
        <p>We applied careful pre-processing to the text data in the different languages: Bengali, Tamil, Gujarati,
and English. This preserved the quality and uniformity required for efficient summarization in our
experiments. The raw text files were read in binary mode to avoid encoding problems, and the content
was decoded using UTF-8 with error handling to replace any problematic characters.</p>
        <p>We used a cleaning function to remove extraneous white space and words that did not belong to
the target language. Using regular expressions, we replaced all sequences of whitespace with a single
space and stripped the leading and trailing spaces of the text. This step was essential in maintaining
the original integrity of the content while providing a clean dataset.</p>
        <p>In applying normalization techniques, we converted all text to lowercase and removed punctuation
for Tamil and the other Indic languages. This standardization was required to eliminate variability
arising from case sensitivity and non-alphanumeric characters, so that they would not interfere with
the summarization algorithms.</p>
        <p>Finally, after all of the cleaning and normalization had been applied, we saved the data as new CSV
files for easy access in subsequent phases of our research. This detailed cleaning and normalization
phase was important for obtaining the best performance from our summarization models, that is, for
generating outputs that are more accurate and contextually relevant.</p>
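        <p>As a sketch, the cleaning steps described above (binary read, UTF-8 decoding with replacement, whitespace collapsing, lowercasing, and punctuation removal) might look like the following; the function name and the ASCII punctuation set are illustrative assumptions, not the exact code used:</p>

```python
import re
import string

def clean_text(raw_bytes, lowercase=True, strip_punct=True):
    """Decode raw bytes and normalize the text as described above."""
    # Decode with UTF-8, replacing undecodable bytes instead of failing.
    text = raw_bytes.decode('utf-8', errors='replace')
    # Collapse every run of whitespace to a single space and trim the ends.
    text = re.sub(r'\s+', ' ', text).strip()
    if lowercase:
        text = text.lower()
    if strip_punct:
        # Removes ASCII punctuation only; Indic scripts pass through untouched.
        text = text.translate(str.maketrans('', '', string.punctuation))
    return text
```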
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Models</title>
        <p>We used several models for summarization, each chosen for its unique strengths and its applicability
to the different languages involved: English, Tamil, Bengali, and Gujarati.</p>
        <p>We began with SumBasic, i.e., the most basic form of summary creation through frequency analysis
of words. The model calculates how many times each word appears in the text and identifies the most
frequent ones. We selected sentences that contained these high-frequency words, with a bias toward
those that contributed the most to the understanding of the document as a whole. Although we found
SumBasic efficient and straightforward to employ, we realized that frequency alone sometimes failed to
capture textual subtleties, which often reduced the quality of the resulting summaries.</p>
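        <p>The SumBasic procedure described above (word probabilities, probability-weighted sentence scoring, and down-weighting of words already covered) can be sketched as follows; this is an illustration of the algorithm, not the exact implementation used:</p>

```python
from collections import Counter

def sumbasic(sentences, k=2):
    """Pick k sentences by average word probability, down-weighting used words."""
    words = [w for s in sentences for w in s.lower().split()]
    total = len(words)
    prob = {w: c / total for w, c in Counter(words).items()}
    chosen = []
    remaining = list(sentences)
    while remaining and len(chosen) != k:
        # Score each sentence by the average probability of its words.
        def score(s):
            toks = s.lower().split()
            return sum(prob[t] for t in toks) / len(toks) if toks else 0.0
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
        # Squaring the probabilities of covered words reduces redundancy.
        for t in set(best.lower().split()):
            prob[t] = prob[t] * prob[t]
    return chosen
```

<p>The down-weighting step is what distinguishes SumBasic from naive frequency ranking: once a frequent word has been covered, sentences repeating it become less attractive.</p>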
        <p>We now used TF-IDF, Term Frequency-Inverse Document Frequency. This model measures the
importance of each term in relation to the whole document collection. The model that we consider
consists of two main aspects- Term Frequency (TF), which counts how many times a word appears in
a document, and Inverse Document Frequency (IDF), which evaluates the importance of a word in
the whole dataset. We then scored terms by these metrics, picking those sentences that have terms
with the highest score for summary generation. This approach was able to effectively balance local
relevance with global context and thus was particularly strong in capturing the flavor of the text.
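        <p>A minimal sketch of TF-IDF sentence scoring is shown below; it assumes each sentence is treated as its own document for the IDF computation, which is one common design choice for sentence-level summarization rather than the paper's exact setup:</p>

```python
import math
from collections import Counter

def tfidf_summarize(sentences, k=1):
    """Rank sentences by summed TF-IDF weight, treating each sentence as a document."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: in how many sentences does each term occur?
    df = Counter()
    for toks in tokenized:
        for w in set(toks):
            df[w] += 1
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        score = 0.0
        for w, c in tf.items():
            # Terms occurring in every sentence get IDF 0 and contribute nothing.
            score += (c / total) * math.log(n / df[w])
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```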
        <p>We leveraged the use of the mT5 model for summarization work, which works on the transformer
architecture and has been pre-trained on various language tasks using a large multilingual dataset.
We framed summarization as a problem of text-to-text and thus allowed mT5 to natively transform
an input text into a summary. The model's self-attention mechanism allowed it to weight words and
phrases according to their contextual relations. Through fine-tuning, mT5 became highly effective at
producing coherent summaries while maintaining the original meaning and context of the text. This is
an excellent advantage over traditional extractive methods.</p>
        <p>
          We further used XLSum [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], a model specialized for long-form content. XLSum uses an encoder-decoder
architecture highly suited to understanding and summarizing large documents. We preprocessed the
input text into chunks capturing fine-grained details as well as broad themes. Training XLSum on a
wide variety of lengthy documents helped it very efficiently condense lengthy stories into nutshell
summaries without losing important contextual information. The decoder selected the sentences and
phrases most relevant to the task, ensuring the produced summaries were coherent and informative.
        </p>
        <p>We fine-tuned the variant mT5-Tamil, focusing on an exhaustive Tamil corpus while retaining
the core functionality and further enhancing its ability to understand unique syntactic and semantic
features of Tamil. With this adaptation, mT5-Tamil's summarization capability improved. The
self-attention mechanism was particularly important, as it enabled mT5-Tamil to assess and decide about
the importance of each word in its context. This thematic training permitted summaries which were
perfectly accurate and contextual, centered on the intricacies of Tamil literature and the modalities of
communication.</p>
        <p>
          We also employed a multi-Indic transformer-based model, MultiIndic [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], that is trained on
multiple Indic languages. The model's handling of linguistic nuances was very helpful in our research.
Patterns and linguistic structures found in various kinds of textual data help MultiIndic learn, creating
coherent summaries while respecting the linguistic context in which they were written. Its effectiveness
was especially clear in summarizing texts in languages with drastically divergent structures from
English.
        </p>
        <p>We also leveraged a language-specific variation of the BERT architecture for text in Tamil, which we
call Tamil-BERT [21]. The model employs a bidirectional attention mechanism that enables it to
look at words on either side of a token as it processes one token. This made it easier for Tamil-BERT
to capture the intricate relationships between words and phrases that define Tamil. This played an
important role in arriving at coherent, contextually rich summaries. Its training on Tamil datasets
exposed it to all kinds of idiomatic expressions and other nuances of the language, which further
elevated its effectiveness for summarization tasks.</p>
        <p>We further used Indic-BERT [22], which utilizes the BERT architecture to serve multiple Indic
languages. Being pre-trained on a diverse set of texts, it learned the unique characteristics of each
language. The model's bidirectional nature allowed it to process words in context, greatly improving
its capacity to generate relevant summaries. This focus on understanding the interactions of words
within the larger text made Indic-BERT particularly effective for summarization tasks in languages like
Tamil, with multilingual capabilities that ensured high-quality outputs in a wide range of contexts.</p>
        <p>Through this multi-model approach, we attempted to achieve a rich and multifaceted summarization
process, reflecting the diversity of the languages involved in our research.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Analysis</title>
      <sec id="sec-6-1">
        <title>6.1. Performance Metrics</title>
        <p>One of the main aspects of text summarization is the assessment of quality in the produced summaries.
The most commonly applied metrics to evaluate the produced summaries in this area are known as
ROUGE, or Recall-Oriented Understudy for Gisting Evaluation. ROUGE consists of a set of measures
comparing the generated summaries to one or more reference summaries prepared by human beings.
This assessment accounts for the overlap of n-grams, or contiguous sequences of words, between the
summaries generated and the reference, providing valuable insights into content coverage, fluency, and
coherence overall.</p>
        <sec id="sec-6-1-1">
          <title>ROUGE-1</title>
          <p>ROUGE-1 specifically measures the overlap of unigrams, or single words, in the generated and
reference summaries.</p>
          <p>ROUGE-1 Recall = Number of overlapping unigrams / Total unigrams in reference (1)</p>
          <p>ROUGE-1 Precision = Number of overlapping unigrams / Total unigrams in generated summary (2)</p>
          <p>ROUGE-1 F1 = 2 × Recall × Precision / (Recall + Precision) (3)</p>
          <p>In these equations, the number of overlapping unigrams counts the unigrams shared between the
generated summary and the reference summary, while the denominators give the total number of
unigrams in the reference and generated summaries, respectively. As ROUGE-1 measures general lexical
overlap, it is the foundational measure used in summarization evaluation.</p>
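          <p>Unigram precision, recall, and F1 can be computed directly from token counts; the sketch below assumes simple whitespace tokenization, whereas real evaluations typically use a dedicated ROUGE toolkit:</p>

```python
from collections import Counter

def rouge_1(generated, reference):
    """Unigram precision, recall, and F1 from clipped unigram overlap."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in either side.
    overlap = sum(min(gen[w], ref[w]) for w in set(gen).intersection(ref))
    gen_total = sum(gen.values())
    ref_total = sum(ref.values())
    precision = overlap / gen_total if gen_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return precision, recall, f1
```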
        </sec>
        <sec id="sec-6-1-2">
          <title>ROUGE-2</title>
          <p>ROUGE-2 extends the evaluation to bigrams, which gives an enriched view of the contextual relations
between consecutive words in the generated text. It uses the same precision and recall formulas, except
that they focus on bigram matching rather than individual words. By capturing the relation a bigram
holds between consecutive words, it is more effective at judging the coherence and flow of generated
summaries. The calculations follow ROUGE-1's approach, but over bigram counts.</p>
          <p>ROUGE-2 Recall = Number of overlapping bigrams / Total bigrams in reference (4)</p>
          <p>ROUGE-2 Precision = Number of overlapping bigrams / Total bigrams in generated summary (5)</p>
          <p>ROUGE-2 F1 = 2 × Recall × Precision / (Recall + Precision) (6)</p>
          <p>As shown in Equations 4, 5, and 6, the ROUGE-2 metrics are based on bigram overlaps.</p>
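          <p>Analogously to the unigram case, the bigram-based metrics can be computed from clipped bigram counts; as before, whitespace tokenization is an assumption of this sketch:</p>

```python
from collections import Counter

def rouge_2(generated, reference):
    """Bigram precision, recall, and F1 from clipped bigram overlap."""
    def bigrams(text):
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))
    gen = bigrams(generated)
    ref = bigrams(reference)
    overlap = sum(min(gen[b], ref[b]) for b in set(gen).intersection(ref))
    gen_total = sum(gen.values())
    ref_total = sum(ref.values())
    precision = overlap / gen_total if gen_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return precision, recall, f1
```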
        </sec>
        <sec id="sec-6-1-3">
          <title>ROUGE-L</title>
          <p>ROUGE-L measures the longest common subsequence (LCS) between the generated and reference
summaries, where the LCS is "the longest subsequence of matching words common to both." By
comparing word order in addition to content, ROUGE-L captures coherence in a broader sense.
ROUGE-L precision, recall, and F1 are calculated as indicated below:</p>
          <p>ROUGE-L Recall = LCS length / Total words in reference (7)</p>
          <p>ROUGE-L Precision = LCS length / Total words in generated summary (8)</p>
          <p>ROUGE-L F1 = 2 × Recall × Precision / (Recall + Precision) (9)</p>
          <p>As shown in Equations 7, 8, and 9, the ROUGE-L metrics use the longest common subsequence
(LCS) length. Here, the LCS length is computed between the generated and reference summaries, and
the denominators are the total numbers of words in each summary. ROUGE-L is very effective at
verifying the structural cohesion of generated summaries since it is sensitive to both content and
word order.</p>
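          <p>The LCS length in Equations 7–9 is computed with the standard dynamic-programming recurrence; the sketch below again assumes whitespace tokenization:</p>

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(a), len(b)
    # dp[i][j] holds the LCS length of a[:i] and b[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def rouge_l(generated, reference):
    """ROUGE-L precision, recall, and F1 based on the LCS length."""
    gen = generated.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(gen, ref)
    precision = lcs / len(gen) if gen else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
    return precision, recall, f1
```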
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Evaluation Procedure</title>
        <p>The evaluation process with ROUGE scores follows a series of systematic steps. First, researchers
gather a set of reference summaries along with the generated summaries. After tokenizing the generated
and reference summaries into their constituent n-grams, each metric counts the number of matching
n-grams.</p>
        <p>Following this, precision, recall, and F1-scores for ROUGE-1, ROUGE-2, and ROUGE-L are calculated
according to the formulas above. This ultimately produces scores for comparison over the quality of
summaries generated against the reference summaries that were used.</p>
        <p>ROUGE scores enable quantitative analysis of summarization algorithm performance, guiding the
development of better models and thus improving automated summarization in natural language
processing. This scheme provides a full evaluation framework for advancing summarization technologies
while ensuring that the produced summaries meet a baseline of accuracy and coherence.</p>
        <p>Table 2 summarizes ROUGE-1 scores for various models on four languages: Tamil, English, Gujarati,
and Bengali. The scores here measure the ability of each model to match individual words between
the generated summaries and the reference summaries. The models include classical methods such as
SumBasic and Freq Based, as well as transformer-based models like mT5, XLSum, and various
language-specific models like Tamil-BERT and Indic-BERT. The results show a variation in performance across
languages and models, with Freq Based achieving the highest ROUGE-1 scores in English and Gujarati,
while models like mT5 and XLSum perform better in certain languages. Tamil-BERT and MultiIndic
have relatively low scores, which may mean that they need further optimization for these tasks. In
general, the table indicates how various summarization techniques perform in different languages. Both
traditional and modern models give valuable insights into multilingual summarization tasks.</p>
        <p>Table 3 reports the ROUGE-2 scores, which measure the overlap of bigrams (two consecutive words)
between the summaries generated and the reference summaries. This metric is a stricter measure than
ROUGE-1, requiring better understanding and generation of context. In this table, the results show that
Freq Based and mT5 consistently deliver higher ROUGE-2 scores, particularly in languages like English
and Gujarati, indicating that these models are better at capturing contextual relationships between
words. SumBasic also does reasonably well across languages, though its scores are typically all lower
than those of more advanced models. Tamil-BERT and MultiIndic fare worse on this metric, particularly in English,
which suggests these models are less effective at producing coherent bigrams in those languages. Overall,
the ROUGE-2 scores indicate that the more advanced approaches, Freq Based and mT5, have a stronger
capability of capturing syntactic relationships between words across languages.
        <p>The ROUGE-L scores in Table 4 measure the longest common subsequence between the summaries
created by each model and the manually created references. Because ROUGE-L considers the structure
and order of the entire summary, it is a better quality measure for summarization. The results in the
table indicate that the ROUGE-L scores follow broadly similar trends to the ROUGE-1 and ROUGE-2
scores. Models like Freq Based and TF-IDF scored significantly higher for both English and Gujarati.
Notably, the mT5 model scores relatively poorly across all languages, which indicates that it struggles to
maintain sentence structure and coherence. Tamil-BERT and Indic-BERT also show lower effectiveness
in languages other than Tamil and Gujarati. Overall, the ROUGE-L scores depict how different models
handle summary coherence and structure, where the classical methods Freq Based and TF-IDF have
proven to maintain sentence-level quality across diverse languages.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Performance Analysis</title>
        <p>This paper evaluated the effectiveness of various text summarization models on four
languages: English, Tamil, Bengali and Gujarati. Among these models, the Frequency Based model
turned out to be the best-performing model for English, Gujarati and Bengali, whereas mT5-Tamil
produced the highest scores for Tamil. The Frequency Based summarization model achieved impressive
ROUGE-1, ROUGE-2, and ROUGE-L scores. Its success is attributed to the fact that it works directly
on significant word occurrences; hence, it is capable of distilling key information without losing
contextual relevance. This makes it well suited to morphologically rich languages, where identifying
the major words can greatly determine the quality of the summary. The ROUGE scores obtained on the
validation dataset are given in Table 2, Table 3 and Table 4.</p>
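<p>For illustration, the core of such a frequency-based extractive summarizer can be sketched in a few lines of Python (a simplified version; the exact tokenization, stop-word handling, and scoring used in this work may differ):</p>

```python
import re
from collections import Counter

def freq_summarize(text, num_sentences=2):
    """Rank sentences by the normalized frequency of their words and
    return the top ones in their original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'\w+', text.lower())
    if not words:
        return ""
    counts = Counter(words)
    max_f = max(counts.values())
    weight = {w: f / max_f for w, f in counts.items()}  # normalize to [0, 1]
    scored = []
    for idx, sent in enumerate(sentences):
        tokens = re.findall(r'\w+', sent.lower())
        if tokens:
            # average word weight, so longer sentences are not favored automatically
            scored.append((sum(weight[t] for t in tokens) / len(tokens), idx))
    top = sorted(sorted(scored, reverse=True)[:num_sentences], key=lambda p: p[1])
    return " ".join(sentences[i] for _, i in top)

print(freq_summarize("Dogs bark. Dogs run fast. Cats sleep.", 1))  # Dogs bark.
```

<p>In practice a language-specific stop-word list would be applied before counting, since otherwise high-frequency function words dominate the weights.</p>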
        <p>For Tamil, the best result was achieved by mT5-Tamil with a ROUGE-1 score of 0.0963; the
Frequency Based model came second with a ROUGE-1 score of 0.0955. Training tailored to Tamil-specific
data proves helpful for the model in understanding the nuances of the language, underlining the case
for language-specific adaptations. Its advanced architecture and contextual understanding make mT5 a
strong tool for Tamil summarization.</p>
        <p>For Gujarati and Bengali, Frequency Based summarization performed consistently well, further
demonstrating its capabilities across languages. The results point to the need to use summarization
methods suited to the syntactic and semantic characteristics of each language.</p>
        <p>However, this study does have some limitations. The frequency-based methods used may produce
summaries that, although accurate on key terms, are shallow and superficial, missing important
contextual information. Model performance may also vary with the quality and size of the training
datasets available for each language, which can disadvantage low-resource languages.</p>
        <p>Advanced models, user feedback, and hybrid approaches combining extractive and abstractive
techniques offer wide potential for improving summarization. Future research can explore neural
networks that deliver deeper awareness of context and semantics to produce higher-quality summaries.
Additionally, user-centric features and interactive summarization tools can improve the practical
applicability of these models and make them more responsive to users’ needs.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Results</title>
      <p>The submissions were ranked 4th for the Gujarati data, 4th for the Bengali data, 4th for the
Tamil data, and 7th for the English data. The performance results are recorded in Table 5 (Gujarati),
Table 6 (Bengali), Table 7 (Tamil) and Table 8 (English).</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>In general, this work has shown that frequency-based approaches are powerful for generating
summaries but limited in some ways, and can be complemented by integrating them with more
sophisticated models. Techniques such as TF-IDF and basic n-gram approaches do an excellent job of
using lexical frequency to retrieve vital content, but they lose sight of the context, semantics, and
coherence of the generated summaries. More complex models such as mT5, built on transformer
architectures and attention mechanisms, can be combined with the best characteristics of
frequency-based techniques. The result is subtler, information-rich summaries that preserve the
content while also maintaining the subtlety of language and meaning.</p>
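<p>The TF-IDF scoring referred to above can be sketched as follows (an illustrative simplification that treats each sentence as its own document and uses a smoothed IDF; the actual implementation may differ):</p>

```python
import math
import re
from collections import Counter

def tfidf_rank(sentences):
    """Treat each sentence as a document and rank sentence indices by the
    total TF-IDF weight of their words (most salient first)."""
    docs = [re.findall(r'\w+', s.lower()) for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency per term
    scores = []
    for d in docs:
        if not d:
            scores.append(0.0)
            continue
        tf = Counter(d)
        # term frequency weighted by smoothed inverse document frequency
        scores.append(sum((c / len(d)) * math.log(1 + n / df[w]) for w, c in tf.items()))
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

sents = ["cats chase mice", "the the the", "the cats"]
print(tfidf_rank(sents))  # the distinctive first sentence ranks highest
```

<p>Sentences dominated by words that occur throughout the text receive low weights, while sentences containing distinctive terms rise to the top of the ranking.</p>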
      <p>Models like mT5 bring a deeper understanding of complex linguistic structures and relations
in text, allowing them to contextualize and therefore produce coherent summaries. Integrating
frequency-based methods with such models is thus a hybrid approach that stands to draw strength from
both sides. For instance, the frequency-based component can highlight highly frequent key phrases and
concepts that the transformer model can then exploit. The resulting summary can form a coherent
narrative with much more depth and meaning than a mere aggregate of the high-frequency terms of the
source text. The present research has outlined ways in which such an integrated approach might find
application across the entire range of linguistic contexts. As the requirements of text summarization
change, adaptation to various languages, dialects, and genres is needed. By applying frequency-based
techniques to mT5 models, we can unlock more robust and adaptive summarization solutions that could
potentially reach a larger audience. The methodology proposed here marks the beginning of future work
on summarization techniques with increased sophistication in terms of how well they function and how
they handle the nuances of human language. Ultimately, these results might make information more
readily available across domains and ease the job of users in discerning meaning from large amounts
of textual data.</p>
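<p>The hybrid extractive-abstractive idea discussed above could be wired together as in the sketch below, where <monospace>abstractive_model</monospace> is a placeholder for an mT5-style summarizer (not implemented here) and the stop-word list is purely illustrative:</p>

```python
import re
from collections import Counter

def hybrid_summarize(text, abstractive_model, top_k=5):
    """Extract the most frequent content words and prepend them as a focus
    hint to the prompt passed to an abstractive summarizer."""
    words = re.findall(r'\w+', text.lower())
    stop = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}  # tiny illustrative stop list
    content = [w for w in words if w not in stop]
    key_terms = [w for w, _ in Counter(content).most_common(top_k)]
    prompt = f"summarize (focus: {', '.join(key_terms)}): {text}"
    return abstractive_model(prompt)

# With a real system, abstractive_model would wrap an mT5 checkpoint;
# here an echo stub just shows the constructed prompt.
print(hybrid_summarize("Floods hit the city. The city floods damaged roads.",
                       lambda p: p, top_k=2))
```

<p>The design choice is that the frequency stage only shapes the prompt; the abstractive model remains free to rephrase, so the summary is not limited to the extracted terms.</p>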
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly in order to:
check grammar and spelling, and paraphrase and reword text. After using these tools/services, the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the
publication’s content.</p>
        <p>[21] R. Joshi, L3Cube-HindBERT and DevBERT: Pre-trained BERT transformer models for Devanagari-based
Hindi and Marathi languages, arXiv preprint arXiv:2211.11418 (2022).
[22] D. Kakwani, A. Kunchukuttan, S. Golla, G. N.C., A. Bhattacharyya, M. M. Khapra, P. Kumar,
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language
Models for Indian Languages, in: Findings of EMNLP, 2020.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Key advances in natural language processing: A 2023 review</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>67</volume>
          (
          <year>2023</year>
          )
          <fpage>34</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Santy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Budhiraja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <article-title>The state and fate of linguistic diversity and inclusion in the nlp world</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6282</fpage>
          -
          <lpage>6293</lpage>
          . URL: https://aclanthology.org/2020.acl-main.557
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Buck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Federico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Haddow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leveling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Monz</surname>
          </string-name>
          , et al.,
          <source>Findings of the 2014 workshop on statistical machine translation</source>
          ,
          <source>in: Proceedings of the Ninth Workshop on Statistical Machine Translation, Association for Computational Linguistics</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>12</fpage>
          -
          <lpage>58</lpage>
          . URL: https://aclanthology.org/W14-3301.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Wijayanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Khodra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Surendro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. H.</given-names>
            <surname>Widyantoro</surname>
          </string-name>
          ,
          <article-title>Learning bilingual word embedding for automatic text summarization in low resource language</article-title>
          ,
          <source>Journal of King Saud University-Computer and Information Sciences</source>
          <volume>35</volume>
          (
          <year>2023</year>
          )
          <fpage>224</fpage>
          -
          <lpage>235</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hedderich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Strötgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Klakow</surname>
          </string-name>
          ,
          <article-title>A survey on recent approaches for natural language processing in low-resource scenarios</article-title>
          , arXiv preprint arXiv:2010.12309 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Challenges and considerations with code-mixed nlp for multilingual societies</article-title>
          ,
          <source>arXiv preprint arXiv:2106.07823</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Poornachandran</surname>
          </string-name>
          ,
          <article-title>Code-mixing: A brief survey</article-title>
          , in: 2018 International conference
          <article-title>on advances in computing, communications and informatics (ICACCI)</article-title>
          , IEEE,
          <year>2018</year>
          , pp.
          <fpage>2382</fpage>
          -
          <lpage>2388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krishnakumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Naushin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bharathi</surname>
          </string-name>
          ,
          <article-title>Text summarization for indian languages using pre-trained models</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Allahyari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pouriyeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Assefi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Safaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Trippe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kochut</surname>
          </string-name>
          ,
          <article-title>Text summarization techniques: A brief survey</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1707.02268. arXiv:1707.02268
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>U.</given-names>
            <surname>Hahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Mani</surname>
          </string-name>
          ,
          <article-title>The challenges of automatic summarization</article-title>
          ,
          <source>Computer</source>
          <volume>33</volume>
          (
          <year>2000</year>
          )
          <fpage>29</fpage>
          -
          <lpage>36</lpage>
          . doi:10.1109/2.881692.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Awasthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Bhogal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. K.</given-names>
            <surname>Soni</surname>
          </string-name>
          ,
          <article-title>Natural language processing (nlp) based text summarization - a survey</article-title>
          ,
          <source>in: 2021 6th International Conference on Inventive Computation Technologies (ICICT)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1310</fpage>
          -
          <lpage>1317</lpage>
          . doi:10.1109/ICICT50816.2021.9358703
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <article-title>Findings of the first shared task on indian language summarization (ILSUM): approaches challenges and the path ahead</article-title>
          , in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , M. Mitra (Eds.), Working Notes of FIRE 2022 -
          <article-title>Forum for Information Retrieval Evaluation, Kolkata</article-title>
          , India, December 9-
          <issue>13</issue>
          ,
          <year>2022</year>
          , volume
          <volume>3395</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>382</lpage>
          . URL: https://ceur-ws.org/Vol-3395/T6-1.pdf
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          , P. Mehta,
          <article-title>FIRE 2022 ILSUM track: Indian language summarization</article-title>
          , in: D.
          <string-name>
            <surname>Ganguly</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gangopadhyay</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Mitra</surname>
          </string-name>
          , P. Majumder (Eds.),
          <source>Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation</source>
          ,
          <string-name>
            <surname>FIRE</surname>
          </string-name>
          <year>2022</year>
          , Kolkata, India, December 9-
          <issue>13</issue>
          ,
          <year>2022</year>
          , ACM,
          <year>2022</year>
          , pp.
          <fpage>8</fpage>
          -
          <lpage>11</lpage>
          . URL: https://doi.org/10.1145/3574318.3574328. doi:10.1145/3574318.3574328.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <source>Indian language summarization at fire</source>
          <year>2023</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>29</lpage>
          . doi:10.1145/3632754.3634662.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Key takeaways from the second shared task on indian language summarization</article-title>
          (ilsum
          <year>2023</year>
          ), in: Fire,
          <year>2023</year>
          . URL: https://api.semanticscholar.org/CorpusID:269791803.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. HL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Overview of the third shared task on indian language summarization</article-title>
          (ilsum
          <year>2024</year>
          ), in: K. Ghosh,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , D. Ganguly (Eds.), Working Notes of FIRE 2024 -
          <article-title>Forum for Information Retrieval Evaluation, volume CEUR-WS</article-title>
          .
          <source>org of CEUR Workshop Proceedings</source>
          , Gandhinagar, India,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hegde</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. HL</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguly</surname>
          </string-name>
          ,
          <article-title>Key insights from the third ilsum track at fire 2024, in: Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          ,
          <string-name>
            <surname>FIRE</surname>
          </string-name>
          <year>2024</year>
          , ACM, Gandhinagar, India,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mubasshir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-B.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shahriyar</surname>
          </string-name>
          ,
          <article-title>XL-sum: Large-scale multilingual abstractive summarization for 44 languages, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021</article-title>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>4693</fpage>
          -
          <lpage>4703</lpage>
          . URL: https://aclanthology.org/2021.findings-acl.413
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shrotriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Puduppully</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunchukuttan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Khapra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Indicnlg suite: Multilingual datasets for diverse nlg tasks in indic languages</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2203.05437.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>