<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Text Summarization for Indian Languages: Finetuned Transformer Model Application</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>V Ilanchezhiyan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R Darshan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>E M Milin Dhitshithaa</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B Bharathi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of CSE, Sri Siva Subramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Rajiv Gandhi Salai, Chennai, Tamil Nadu</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Summarization, like distilling ancient wisdom, has stood the test of time. From ancient storytelling to modern contexts like meetings, the act of condensing information remains relevant. This abstract itself exemplifies the ongoing use of summarization. In today's digital age, where technology mimics once-exclusive human tasks, text summarization is a notable example. Natural Language Processing (NLP) and AI models have automated this skill, though attention to Indian languages remains sparse. This paper explores the work on ILSUM (Indian Language Summarization) in the FIRE 2023 task, comparing existing models. Our m2023 model achieved the second position for English in the FIRE 2023 rankings. We used mT5-small and mT5-base along with a fine-tuned T5-base model, m2023-t5-base, which excelled in generating precise summaries.</p>
      </abstract>
      <kwd-group>
<kwd>Generated Summary</kwd>
        <kwd>Indian Languages</kwd>
        <kwd>Pre-Trained Model</kwd>
        <kwd>m2023-t5-base</kwd>
        <kwd>Hindi</kwd>
        <kwd>Gujarati</kwd>
        <kwd>Bengali</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Natural Language Processing (NLP), a field at the intersection of computer science and artificial
intelligence (AI), integrates computational linguistics (rule-based modelling of human language) with
statistical machine learning and deep learning models. This combination enables
computers to process both voice data and text. This capability allows computers to read text,
comprehend speech, and derive meaning from it. NLP breaks down language into tokens, seeking
to understand the relationships between these tokens. The spectrum of NLP tasks includes
sentiment analysis, word sense disambiguation, grammatical tagging, content categorization,
text summarization, topic discovery and modelling, speech-to-text, and vice versa, among others.
These tasks encounter challenges in achieving accuracy due to the inherent ambiguities in
human language, as well as the diversity of languages, the use of metaphors, sarcasm, idioms,
and various grammatical exceptions.</p>
      <p>
        Text summarization stands as a pivotal application within Natural Language Processing
(NLP), serving the purpose of condensing large volumes of textual information into concise
and meaningful summaries. This process, often referred to as Automatic Text Summarization
(ATS), is crucial in extracting the most salient information from extensive documents, enabling
efficient content digestion and comprehension. As information overload becomes an increasingly
prevalent challenge in the digital age, the significance of text summarization has grown, offering
a solution to distil key insights from vast datasets. There are several papers on utilising
pre-trained models for text summarisation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] While numerous NLP tools and approaches
address these challenges effectively, their applicability is often constrained when dealing with
low-resource languages. Text summarization, a key application of NLP techniques, involves
processing extensive digital text from sources like articles, magazines, or social media. The
goal is to generate concise summaries and synopses for indexes, research databases, or
time-constrained readers. Automatic Text Summarization refers to the computer-driven execution of
this task using algorithms and programs. Text summarization algorithms are compared and
contrasted in this paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-1-1">
        <title>1.1. Summarization Techniques</title>
        <p>Various techniques are employed in text summarization, each leveraging advancements in
computational linguistics, machine learning, and deep learning. Rule-based models, which
rely on predefined linguistic rules, are foundational in summarization. These models extract
important sentences or phrases based on syntactic and semantic structures. Statistical methods,
such as frequency analysis and TF-IDF (Term Frequency-Inverse Document Frequency), quantify
the importance of words and sentences to identify key content. Machine learning techniques,
particularly supervised methods, utilize labelled data to train models to discern relevance and
generate summaries. Deep learning models, including recurrent neural networks (RNNs) and
transformers, have revolutionized summarization by capturing intricate contextual relationships
within the text.</p>
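        <p>As a concrete illustration of the statistical techniques above, the following minimal Python
sketch ranks sentences by their average TF-IDF weight and keeps the top-k as an extractive
summary; the scoring rule is a simple illustrative choice, not the system developed in this work.</p>
        <preformat>
# Extractive summarization sketch: score each sentence by the mean of its
# TF-IDF row and keep the k highest-scoring sentences in original order.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extractive_summary(sentences, k=2):
    # One TF-IDF row per sentence; terms that are frequent in a sentence but
    # rare across the document receive higher weight.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    top = sorted(np.argsort(scores)[-k:])  # best k, restored to source order
    return " ".join(sentences[i] for i in top)

doc = [
    "Heavy rains lashed Chennai and Pondicherry on Monday.",
    "The weather office issued a red alert for coastal districts.",
    "Schools and colleges remained closed for the day.",
    "Local trains ran with minor delays.",
]
print(extractive_summary(doc, k=2))
        </preformat>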
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Types of Models</title>
        <p>Text summarization encompasses diverse approaches, with extractive and abstractive
summarization being the primary paradigms. Extractive summarization involves selecting and
rearranging existing sentences from the source text to form a summary. This method relies
on sentence importance scores calculated through various algorithms, such as graph-based
models or machine learning classifiers. Abstractive summarization, on the other hand, goes
beyond extraction by generating new sentences that capture the essence of the source text. This
process requires an understanding of the content and often involves complex NLP models, such
as transformers, that can paraphrase and rephrase information to create coherent summaries. In
recent years, pre-trained language models, such as BERT (Bidirectional Encoder Representations
from Transformers) and GPT (Generative Pre-trained Transformer), have gained prominence in
text summarization. These models, trained on vast amounts of diverse text data, demonstrate
superior language understanding and generation capabilities. Fine-tuning these models for
summarization tasks has yielded state-of-the-art results in producing human-like summaries.
In this dynamic landscape, the exploration of novel architectures, hybrid models, and linguistic
innovations continues to drive advancements in text summarization. This comprehensive
overview aims to elucidate the multifaceted nature of summarization, showcasing its evolution
from traditional rule-based systems to cutting-edge deep learning approaches, and highlighting
the ongoing quest for more effective and linguistically nuanced models. The remainder of the
paper is organised as follows: Section 2 discusses the literature survey of related works. The
descriptions of data and the proposed methods are detailed in Sections 3 and 4 respectively.
Section 5 underscores the achieved results and presents a thorough analysis of the performances
of each model. Section 6 concludes the paper and contains a discussion about future research.</p>
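        <p>As a brief illustration of abstractive summarization with a pre-trained transformer, the
following hedged sketch uses the Hugging Face pipeline API; the t5-small checkpoint is an
illustrative stand-in, not one of the models evaluated in this work.</p>
        <preformat>
# Abstractive summarization with a pre-trained encoder-decoder model.
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")

article = (
    "Natural Language Processing combines computational linguistics with "
    "machine learning to let computers read, understand, and generate text. "
    "Text summarization condenses long documents into short summaries."
)
# max_length/min_length bound the generated summary length in tokens.
result = summarizer(article, max_length=40, min_length=10, do_sample=False)
print(result[0]["summary_text"])
        </preformat>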
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The literature on Indian language summarization spans various languages, with a focus on
techniques and approaches tailored to specific linguistic nuances. A work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] employs neural
network models for abstractive summarization in Hindi, leveraging the Transformer architecture
for improved contextual understanding.
      </p>
      <p>In the context of English, research [4] delves into the application of BERT-based models
for summarization tasks, demonstrating their effectiveness across diverse datasets. Another
study [5] explores the use of reinforcement learning in abstractive summarization, showcasing
advancements in generating coherent and concise summaries.</p>
      <p>For Bengali, a noteworthy paper [6] investigates the challenges and opportunities in Bengali
language summarization. The work emphasizes the need for language-specific models to capture
the intricacies of Bengali, showcasing the development of a summarization system tailored to
this language’s linguistic characteristics.</p>
      <p>In Gujarati, research [7] focuses on leveraging pre-trained language models for summarization
tasks. The study provides insights into the adaptability and performance of such models in the
context of Gujarati language summarization.</p>
      <p>To further enrich the literature survey, a comparative analysis of these studies across Hindi,
English, Bengali, and Gujarati reveals the diverse methodologies employed, ranging from
graph-based approaches to neural networks and pre-trained models. These works collectively
contribute to the ongoing advancements in Indian language summarization, acknowledging the
linguistic diversity and unique challenges posed by each language.</p>
      <p>The previous papers [8][9][10][11][12] on the shared tasks organised by FIRE, spanning
different approaches and models, provided various perspectives for visualising a solution to the
problem.</p>
      <p>In the exploration of the task, dataset, and methodologies employed by various participants,
references to the shared task papers offer comprehensive insights. The work by [13] delves
into the intricacies of Indian Language Summarization at FIRE 2023, encompassing diverse
approaches and dataset considerations. Similarly, the collaborative effort outlined in [14]
provides a comprehensive overview of the Second Shared Task on Indian Language Summarization
(ILSUM 2023), elucidating the strategies adopted by multiple teams.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiment Dataset</title>
      <p>The following section provides a detailed description of the data used in this study as well
as the preprocessing techniques employed. The undertaken task has also been discussed
comprehensively.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Description</title>
        <p>The dataset for this task comprises approximately 15,000 news articles in each language, drawn
from leading newspapers in the country. The objective is to generate a meaningful fixed-length
summary, be it extractive or abstractive, for each article. Notably, the dataset introduces a
unique challenge of code-mixing and script mixing, where phrases from English are commonly
integrated into news articles, even when the primary language is an Indian language.</p>
        <p>A distinctive feature of this dataset is its inclusion of examples exhibiting code-mixing in
both headlines and articles, reflecting the common practice of borrowing English phrases within
Indian language content.</p>
        <p>The dataset is structured into four CSV files: english-train.csv, hindi-train.csv,
gujarati-train.csv, and bengali-train.csv. Each file contains columns for articles and summaries, providing
a comprehensive foundation for training and evaluating models on the task of generating
meaningful summaries for news articles in diverse Indian languages, while accommodating the
intricacies of code-mixing and script mixing.</p>
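        <p>As a minimal sketch, the four training files described above can be loaded with pandas; the
columns printed will match whatever headers the released CSVs carry.</p>
        <preformat>
# Load the four ILSUM training files; each holds article-summary pairs.
import pandas as pd

files = {
    "english": "english-train.csv",
    "hindi": "hindi-train.csv",
    "gujarati": "gujarati-train.csv",
    "bengali": "bengali-train.csv",
}

datasets = {lang: pd.read_csv(path) for lang, path in files.items()}
for lang, df in datasets.items():
    # Roughly 15,000 rows per language, per the dataset description.
    print(lang, df.shape, list(df.columns))
        </preformat>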
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task Description</title>
        <p>Generate concise and meaningful fixed-length summaries for news articles in multiple Indian
languages, considering the challenge of code-mixing and script mixing. The dataset includes
articles and headline pairs from leading newspapers in English, Hindi, Gujarati, and Bengali.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Data Pre-processing</title>
        <p>The dataset employed in this study comprises articles and summaries in English
and Hindi.</p>
        <p>Special characters and punctuation marks occur frequently in the dataset. No preprocessing
step was applied to scan for special characters and replace them with spaces in the text data of
the training, validation, and test datasets, as such characters carry context that is important for
generating summaries. Further, entries with missing values or labels were not present, as the
provided dataset had already been pre-processed. Data Cleaning: the given dataset did not contain
any null characters, empty lines, or newline characters. However, the data was mixed with HTML
and XML entity codes. Those codes had to be replaced with their corresponding characters:
"&amp;#8217" with ",", "&amp;#8211;" with "–", "&amp;#8220;" with "“", "&amp;#8221;" with "”",
and "&amp;#8216;" with "‘". Without this pre-processing step the model generated random words
along with HTML and XML entity codes.</p>
        <p>Before pre-processing: Heavy rains in Chennai&amp;#8217 Pondicherry and Kerala...</p>
        <p>After pre-processing: Heavy rains in Chennai, Pondicherry and Kerala</p>
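        <p>A minimal sketch of this clean-up step, assuming the replacement mapping reported above,
could look as follows in Python.</p>
        <preformat>
# Entity-code clean-up sketch; the mapping mirrors the replacements
# reported in this section.
ENTITY_MAP = {
    "&amp;#8217": ",",   # mapped to a comma, as in the example above
    "&amp;#8211;": "–",  # en dash
    "&amp;#8220;": "“",  # left double quotation mark
    "&amp;#8221;": "”",  # right double quotation mark
    "&amp;#8216;": "‘",  # left single quotation mark
}

def clean(text):
    # Apply each replacement in turn; the dataset is small enough that
    # simple str.replace calls suffice.
    for code, char in ENTITY_MAP.items():
        text = text.replace(code, char)
    return text

print(clean("Heavy rains in Chennai&amp;#8217 Pondicherry and Kerala"))
# -> Heavy rains in Chennai, Pondicherry and Kerala
        </preformat>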
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Methodology</title>
      <p>This section details in-depth explanations for each of the experiments conducted for the shared
task. Figure 1 depicts the general flow of the process.</p>
      <sec id="sec-4-1">
        <title>4.1. Multilingual Summarization with T5-base and Translation</title>
        <p>In the initial stages of this project, the selection of the T5-base model as the primary
summarization model was informed by a meticulous comparative analysis. This analysis demonstrated that
T5-base consistently outperformed its smaller counterpart, T5-small, offering more dependable
and contextually accurate summaries. The decision to leverage T5-base aligns with the project’s
overarching objective of achieving high-quality summarization results across the linguistically
diverse landscape of Indian English and Hindi.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Comparative Analysis for Model Selection</title>
        <p>To establish the efficacy of T5-base, a comparative analysis was conducted, considering
various summarization models. The results indicated that T5-base consistently demonstrated
superior performance, making it the model of choice for this multilingual summarization task.
The model’s ability to handle diverse linguistic nuances and generate contextually accurate
summaries influenced this selection.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics</title>
        <p>The effectiveness of the summarization models is assessed using two key metrics: ROUGE
(Recall-Oriented Understudy for Gisting Evaluation) and BERT (Bidirectional Encoder Representations
from Transformers) scores. These metrics provide a comprehensive evaluation of the generated
summaries:</p>
        <p>ROUGE Scores: ROUGE evaluates the overlap between the generated summaries and reference
summaries, offering a quantitative measure of the content’s quality. It assesses the precision of
the model in capturing essential information from the source articles.</p>
        <p>BERT Scores: BERT scores gauge the semantic similarity between the generated and reference
summaries. This metric delves into the contextual understanding of the model, providing insights
into how well the summarization captures the underlying meaning of the source text.</p>
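        <p>A hedged sketch of computing both metrics with the community rouge-score and bert-score
packages follows; the example texts are illustrative only.</p>
        <preformat>
# Evaluate a candidate summary against a reference with ROUGE and BERTScore.
# Requires: pip install rouge-score bert-score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "Heavy rains hit Chennai, Pondicherry and Kerala."
candidate = "Heavy rain lashed Chennai, Pondicherry and Kerala today."

# ROUGE-1/2/L: unigram, bigram, and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
for name, result in scorer.score(reference, candidate).items():
    print(name, round(result.fmeasure, 4))

# BERTScore: semantic similarity from contextual token embeddings.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERT P/R/F1:", P.item(), R.item(), F1.item())
        </preformat>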
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Main Workflow</title>
        <p>The primary workflow of the proposed methodology involves the following steps; a code
sketch follows the list.</p>
        <p>[Fig 1. Flow Diagram of the Main Workflow: token embedding and positional encoding feed the
encoder layers (self-attention, feed-forward NN); the decoder layers (masked self-attention,
cross-attention, token generation and softmax) produce the output summary, with a translator
providing translation and back-translation around the model.]</p>
        <p>
• Translation to English: Articles in Indian languages (Hindi) are translated into English
using a language translation model. This step aims to leverage the extensive English
datasets available for training the summarization model.
• Summarization with T5-base: The translated articles are fed into the T5-base model,
which is fine-tuned on the FIRE 2023 dataset specific to the summarization task. The
model generates English summaries that capture the key information from the translated
articles; these English summaries then need to be back-translated.
• Back-Translation: The generated English summaries undergo a subsequent step where
they are translated back into the original Indian languages using the translation model.
This phase ensures that the final summaries are linguistically accurate, thereby presenting
information in the desired languages. By leveraging the translation capabilities of the
model, we enhance the overall summarization process, allowing for broader accessibility
and understanding of the content in the targeted linguistic contexts. This iterative
approach of integrating translation back into the summarization workflow contributes to
appropriate summaries.</p>
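        <p>The following Python sketch illustrates the three steps end to end; the Helsinki-NLP
opus-mt checkpoints and the plain t5-base checkpoint are assumed stand-ins for the translation
model and the fine-tuned summarization model used in the experiments.</p>
        <preformat>
# Translate -> summarize -> back-translate, with Hugging Face pipelines.
from transformers import pipeline

to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")
to_hi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
summarize = pipeline("summarization", model="t5-base")  # fine-tuned in practice

def summarize_hindi(article):
    # 1. Translate the Hindi article into English.
    english = to_en(article)[0]["translation_text"]
    # 2. Summarize the English text with the (fine-tuned) T5-base model.
    summary = summarize(english, max_length=75, do_sample=False)[0]["summary_text"]
    # 3. Back-translate the summary into Hindi.
    return to_hi(summary)[0]["translation_text"]
        </preformat>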
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Performance Evaluation</title>
        <p>The performance of the proposed methodology is quantitatively assessed using ROUGE and
BERT scores, providing a comprehensive understanding of the summarization quality across
multiple languages. The effectiveness of the model is validated through its ability to generate
contextually accurate and linguistically appropriate summaries for diverse linguistic inputs.</p>
        <p>This proposed methodology integrates the strengths of T5-base, translation models, and
robust evaluation metrics, offering a comprehensive approach to multilingual summarization
that aligns with the specific linguistic challenges posed by Indian English and Hindi. The
model-language configurations evaluated were T5-base (English), T5-small (English), T5-base
(Hindi), and T5-small (Hindi).</p>
        <p>Initially, before fine-tuning with the FIRE 2023 dataset, the model performance was
very poor and the generated summaries were inaccurate.</p>
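        <p>A compressed, hedged sketch of such fine-tuning with the Hugging Face Seq2SeqTrainer
follows; the file name, column names, and hyperparameters are illustrative assumptions.</p>
        <preformat>
# Fine-tune T5-base on article-summary pairs (illustrative configuration).
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
data = load_dataset("csv", data_files={"train": "english-train.csv"})

def preprocess(batch):
    # T5 expects a task prefix; reference summaries become decoder labels.
    inputs = tok(["summarize: " + a for a in batch["Article"]],
                 max_length=512, truncation=True)
    labels = tok(text_target=batch["Summary"], max_length=75, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = data["train"].map(preprocess, batched=True,
                              remove_columns=data["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="t5-base-ilsum",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
        </preformat>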
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>The results for the models can be comparatively analysed using Table 1 and Table 2. There is a
significant improvement in the model’s summarising capability with translation, fine-tuning,
and back-translation.</p>
      <p>[Tables 1 and 2: ROUGE-1, ROUGE-2, ROUGE-L, and BERT F1 scores for English and Hindi.]</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this comprehensive investigation into article summarization across Indian languages,
encompassing Hindi and English, our study delved into the evaluation metrics, particularly focusing
on ROUGE and BERT scores for English. The meticulous analysis of these scores provides
valuable insights into the effectiveness of our summarization techniques.</p>
      <p>For English, the ROUGE scores revealed commendable results: ROUGE-1 at 0.3022, ROUGE-2
at 0.1111, ROUGE-4 at 0.042, and ROUGE-L at 0.2504. These scores reflect the precision of
our models in capturing unigram, bigram, and long-range dependencies, showcasing their
proficiency in generating accurate and coherent summaries.</p>
      <p>The BERT scores further substantiated the success of our summarization approach: Precision
(P) at 0.8505, Recall (R) at 0.8733, and F1 Score at 0.8616. These scores emphasize the model’s
ability to comprehend and reproduce the semantic nuances present in the source articles,
indicating a robust understanding of contextual information.</p>
      <p>The positive outcomes in English summarization lay a strong foundation for extending these
techniques to other Indian languages, addressing the linguistic diversity inherent in the dataset.</p>
      <p>In conclusion, our study not only attests to the efficacy of our summarization models, as
evidenced by the impressive ROUGE and BERT scores for English, but also sets the stage for
further exploration and adaptation of these methodologies for Indian languages. The promising
results underscore the potential of our techniques to enhance summarization across diverse
linguistic landscapes, contributing to the advancement of natural language processing in the
context of Indian languages.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgements</title>
      <p>We thank the FIRE 2023 organising committee for conducting this shared task for Indian
Language Summarization.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Krishnakumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naushin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mrithula</surname>
            ,
            <given-names>K.L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bharathi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <year>2022</year>
          , December.
          <article-title>Text summarization for Indian languages using pre-trained models</article-title>
          .
          <source>In Working Notes of FIRE 2022-Forum for Information Retrieval Evaluation</source>
          , Kolkata, India.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Mackie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCreadie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ounis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Comparing algorithms for microblog summarisation</article-title>
          .
          <source>In Information Access Evaluation. Multilinguality, Multimodality, and Interaction: 5th International Conference of the CLEF Initiative, CLEF</source>
          <year>2014</year>
          ,
          <article-title>Shefield</article-title>
          , UK,
          <source>September 15-18</source>
          ,
          <year>2014</year>
          . Proceedings 5 (pp.
          <fpage>153</fpage>
          -
          <lpage>159</lpage>
          ). Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Varma</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <year>2018</year>
          .
          <article-title>Neural approaches towards text summarization</article-title>
          .
          <source>International Institute of Information Technology Hyderabad.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Agrawal, A., Jain, R., Divanshi and Seeja, K.R., 2023, February. Text Summarisation Using BERT. In International Conference On Innovative Computing And Communication (pp. 229-242). Singapore: Springer Nature Singapore.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Paulus, R., Xiong, C. and Socher, R., 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Rahman, A., Rafiq, F.M., Saha, R. and Rafian, R., 2018. Bengali text summarization using TextRank, Fuzzy C-means and aggregated scoring techniques (Doctoral dissertation, BRAC University).</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Kumari, K. and Kumari, R., 2022. An Extractive Approach for Automated Summarization of Indian Languages using Clustering Techniques. In Forum for Information Retrieval Evaluation (Working Notes) (FIRE). CEUR-WS.org.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Satapara, S., Modha, B., Modha, S. and Mehta, P., 2022, December. FIRE 2022 ILSUM track: Indian language summarization. In Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation (pp. 8-11).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Satapara, S., Modha, B., Modha, S. and Mehta, P., 2022. Findings of the first shared task on Indian language summarization (ILSUM): Approaches, challenges and the path ahead. Working Notes of FIRE, pp. 9-13.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] Singh, S., Singh, J.P. and Deepak, A., 2022, December. Deep Learning based Abstractive Summarization for English Language. In Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, Kolkata, India.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Agarwal, A., Naik, S. and Sonawane, S., 2022, December. Abstractive Text Summarization for Hindi Language using IndicBART. In Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation, Kolkata, India.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] Urlana, A., Bhatt, S.M., Surange, N. and Shrivastava, M., 2023. Indian language summarization using pretrained sequence-to-sequence models. arXiv preprint arXiv:2303.14461.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Satapara, S., Mehta, P., Modha, S. and Ganguly, D. Indian Language Summarization at FIRE 2023. In Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2023, Goa, India, December 15-18, 2023. ACM, 2023.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Satapara, S., Mehta, P., Modha, S. and Ganguly, D. Key Takeaways from the Second Shared Task on Indian Language Summarization (ILSUM 2023). In Working Notes of FIRE 2023 - Forum for Information Retrieval Evaluation, Goa, India, December 15-18, 2023. Edited by Ghosh, K., Mandl, T., Majumder, P. and Mitra, M. CEUR Workshop Proceedings, CEUR-WS.org, 2023.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>