<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Abstractive Summarization of large articles in Hindi Language using IndicBART</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chaitanya Subhedar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sheetal Sonawane</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SCTR's Pune Institute of Computer Technology</institution>
          ,
          <addr-line>Pune</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Natural Language Processing (NLP) is rapidly evolving, with significant advancements in tasks such as translation, summarization, and language understanding across multiple languages. The field has made remarkable progress in enabling machines to comprehend and generate human language, bridging communication gaps. However, summarization for Indian languages, especially low-resource ones like Hindi, remains a challenging problem due to the limited availability of annotated datasets and to linguistic diversity. This paper presents an approach using the IndicBART model, a pre-trained language model designed for Indian languages, to generate coherent and meaningful summaries for Hindi text, addressing the unique challenges associated with Indian language summarization. The paper also discusses dividing articles into smaller chunks and generating sub-summaries for cases where compute resources are limited and a smaller model must be used.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Text Summarization</kwd>
        <kwd>IndicBART</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Natural Language Processing (NLP) has seen rapid advancements over the past decade, enabling
machines to perform tasks such as translation, summarization, and sentiment analysis across diverse
languages. These developments have significantly improved the ability to bridge language barriers
and facilitate multi-language communication. However, challenges remain, especially in the context of
low-resource languages, where the availability of high-quality annotated data is limited, hindering the
effectiveness of various NLP models.</p>
      <p>In the case of Indian languages, summarization poses a unique challenge due to the vast linguistic
diversity and complex syntactic structures. While some progress has been made for widely spoken
languages like Hindi, there is still a need for more refined approaches that can accurately capture the
nuances and produce coherent summaries. This paper addresses these challenges by focusing on Hindi
language summarization using pre-trained language models.</p>
      <p>One of the initial approaches explored for Hindi summarization was the use of the IndicBERT model, a
pre-trained language model specifically designed for Indian languages. IndicBERT has shown promising
results in various language understanding tasks; however, it encountered significant challenges in
tokenizing Hindi text for summarization. Specifically, the tokenizer often stripped away vowel markers
and other diacritics, which are crucial for meaning in Hindi. This resulted in distorted tokenized
representations that compromised the quality of generated summaries. The inability to retain essential
linguistic components made IndicBERT unsuitable for the task, as it failed to adequately capture the
semantic nuances of the original text.</p>
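      <p>To make the effect of losing vowel markers concrete, the following minimal sketch removes every Unicode combining mark from a Hindi word. It is only an illustration of the kind of distortion described above, not the IndicBERT tokenizer's actual code; the helper name strip_marks is hypothetical.</p>
      <preformat>
```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop every Unicode combining mark (categories Mn, Mc, Me) from text.

    Hypothetical helper: it imitates the loss of vowel signs and diacritics,
    it is not taken from any tokenizer implementation.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


# The word "Hindi" in Devanagari, written as explicit code points:
# HA + VOWEL SIGN I + ANUSVARA + DA + VOWEL SIGN II
word = "\u0939\u093f\u0902\u0926\u0940"
stripped = strip_marks(word)

# Only the two bare consonants HA and DA survive.
print(word, "->", stripped)
```
      </preformat>
      <p>With the vowel signs and the anusvara removed, the remaining consonants no longer spell the original word, which is why such tokenized input is a poor basis for summarization.</p>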
      <p>Another approach tested was the mT5 model, a multilingual variant of the T5 (Text-to-Text Transfer
Transformer) model that supports over 100 languages, including Hindi. The mT5-small model was
chosen due to computational resource constraints. However, it struggled to learn effectively during the
training phase, yielding low ROUGE scores that did not improve significantly over successive
epochs. While larger versions of mT5, such as mT5-base or mT5-large, could potentially offer better
performance, upgrading to these models was not feasible due to limited GPU memory.
Consequently, mT5-small was found to be inadequate for producing high-quality summaries, particularly
for a complex language like Hindi.</p>
      <p>IndicBART emerged as a more suitable model for Hindi summarization than both IndicBERT
and mT5-small. Unlike IndicBERT, IndicBART’s tokenizer handled the intricacies of the Hindi language
much more effectively, preserving vowel markers and diacritics, which are essential for capturing
the meaning of words accurately. This resulted in better tokenized representations that retained the
linguistic richness of the original text. Additionally, while both IndicBART and mT5-small are based
on encoder-decoder architectures, IndicBART is designed specifically for Indian languages and has
been pre-trained on a diverse corpus of Indian language texts. This likely gave it an advantage over
mT5-small, which, although multilingual, may not have had the same level of exposure to Indian
language data. Consequently, IndicBART demonstrated better learning capabilities and produced
higher-quality summaries, making it a more effective choice for the task.</p>
      <p>We further discuss the approach of dividing large input texts (news articles in this case) into smaller
chunks so that they fit within the maximum input size of IndicBART (1024 tokens).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology and Algorithm</title>
      <p>IndicBART is conceptually based on the mBART25/50 model family and uses a sequence-to-sequence
architecture with both an encoder and a decoder. The model has six layers each in the encoder and
decoder, a hidden size of 1024, and a feed-forward filter size of 4096. It employs 16 attention heads
to effectively capture contextual information across the input sequence. In total, the model has
244 million parameters.</p>
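      <p>As a rough sanity check on the architecture figures above, the parameter count can be estimated from the layer sizes. The sketch below is a back-of-the-envelope calculation, not the model's official accounting: it ignores biases, layer normalization and positional embeddings, and the vocabulary size of 64,000 sub-words is an assumption that is not stated in this paper.</p>
      <preformat>
```python
def estimate_parameters(layers=6, hidden=1024, ffn=4096, vocab=64_000):
    """Approximate weight count of an encoder-decoder Transformer.

    Biases, layer norms and positional embeddings are ignored, and the
    vocabulary size is an assumed value.
    """
    attention = 4 * hidden * hidden               # Q, K, V and output projections
    feed_forward = 2 * hidden * ffn               # two linear maps per block
    encoder_layer = attention + feed_forward
    decoder_layer = 2 * attention + feed_forward  # self- plus cross-attention
    embeddings = vocab * hidden                   # one shared embedding table
    return layers * (encoder_layer + decoder_layer) + embeddings


total = estimate_parameters()
print(f"{total / 1e6:.1f}M parameters (approximate)")
```
      </preformat>
      <p>Under these assumptions the estimate comes to roughly 242 million weights, which is consistent with the reported total of 244 million once the ignored terms are taken into account.</p>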
      <p>The details of our approach are as follows.</p>
      <p>IndicBART accepts an input of at most 1024 tokens. The average size of the
input in the dataset (headline + article) was 474 words before the tokenizer was applied, while the
maximum was 5089 words. IndicBART uses a SentencePiece tokenizer, which breaks words down into
sub-words. This means that the input tensor of tokens that the tokenizer produces for each article
will be even longer than the number of words. Thus, we
cannot pass the whole input string directly into the model: if the IndicBART tokenizer were
applied directly to each input row of the dataset, it would simply truncate all tokens
past the 1024-token limit, resulting in severe information loss.</p>
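      <p>The scale of this loss can be illustrated with the word counts quoted above. The following sketch computes the fraction of an input that is discarded by truncation at 1024; it treats words as tokens, which is an optimistic assumption because sub-word tokenization yields at least as many tokens as words. The function name truncated_fraction is hypothetical.</p>
      <preformat>
```python
def truncated_fraction(num_tokens: int, max_tokens: int = 1024) -> float:
    """Fraction of an input that is discarded when it is cut at max_tokens."""
    lost = max(num_tokens - max_tokens, 0)
    return lost / num_tokens if num_tokens else 0.0


# Word counts from the dataset; real token counts would be higher still.
longest = truncated_fraction(5089)
average = truncated_fraction(474)
print(f"longest article: {longest:.1%} lost")
print(f"average article: {average:.1%} lost")
```
      </preformat>
      <p>Even under this optimistic word-level count, about four fifths of the longest article would be dropped before the model sees it.</p>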
      <p>To counter this, the following approach was used. First, we combined the headline and article into
a single column called "Combined" so as to give the article more context in a single data sample. Then
we split each sample in the "Combined" column into chunks. We trained
the model on these chunks, generated a summary for each chunk, recombined the generated
summaries of all chunks belonging to the same article, and evaluated the result against the target summaries.</p>
      <p>Following is the step-by-step flow of the algorithm.</p>
      <p>We implemented the above using the pandas apply function in Python, returning an array
of chunks for each entry in the "Combined" column. We create an additional column named "Chunks" to
store the returned arrays. We then use the explode function provided by pandas so that
each chunk occupies its own row while keeping the id of the original data sample in the "Combined" column.</p>
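      <p>The chunking and explode steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiments: the chunks are fixed-size word windows, the limit of 300 words per chunk is an arbitrary placeholder, and the helper names chunk_words and explode_chunks are hypothetical.</p>
      <preformat>
```python
def chunk_words(text: str, max_words: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def explode_chunks(rows: dict[int, str], max_words: int = 300) -> list[tuple[int, str]]:
    """Return one (article_id, chunk) pair per chunk.

    Every chunk keeps the id of the article it came from, which mirrors
    what exploding a column of chunk lists achieves.
    """
    return [
        (article_id, chunk)
        for article_id, text in rows.items()
        for chunk in chunk_words(text, max_words)
    ]


# Toy "Combined" column keyed by article id.
combined = {0: "w1 w2 w3 w4 w5", 1: "w6 w7"}
exploded = explode_chunks(combined, max_words=2)
print(exploded)
```
      </preformat>
      <p>With pandas, the same idea would correspond to applying chunk_words to the "Combined" column to fill the "Chunks" column and then calling explode on "Chunks", so that the original row index serves as the shared id.</p>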
      <p>Following this, we used the Trainer API provided by Hugging Face to train the IndicBART model.
We trained the model for 10 epochs with a weight decay of 0.01 and a batch size of 4. The model
generated a sub-summary for each chunk. Using the id that we
preserved in the dataset after chunking, all
sub-summaries whose chunks share a common id were recombined to produce the final
summary, which is then used for ROUGE score evaluation.</p>
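      <p>The recombination step can be sketched as a simple group-by on the preserved id. The sketch assumes that sub-summaries arrive in chunk order and are joined with a single space; both the joining rule and the function name recombine are assumptions made for illustration.</p>
      <preformat>
```python
from collections import defaultdict


def recombine(sub_summaries: list[tuple[int, str]]) -> dict[int, str]:
    """Join sub-summaries that share an article id, keeping their input order."""
    grouped = defaultdict(list)
    for article_id, summary in sub_summaries:
        grouped[article_id].append(summary.strip())
    return {
        article_id: " ".join(parts)
        for article_id, parts in grouped.items()
    }


# Toy model outputs: two chunks for article 0, one chunk for article 1.
generated = [(0, "first part."), (0, "second part."), (1, "only part.")]
final_summaries = recombine(generated)
print(final_summaries)
```
      </preformat>
      <p>Each value in the resulting mapping is the final summary of one article and can be compared directly against that article's target summary.</p>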
    </sec>
    <sec id="sec-3">
      <title>3. Experimentation Details</title>
      <sec>
        <title>3.1. Dataset</title>
        <p>The dataset provided for this task contains pairs of articles and headlines from various newspapers. A
target summary is also provided for each article, against which validation is performed. The training, validation
and test datasets consist of 10427, 1500 and 3000 rows respectively.</p>
        <p>The histograms in Figure 3 show the token count (post-tokenization) of the training dataset for the
Article (including headline) and the Summary.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We observed a clear performance gap between IndicBART and mT5-small when
evaluated on the Hindi summarization dataset. IndicBART (validation-set results shown in Table 1)
showed consistently lower training and validation loss values across epochs, indicating
better convergence. The ROUGE scores for IndicBART increased steadily, ultimately reaching
25.25 for ROUGE-1, 18.32 for ROUGE-2, 24.87 for ROUGE-L, and 24.91 for ROUGE-Lsum by the 10th
epoch. In contrast, mT5-small (validation-set results shown in Table 2) yielded lower ROUGE scores,
with ROUGE-1 peaking at 11.78, ROUGE-2 at 5.10, ROUGE-L at 11.60, and ROUGE-Lsum at 11.53
after five epochs. These metrics underscore IndicBART’s ability to generate summaries with higher
relevance and coherence for this task, outperforming mT5-small, which struggled to capture sufficient
contextual information in its summaries, likely due to its smaller pre-trained vocabulary and fewer
layers. IndicBART’s performance on the test set is documented in Table 3 and Table 4.</p>
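      <p>For readers unfamiliar with the metric, the sketch below shows a simplified ROUGE-1 F1 computation based on clipped unigram overlap. It assumes plain whitespace tokenization and is not the evaluation code used to produce the scores above; the function name rouge1_f1 is hypothetical, and the result is on a 0 to 1 scale rather than the 0 to 100 scale used in the tables.</p>
      <preformat>
```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 on whitespace tokens (a simplified ROUGE-1)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped matches: a token counts at most as often as it occurs in both.
    overlap = sum(
        min(count, cand_counts[token]) for token, count in ref_counts.items()
    )
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of the three candidate tokens appear in the four-token reference.
score = rouge1_f1("a b c d", "a b e")
print(round(score, 4))
```
      </preformat>
      <p>In this toy case precision is 2/3 and recall is 1/2, giving an F1 of 4/7. The same overlap logic applies to Devanagari text as long as the tokens are split consistently for both the reference and the generated summary.</p>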
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this paper, we explored fine-tuning the IndicBART model for summarization, applying
additional data preprocessing techniques such as chunking to handle the large amount of text that
exceeded the input size of the model.</p>
      <p>In the future, we would like to explore how to transform the data further so as to establish a more
coherent relation between the chunked summaries. We would also like to explore larger models, such as those in the
mT5 family, by acquiring stronger compute environments.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly for grammar
and spelling checks. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
  </back>
</article>