<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>D. Taunk); https://www.iiit.ac.in/~vv (V. Varma)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Summarizing Indian Languages using Multilingual Transformers based Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dhaval Taunk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasudeva Varma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Institute of Information Technology</institution>
          ,
          <addr-line>Hyderabad, Telangana</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>With the advent of multilingual models such as mBART, mT5 and IndicBART, summarization in low-resource Indian languages is receiving a lot of attention nowadays, but the number of available datasets is still small. In this work, we (Team HakunaMatata) study how these multilingual models perform on summarization datasets whose source and target texts are in Indian languages. We experimented with the IndicBART and mT5 models and report ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-4 scores as performance metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>Abstractive Summarization</kwd>
        <kwd>mBART</kwd>
        <kwd>mT5</kwd>
        <kwd>IndicBART</kwd>
        <kwd>ROUGE</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automatic text summarization has many potential applications in the current technological
era, such as summarizing news articles and research articles. A lot of work has already been done
on summarizing English text, but very little on summarizing Indian languages. Summarizing text
in languages other than English has therefore become an essential task. India has approximately
350 million Hindi speakers and 50 million Gujarati speakers, so building summarization models
for these languages is crucial. Recently, transformer-based models like mBART[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], mT5[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and IndicBART[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
have gained a lot of attention because of their multilingual capabilities, which cover various
Indic languages.
      </p>
      <p>Summarization can be performed in 2 ways: extractive and abstractive. In extractive
summarization, a subset of sentences from the input text is taken as the output summary. In
abstractive summarization, the entire summary is generated from scratch with the source text
as input. Because the summary is generated from scratch, abstractive output reads more like
human-written text, but it is also more difficult to produce than an extractive
summary.</p>
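      <p>To make the extractive case concrete, the following is a minimal sketch of a frequency-based sentence selector. It is not part of our system, which is abstractive; the function names (split_sentences, tokenize, extractive_summary), the scoring rule and the example text are all assumptions chosen for illustration.</p>
      <preformat>
```python
import re
from collections import Counter

# Punctuation stripped from tokens; the danda is included for Hindi text.
PUNCTUATION = ".,!?।"


def split_sentences(text):
    """Split text into sentences that end in '.', '!', '?' or the danda."""
    pieces = re.findall(r"[^.!?।]+[.!?।]?", text)
    return [piece.strip() for piece in pieces if piece.strip()]


def tokenize(text):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (word.strip(PUNCTUATION) for word in text.lower().split())
    return [token for token in tokens if token]


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    frequency = Counter(tokenize(text))

    def score(sentence):
        # Average document-level frequency of the words in the sentence.
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        return sum(frequency[token] for token in tokens) / len(tokens)

    ranked = sorted(
        range(len(sentences)),
        key=lambda index: score(sentences[index]),
        reverse=True,
    )
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[index] for index in chosen)


article = (
    "The model summarizes news. "
    "The model reads news articles and summarizes news quickly. "
    "Cats sleep."
)
print(extractive_summary(article, num_sentences=1))
```
      </preformat>
      <p>Every sentence in the output is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive model instead generates new text token by token.</p>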
      <p>
        In this work, we perform abstractive summarization on these languages as part of
the FIRE 2022 shared task ILSUM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], using the dataset provided by the organizers. We used
IndicBART and mT5 models for our experiments. We also performed data augmentation and
tested its effect on the performance of the models. Finally, we report the ROUGE-1, ROUGE-2,
ROUGE-3 and ROUGE-4 scores, as specified by the shared task organizers.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Both extractive and abstractive summarization are well-explored problems for English,
and many English datasets are available. PubMed[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], arXiv[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and CNN/Daily Mail[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] are
a few examples.
      </p>
      <p>
        Guo et al.[9] extended the T5[10] model to take long text as input and performed summarization
on the PubMed dataset. PRIMERA[11] is another model, built on Longformer[12], that
achieved state-of-the-art results on datasets such as the arXiv summarization data[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
MultiNews[13] and WCEP[14]. Hasan et al.[15] introduced XL-Sum, a multilingual dataset
comprising 44 languages. They experimented with the mT5[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] model to perform
abstractive summarization and reported results on that dataset.
      </p>
      <p>
        Aries et al.[16] performed multilingual and multi-document summarization by clustering
sentences into topics using a fuzzy clustering algorithm. They score each sentence based on
its topic coverage and then build the summary from the highest-scoring sentences. For
cross-lingual abstractive summarization, Ladhak et al.[17] proposed WikiLingua, a multilingual
dataset of article-summary pairs available in 18 different languages. They fine-tuned mBART[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] in
their experiments.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The main aim of the task is to generate summaries for article-headline pairs in 3 languages,
viz. English, Hindi and Gujarati. Although news articles and headlines have been used in a number
of earlier efforts in other languages, the current dataset presents a special problem of
code-mixing and script-mixing: even when an article is written in an Indian language, English
phrases are frequently used in news stories. We performed experiments using the IndicBART and mT5
models after some data analysis, and we found data augmentation to be a useful approach
for obtaining better results.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Description</title>
        <p>The dataset provided by the organizers covers 3 languages, each with 3 splits (train,
validation and test). Table 1 shows the article count for each split in each language.</p>
        <p>For the training phase, we were provided with the article id, the link to the article, the
heading, the summary and the article text. For the testing phase, we were given the article id and
the corresponding article text.</p>
        <p>No reference summaries were provided for the official validation set. For our own
experiments, we therefore held out a small subset of the training set as an in-house validation
set. We used this in-house set during development and, during the validation phase, evaluated our
models on the official validation set. Since submissions were limited to 3 per language, we chose
our 3 best-performing experiments for each language as our final submissions for the test
phase.</p>
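        <p>The in-house hold-out described above can be sketched as follows. The 10% validation fraction, the fixed seed and the function name split_train_validation are assumptions made for illustration; the text only states that a small subset of the training set was used.</p>
        <preformat>
```python
import random


def split_train_validation(examples, validation_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for validation."""
    if validation_fraction >= 1.0 or 0.0 >= validation_fraction:
        raise ValueError("validation_fraction must lie strictly between 0 and 1")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    num_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[num_validation:], shuffled[:num_validation]


# Hypothetical article ids standing in for the real training records.
article_ids = list(range(100))
train_ids, validation_ids = split_train_validation(article_ids)
print(len(train_ids), len(validation_ids))
```
        </preformat>
        <p>Fixing the seed means every experiment is scored against the same held-out articles, which keeps comparisons between runs meaningful.</p>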
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Models used</title>
        <p>This section describes the steps we followed and the different experiments we performed
on the dataset.</p>
        <p>For our experiments, we fine-tuned two models, IndicBART and mT5-small, which are
described below:
          <list list-type="order">
            <list-item>
              <p>IndicBART: A multilingual, sequence-to-sequence pre-trained model that focuses on eleven Indic languages and English. Its authors evaluated IndicBART on two NLG tasks, extreme summarization and neural machine translation (NMT), and showed that, despite being substantially smaller, IndicBART is competitive with large pre-trained models such as mBART50.</p>
            </list-item>
            <list-item>
              <p>mT5: The multilingual T5 model (mT5) was pre-trained on a new Common Crawl-based dataset covering 101 languages. The model design and training process of mT5 closely resemble those of T5.</p>
            </list-item>
          </list>
        </p>
        <p>Both models follow a 12-layer architecture (a 6-layer encoder and a 6-layer decoder).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Augmentation</title>
        <p>Apart from fine-tuning the models on the original dataset, we also performed data
augmentation and found that it significantly improved the results. We ran 2 augmentation
experiments: one that augmented the data to 3 times the size of the original dataset and another
that augmented it to 5 times. The performance of the models increased with the amount of
augmented data.</p>
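        <p>The sketch below only illustrates how a dataset can be grown to 3 or 5 times its original size. The text does not specify how the augmented examples were produced, so the sentence-shuffling transform used here is a hypothetical stand-in, as are the names augment_to_multiple and shuffle_sentences and the toy data.</p>
        <preformat>
```python
import random


def augment_to_multiple(pairs, factor, transform):
    """Return a dataset that is `factor` times the size of `pairs`.

    The original pairs are kept, and each extra copy is produced by calling
    transform(article, summary, round_index).
    """
    if 1 > factor:
        raise ValueError("factor must be at least 1")
    augmented = list(pairs)
    for round_index in range(1, factor):
        for article, summary in pairs:
            augmented.append(transform(article, summary, round_index))
    return augmented


def shuffle_sentences(article, summary, round_index):
    """Hypothetical transform: reorder article sentences, keep the summary."""
    sentences = [part.strip() for part in article.split(".") if part.strip()]
    random.Random(round_index).shuffle(sentences)
    return ". ".join(sentences) + ".", summary


# Toy article-summary pairs standing in for the real training data.
pairs = [("A b. C d. E f.", "s1"), ("G h. I j.", "s2")]
print(len(augment_to_multiple(pairs, 3, shuffle_sentences)))
print(len(augment_to_multiple(pairs, 5, shuffle_sentences)))
```
        </preformat>
        <p>Any other transform with the same signature could be passed in, so the size bookkeeping stays separate from the choice of augmentation technique.</p>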
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Training Configuration</title>
        <p>We used the HuggingFace API and PyTorch to fine-tune the models, with a learning rate of
2e-5 and maximum input and output sequence lengths of 1024 and 100 respectively. Depending on
the experiment, we fine-tuned for 5, 7 or 10 epochs.</p>
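        <p>The hyperparameters above can be collected in a small configuration object, sketched below in plain Python. The class FineTuneConfig and the helper build_configs are hypothetical names, the model names are labels rather than checkpoint identifiers, and pairing every model with every epoch count is an assumption: the text does not state which epoch count was used for which experiment.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters for one fine-tuning run."""

    model_name: str
    num_epochs: int
    learning_rate: float = 2e-5
    max_input_length: int = 1024
    max_output_length: int = 100


def build_configs(model_names, epoch_options=(5, 7, 10)):
    """Build one configuration per (model, epoch count) combination."""
    return [
        FineTuneConfig(model_name=name, num_epochs=epochs)
        for name in model_names
        for epochs in epoch_options
    ]


for config in build_configs(["IndicBART", "mT5-small"]):
    print(config)
```
        </preformat>
        <p>With the HuggingFace Trainer API, values such as learning_rate and num_epochs would typically be passed on as the learning_rate and num_train_epochs training arguments, while the length limits are applied when tokenizing inputs and targets.</p>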
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This section gives a detailed overview of the results of all experiments performed on
the validation set. Tables 2, 3 and 4 give the results of the various experiments on the
validation dataset for English, Hindi and Gujarati respectively, and Tables 5, 6 and 7 show
the final 3 test set results per language.</p>
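      <p>For readers unfamiliar with the metric, ROUGE-N measures n-gram overlap between a generated summary and a reference summary. The following is a minimal sketch, assuming whitespace tokenization and F1 reporting. It is not the organizers' scoring script, and the function names ngrams and rouge_n and the example sentences are illustrative.</p>
      <preformat>
```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n):
    """Compute ROUGE-N precision, recall and F1 with clipped n-gram counts."""
    candidate_counts = ngrams(candidate.split(), n)
    reference_counts = ngrams(reference.split(), n)
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(
        min(count, reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    candidate_total = sum(candidate_counts.values())
    reference_total = sum(reference_counts.values())
    precision = overlap / candidate_total if candidate_total else 0.0
    recall = overlap / reference_total if reference_total else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


candidate = "the cat sat on the mat"
reference = "the cat lay on the mat"
for n in (1, 2, 3, 4):
    print(n, rouge_n(candidate, reference, n))
```
      </preformat>
      <p>Scores generally drop as n grows because longer n-grams are less likely to match exactly, which is why ROUGE-3 and ROUGE-4 are much stricter than ROUGE-1.</p>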
      <sec id="sec-5-1">
        <title>5.1. Experiment Names</title>
        <p>This subsection defines the experiment names used in the result tables, together with
their details.</p>
        <sec id="sec-5-1-1">
          <title>5.1.1. English Experiments</title>
          <list list-type="order">
            <list-item>
              <p>da_en_mt5: mT5-small was fine-tuned with data augmentation to 3 times the original English data.</p>
            </list-item>
            <list-item>
              <p>da_en_ibart: IndicBART was fine-tuned with data augmentation to 3 times the original English data.</p>
            </list-item>
            <list-item>
              <p>da5_en_ibart: IndicBART was fine-tuned with data augmentation to 5 times the original English data.</p>
            </list-item>
            <list-item>
              <p>en_ibart: IndicBART was fine-tuned on the original English dataset.</p>
            </list-item>
            <list-item>
              <p>en_mt5: mT5-small was fine-tuned on the original English dataset.</p>
            </list-item>
          </list>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Hindi Experiments</title>
          <list list-type="order">
            <list-item>
              <p>da5_hi_ibart: IndicBART was fine-tuned with data augmentation to 5 times the original Hindi data.</p>
            </list-item>
            <list-item>
              <p>da_hi_ibart: IndicBART was fine-tuned with data augmentation to 3 times the original Hindi data.</p>
            </list-item>
            <list-item>
              <p>da_hi_mt5: mT5-small was fine-tuned with data augmentation to 3 times the original Hindi data.</p>
            </list-item>
            <list-item>
              <p>hi_ibart: IndicBART was fine-tuned on the original Hindi dataset.</p>
            </list-item>
            <list-item>
              <p>hi_mt5: mT5-small was fine-tuned on the original Hindi dataset.</p>
            </list-item>
          </list>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. Gujarati Experiments</title>
          <list list-type="order">
            <list-item>
              <p>gu_ibart: IndicBART was fine-tuned on the original Gujarati dataset.</p>
            </list-item>
            <list-item>
              <p>da_gu_ibart: IndicBART was fine-tuned with data augmentation to 3 times the original Gujarati data.</p>
            </list-item>
            <list-item>
              <p>da5_gu_ibart: IndicBART was fine-tuned with data augmentation to 5 times the original Gujarati data.</p>
            </list-item>
            <list-item>
              <p>gu_mt5: mT5-small was fine-tuned on the original Gujarati dataset.</p>
            </list-item>
          </list>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Validation set results</title>
        <p>Tables 2, 3 and 4 show the results of our experiments on the validation set.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Test set results</title>
        <p>Tables 5, 6 and 7 show the results of the top 3 experiments per language on the official test set.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Analysis</title>
        <p>The results above show that data augmentation is a useful step, as it significantly
improved results over the experiments without it. Comparing IndicBART and mT5, IndicBART
performed better than mT5 in most cases on the summarization task. Further improvement may be
possible with larger models such as mbart-large, mt5-base or mt5-large.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we presented our approach to summarizing Indian languages as part
of the Forum for Information Retrieval Evaluation (FIRE) 2022 shared task. We performed various
experiments with multilingual transformer-based models, IndicBART and mT5-small, and
achieved significant results. We placed 2nd for Hindi and Gujarati and 4th for English.
Due to computational constraints, we were not able to use larger models such as mbart-large
and mt5-base, which could have performed even better. We
hope this work will help future research in this direction.
      </p>
      <p>Canada, 2017, pp. 1073–1083. URL: https://aclanthology.org/P17-1099. doi:10.18653/v1/P17-1099.</p>
      <p>[9] M. Guo, J. Ainslie, D. Uthus, S. Ontanon, J. Ni, Y.-H. Sung, Y. Yang, LongT5: Efficient text-to-text transformer for long sequences, in: Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 724–736. URL: https://aclanthology.org/2022.findings-naacl.55. doi:10.18653/v1/2022.findings-naacl.55.</p>
      <p>[10] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.</p>
      <p>[11] W. Xiao, I. Beltagy, G. Carenini, A. Cohan, PRIMERA: Pyramid-based masked sentence pretraining for multi-document summarization, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5245–5263. URL: https://aclanthology.org/2022.acl-long.360. doi:10.18653/v1/2022.acl-long.360.</p>
      <p>[12] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).</p>
      <p>[13] A. Fabbri, I. Li, T. She, S. Li, D. Radev, Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1074–1084. URL: https://aclanthology.org/P19-1102. doi:10.18653/v1/P19-1102.</p>
      <p>[14] D. Gholipour Ghalandari, C. Hokamp, N. T. Pham, J. Glover, G. Ifrim, A large-scale multi-document summarization dataset from the Wikipedia current events portal, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 1302–1308. URL: https://aclanthology.org/2020.acl-main.120. doi:10.18653/v1/2020.acl-main.120.</p>
      <p>[15] T. Hasan, A. Bhattacharjee, M. S. Islam, K. Mubasshir, Y.-F. Li, Y.-B. Kang, M. S. Rahman, R. Shahriyar, XL-sum: Large-scale multilingual abstractive summarization for 44 languages, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 4693–4703. URL: https://aclanthology.org/2021.findings-acl.413. doi:10.18653/v1/2021.findings-acl.413.</p>
      <p>[16] A. Aries, D. E. Zegour, K. W. Hidouci, AllSummarizer system at MultiLing 2015: Multilingual single and multi-document summarization, in: Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Association for Computational Linguistics, Prague, Czech Republic, 2015, pp. 237–244. URL: https://aclanthology.org/W15-4634. doi:10.18653/v1/W15-4634.</p>
      <p>[17] F. Ladhak, E. Durmus, C. Cardie, K. McKeown, WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics, Online, 2020, pp. 4034–4048. URL: https://aclanthology.org/2020.findings-emnlp.360. doi:10.18653/v1/2020.findings-emnlp.360.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Multilingual denoising pre-training for neural machine translation</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>726</fpage>
          -
          <lpage>742</lpage>
          . URL: https://aclanthology.org/2020.tacl-1.47. doi:10.1162/tacl_a_00343.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>mT5: A massively multilingual pre-trained text-to-text transformer</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shrotriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunchukuttan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Puduppully</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khapra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>IndicBART: A pre-trained model for indic natural language generation</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL 2022</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>1849</fpage>
          -
          <lpage>1863</lpage>
          . URL: https://aclanthology.org/2022.findings-acl.145. doi:10.18653/v1/2022.findings-acl.145.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <article-title>Findings of the first shared task on indian language summarization (ilsum): Approaches, challenges and the path ahead</article-title>
          ,
          <source>in: Working Notes of FIRE 2022 - Forum for Information Retrieval Evaluation</source>
          , Kolkata, India, December 9-13, 2022, CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Satapara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Modha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <article-title>Fire 2022 ilsum track: Indian language summarization</article-title>
          ,
          <source>in: Proceedings of the 14th Forum for Information Retrieval Evaluation</source>
          ,
          <publisher-name>ACM</publisher-name>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Galileo Mark</given-names>
            <surname>Namata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ben</given-names>
            <surname>London</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Query-driven active surveying for collective classification</article-title>
          ,
          <source>in: International Workshop on Mining and Learning with Graphs</source>
          , Edinburgh, Scotland,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Clement</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bierbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>O'Keefe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Alemi</surname>
          </string-name>
          ,
          <article-title>On the use of arxiv as a dataset</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1905.00075. doi:10.48550/ARXIV.1905.00075.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>See</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>Get to the point: Summarization with pointer-generator networks</article-title>
          ,
          <source>in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Vancouver,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>